unicode-data: Access Unicode Character Database (UCD)

[ apache, data, library, text, unicode ]

unicode-data provides Haskell APIs to efficiently access the Unicode character database (UCD). Performance is the primary goal in the design of this package.

The Haskell data structures are generated programmatically from the UCD files. The latest Unicode version supported by this library is 15.1.0.


Manual Flags


Use ICU for test and benchmark. Intended for development on the repository.


Use -f <flag> to enable a flag, or -f -<flag> to disable that flag.


Note: This package has metadata revisions in the cabal description newer than included in the tarball. To unpack the package including the revisions, use 'cabal get'.

Versions: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 0.5.0, 0.6.0
Change log: Changelog.md
Dependencies: base (>=4.7 && <4.22), ghc-prim
Tested with: ghc ==8.0.2, ghc ==8.2.2, ghc ==8.4.4, ghc ==8.6.5, ghc ==8.8.4, ghc ==8.10.7, ghc ==9.0.2, ghc ==9.2.8, ghc ==9.4.8, ghc ==9.6.5, ghc ==9.8.2, ghc ==9.10.1
License: Apache-2.0
Copyright: 2020 Composewell Technologies and Contributors
Author: Composewell Technologies and Contributors
Maintainer: streamly@composewell.com
Revised: Revision 2 made by adithyaov at 2024-10-26T18:30:11Z
Category: Data, Text, Unicode
Home page: http://github.com/composewell/unicode-data
Bug tracker: https://github.com/composewell/unicode-data/issues
Source repo: head: git clone https://github.com/composewell/unicode-data
Uploaded: by wismill at 2024-07-03T14:37:14Z
Distributions: Arch, Fedora 0.3.1, LTSHaskell 0.6.0, NixOS, Stackage 0.6.0, openSUSE
Reverse Dependencies: 6 direct, 229 indirect
Downloads: 14583 total (321 in the last 30 days)
Rating: 2.25 (votes: 2, estimated by Bayesian average)
Status: Docs available
Last success reported on 2024-07-03

Readme for unicode-data-0.6.0




Please see the Haddock documentation for reference documentation.
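
As a quick orientation before turning to the Haddock documentation, here is a minimal sketch of querying a few character properties. It assumes that the module Unicode.Char.General exports generalCategory, isAlphabetic and isWhiteSpace with the type Char -> result, as in recent releases; please check the Haddock documentation of the version you install, since names may differ between releases.

```haskell
-- Minimal sketch; assumes unicode-data is listed in build-depends and that
-- Unicode.Char.General exports the three functions used below.
import qualified Unicode.Char.General as G

main :: IO ()
main = do
  -- General category of a Latin small letter.
  print (G.generalCategory 'a')
  -- Alphabetic property of GREEK SMALL LETTER LAMDA (U+03BB).
  print (G.isAlphabetic '\x03BB')
  -- White_Space property of NO-BREAK SPACE (U+00A0).
  print (G.isWhiteSpace '\x00A0')
```

The qualified import keeps these names from clashing with similarly named functions in Data.Char from base.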


unicode-data is up to 5 times faster than base ≤ 4.17 (see the section "Partial integration of unicode-data into base").

The following benchmark compares the time taken in milliseconds to process all the Unicode code points (except surrogates, private use areas and unassigned), for base-4.16 (GHC 9.2.6) and this package (v0.4). Machine: 8 × AMD Ryzen 5 2500U on Linux.

      base:           OK (1.19s)
        17.1 ms ± 241 μs
      unicode-data:   OK (0.52s)
        3.58 ms ± 125 μs, 0.21x
      base:           OK (0.63s)
        17.5 ms ± 359 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  48 μs, 0.21x
      base:           OK (0.59s)
        16.3 ms ± 524 μs
      unicode-data:   OK (0.80s)
        5.63 ms ± 129 μs, 0.35x
      base:           OK (3.91s)
        14.9 ms ± 427 μs
      unicode-data:   OK (2.84s)
        5.31 ms ±  37 μs, 0.36x
      base:           OK (2.12s)
        15.4 ms ± 234 μs
      unicode-data:   OK (0.86s)
        5.80 ms ± 159 μs, 0.38x
      base:           OK (1.16s)
        16.6 ms ± 534 μs
      unicode-data:   OK (0.62s)
        4.14 ms ± 103 μs, 0.25x
      base:           OK (0.62s)
        17.1 ms ± 655 μs
      unicode-data:   OK (0.97s)
        3.59 ms ±  51 μs, 0.21x
      base:           OK (0.63s)
        17.6 ms ± 494 μs
      unicode-data:   OK (0.57s)
        3.59 ms ±  90 μs, 0.20x
      base:           OK (0.34s)
        17.6 ms ± 695 μs
      unicode-data:   OK (1.00s)
        3.59 ms ±  67 μs, 0.20x
      base:           OK (1.22s)
        17.7 ms ± 492 μs
      unicode-data:   OK (1.92s)
        3.56 ms ±  27 μs, 0.20x
      base:           OK (2.23s)
        16.6 ms ± 619 μs
      unicode-data:   OK (1.05s)
        3.60 ms ±  52 μs, 0.22x
      base:           OK (1.15s)
        16.6 ms ± 439 μs
      unicode-data:   OK (0.49s)
        3.60 ms ±  85 μs, 0.22x
      base:           OK (2.11s)
        16.1 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  62 μs, 0.22x
      base:           OK (0.58s)
        17.2 ms ± 502 μs
      unicode-data:   OK (1.02s)
        3.58 ms ±  50 μs, 0.21x
      base:           OK (8.57s)
        16.4 ms ± 553 μs
      unicode-data:   OK (1.05s)
        3.58 ms ±  79 μs, 0.22x
      base:           OK (1.09s)
        7.56 ms ± 159 μs
      unicode-data:   OK (0.97s)
        3.58 ms ±  46 μs, 0.47x
      base:           OK (0.58s)
        15.7 ms ± 462 μs
      unicode-data:   OK (0.58s)
        3.58 ms ± 107 μs, 0.23x
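
The set of code points described for the benchmark above (everything except surrogates, private use areas and unassigned code points) can be sketched with base alone. This is only an illustration of the selection, not the project's benchmark code; the names isBenchmarked and benchmarkedCodePoints are hypothetical, and the result depends on the Unicode version that your base supports.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical helper: True for code points that are neither surrogates,
-- private use, nor unassigned, according to the Unicode data in base.
isBenchmarked :: Char -> Bool
isBenchmarked c = case generalCategory c of
  Surrogate   -> False
  PrivateUse  -> False
  NotAssigned -> False
  _           -> True

-- | Hypothetical helper: every code point that passes the filter above.
benchmarkedCodePoints :: [Char]
benchmarkedCodePoints = filter isBenchmarked [minBound .. maxBound]

main :: IO ()
main =
  putStrLn ("Code points selected: " ++ show (length benchmarkedCodePoints))
```

A benchmark would then apply one predicate (from base or from unicode-data) to every element of this list and time the traversal.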

Partial integration of unicode-data into base

Since base 4.18, unicode-data has been partially integrated into GHC, so there should be no relevant performance difference. However, using unicode-data lets you select the exact version of Unicode to support, instead of relying on the version supported by GHC.
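
To see which Unicode version your GHC supports, and therefore whether it differs from the version provided by this package, you can print the version recorded in base. This is a small sketch that assumes a base recent enough to export unicodeVersion from GHC.Unicode (to our knowledge, base 4.15 or later).

```haskell
import Data.Version (showVersion)
import GHC.Unicode (unicodeVersion)

-- Print the Unicode version that the installed base was generated from.
main :: IO ()
main =
  putStrLn ("Unicode version used by base: " ++ showVersion unicodeVersion)
```

If the printed version is older than the one you need, depending on a suitable release of unicode-data pins the Unicode version independently of the compiler.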

Unicode database version update

To update the Unicode version, change the version number in ucd.sh.

To download the Unicode database, run ucd.sh download from the top-level directory of the repo. This fetches the database into ./ucd.

$ ./ucd.sh download

To generate the Haskell data structure files from the downloaded database files, run ucd.sh generate from the top-level directory of the repo.

$ ./ucd.sh generate

Running property doctests

Temporarily add QuickCheck to the build-depends of the library, then run:

$ cabal build
$ cabal-docspec --check-properties --property-variables c


unicode-data is an open source project available under a liberal Apache-2.0 license.


As an open project, we welcome contributions.