tiktoken: Haskell implementation of tiktoken

[ bsd3, library, unclassified ]

This package only implements tokenization. In other words, given an existing encoding (like cl100k_base) you can tokenize an input.

Modules

Tiktoken

Downloads

tiktoken-1.0.3.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

GabrielGonzalez

Candidates

1.0.0, 1.0.2

Versions [RSS]	1.0.0, 1.0.1, 1.0.2, 1.0.3
Change log	CHANGELOG.md
Dependencies	base (>=4.15.0.0 && <5), base64 (>=1.0 && <1.1), bytestring (>=0.11.3.0), containers (>=0.5.0.0), deepseq (>=1.4.0.0), filepath, megaparsec (<9.7), pcre-light (>=0.2), raw-strings-qq, text, unordered-containers [details]
License	BSD-3-Clause
Author	Gabriella Gonzalez
Maintainer	GenuineGabriella@gmail.com
Uploaded	by GabrielGonzalez at 2024-09-02T21:19:08Z
Distributions
Downloads	135 total (10 in the last 30 days)
Rating	(no votes yet) [estimated by Bayesian average]
Status	Docs available [build log] Last success reported on 2024-09-02 [all 1 reports]

Readme for tiktoken-1.0.3

`tiktoken`

This is a Haskell implementation of tiktoken, but just the tokenization logic. In other words, given an existing encoding (like cl100k_base) you can tokenize a string (into smaller strings or token ranks).

This means that you can't (yet) use this package to create your own new encodings, but you can use it to consume encodings. In particular, this comes in handy for prompt engineering where you want to use as much of the available prompt tokens as possible (which requires accurately counting tokens).
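
As a rough illustration of the prompt-budgeting use case, here is a minimal sketch that keeps as many leading chunks of text as fit within a token budget. Everything named here is hypothetical and not part of this package's API: `TokenCounter` is a placeholder for whatever function counts tokens, `wordCounter` is a crude stand-in that counts whitespace-separated words rather than real tokens, and `fitToBudget` is an illustrative helper.

```haskell
-- | Hypothetical: any function that reports how many tokens a string uses.
-- In real use this would be backed by an actual tokenizer.
type TokenCounter = String -> Int

-- | Crude stand-in for illustration only: counts whitespace-separated
-- words, NOT real tokens under any encoding.
wordCounter :: TokenCounter
wordCounter = length . words

-- | Keep the longest run of leading chunks whose per-chunk token counts
-- sum to at most the budget. Stops at the first chunk that does not fit.
fitToBudget :: TokenCounter -> Int -> [String] -> [String]
fitToBudget count = go
  where
    go _ [] = []
    go remaining (chunk : rest)
      | cost <= remaining = chunk : go (remaining - cost) rest
      | otherwise = []
      where
        cost = count chunk

main :: IO ()
main = mapM_ putStrLn (fitToBudget wordCounter 5 ["a b", "c d e", "f"])
```

Note that summing per-chunk counts is only an approximation: tokenizing the concatenated text may give a slightly different total, so a careful caller would count the final assembled prompt once more before sending it.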

Encoding speed is ≈2.6-3.1 MB/s on an M1 MacBook Pro (using only one core since this package does not yet support parallel tokenization):

All
  Encode 10 MB of Wikipedia
    r50k_base:   OK (23.88s)
      3.356 s ± 151 ms
    p50k_base:   OK (10.39s)
      3.445 s ±  31 ms
    p50k_edit:   OK (11.13s)
      3.693 s ± 240 ms
    cl100k_base: OK (11.16s)
      3.685 s ± 143 ms
    o200k_base:  OK (11.01s)
      3.648 s ± 134 ms
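
To relate these timings to the throughput figure quoted above, the following sketch divides the 10 MB input size by each mean time from the benchmark output. The names `benchmarkSeconds`, `inputMegabytes`, and `throughput` are made up for this illustration; only the numbers come from the output above.

```haskell
import Numeric (showFFloat)

-- | Mean encoding times in seconds, copied from the benchmark output.
benchmarkSeconds :: [(String, Double)]
benchmarkSeconds =
  [ ("r50k_base", 3.356)
  , ("p50k_base", 3.445)
  , ("p50k_edit", 3.693)
  , ("cl100k_base", 3.685)
  , ("o200k_base", 3.648)
  ]

-- | Size of the benchmark input, per the "Encode 10 MB of Wikipedia" label.
inputMegabytes :: Double
inputMegabytes = 10

-- | Throughput in MB/s for a given mean time in seconds.
throughput :: Double -> Double
throughput seconds = inputMegabytes / seconds

main :: IO ()
main = mapM_ report benchmarkSeconds
  where
    report (name, seconds) =
      putStrLn (name ++ ": " ++ showFFloat (Just 2) (throughput seconds) " MB/s")
```

The mean times alone work out to roughly 2.7 to 3.0 MB/s; taking the reported ± deviations into account widens that to about the ≈2.6-3.1 MB/s range stated above.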