tiktoken-1.0.2: Haskell implementation of tiktoken
Safe HaskellSafe-Inferred
LanguageHaskell2010

Tiktoken

Description

You can use this module to convert back and forth between a ByteString and its corresponding tokens using an existing encoding like cl100k_base or o200k_base

Example usage:

{-# LANGUAGE OverloadedStrings #-}

import Tiktoken (o200k_base, toTokens, toRanks)

main :: IO ()
main = do
    -- Just ["El"," perro"," come"," las"," man","z","anas"]
    print (toTokens o200k_base "El perro come las manzanas")

    -- Just [4422,96439,3063,1996,873,89,14457]
    print (toRanks o200k_base "El perro come las manzanas")
Synopsis

Encoding

data Encoding Source #

This is an efficient internal representation of an encoding like cl100k_base, p50k_edit, or o200k_base

Instances

Instances details
Generic Encoding Source # 
Instance details

Defined in Tiktoken

Associated Types

type Rep Encoding :: Type -> Type #

Methods

from :: Encoding -> Rep Encoding x #

to :: Rep Encoding x -> Encoding #

NFData Encoding Source # 
Instance details

Defined in Tiktoken

Methods

rnf :: Encoding -> () #

type Rep Encoding Source # 
Instance details

Defined in Tiktoken

tiktokenToEncoding Source #

Arguments

:: ByteString

Regular expression used for coarse-grained splitting of the input

-> Text

The contents of the .tiktoken file

-> Either (ParseErrorBundle Text Void) Encoding 

Parse an encoding from the .tiktoken file format

addSpecialTokens :: Map ByteString Int -> Encoding -> Encoding Source #

Add special tokens to a base Encoding

Stock Encodings

Tokenization

toTokens :: Encoding -> ByteString -> Maybe [ByteString] Source #

Use an Encoding to tokenize a ByteString into smaller ByteStrings

This only fails if you provide an Encoding that cannot rank all possible 1-byte sequences

toRanks :: Encoding -> ByteString -> Maybe [Int] Source #

Use an Encoding to tokenize a ByteString into ranks

This only fails if you provide an Encoding that cannot rank all possible 1-byte sequences

toTokensAndRanks :: Encoding -> ByteString -> Maybe [(Int, ByteString)] Source #

Use an Encoding to tokenize a ByteString into smaller ByteStrings and their associated ranks

This only fails if you provide an Encoding that cannot rank all possible 1-byte sequences

Detokenization

fromTokens :: [ByteString] -> ByteString Source #

Combine a sequence of ByteString tokens back into a ByteString

This is just a synonym for Data.ByteString.concat (no Encoding necessary), provided solely for consistency/convenience.

fromRanks :: Encoding -> [Int] -> Maybe ByteString Source #

Convert a sequence of ranks back into a ByteString

This will fail if you supply any ranks which are not recognized by the Encoding.