regex-tdfa: Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)

This is a package candidate release! Here you can preview how this package release will appear once published to the main package index (which can be accomplished via the 'maintain' link below). Please note that once a package has been published to the main package index it cannot be undone! Please consult the package uploading documentation for more information.

[maintain] [Publish]

Warnings:

'ghc-options: -O2' is rarely needed. Check that it is giving a real benefit and not just imposing longer compile times on your users.

This package provides a pure Haskell "Tagged" DFA regex engine for regex-base. This implementation was inspired by the algorithm (and Master's thesis) behind the regular expression library known as TRE or libtre.

Please consult the Text.Regex.TDFA module for API documentation including a tutorial with usage examples; see also https://wiki.haskell.org/Regular_expressions for general information about regular expression support in Haskell.

[Skip to Readme]

Properties

Versions	0.92, 0.94, 0.95.1, 0.95.2, 0.97.1, 0.97.3, 0.97.4, 1.0.0, 1.1.0, 1.1.1, 1.1.2, 1.1.3, 1.1.4, 1.1.6, 1.1.7, 1.1.8, 1.2.0, 1.2.1, 1.2.2, 1.2.3, 1.2.3.1, 1.2.3.2, 1.2.3.3, 1.3.0, 1.3.1.0, 1.3.1.1, 1.3.1.2, 1.3.1.3, 1.3.1.3, 1.3.1.4, 1.3.1.5, 1.3.2, 1.3.2.1, 1.3.2.2, 1.3.2.3
Change log	CHANGELOG.md
Dependencies	array (>=0.4 && <0.6), base (>=4.5 && <5), bytestring (>=0.9.2 && <0.12), containers (>=0.4.2 && <0.7), fail (>=4.9 && <4.10), mtl (>=2.1.3 && <2.4), parsec (>=3.1 && <3.2), regex-base (>=0.94 && <0.95), semigroups (>=0.18 && <0.20), text (>=1.2.3 && <2.1) [details]
License	BSD-3-Clause
Copyright	Copyright (c) 2007-2009, Christopher Kuklewicz
Author	Christopher Kuklewicz
Maintainer	Andreas Abel
Category	Text
Home page	https://wiki.haskell.org/Regular_expressions
Bug tracker	https://github.com/haskell-hvr/regex-tdfa/issues
Source repo	head: git clone https://github.com/haskell-hvr/regex-tdfa.git this: git clone https://github.com/haskell-hvr/regex-tdfa.git(tag v1.3.1.3)
Uploaded	by AndreasAbel at 2022-07-15T12:01:57Z

Modules

[Index] [Quick Jump]

Flags

Manual Flags

Name Description Default

Name	Description	Default
force-o2	Force building `regex-tdfa` with "`ghc-options: -O2`". NOTE: This flag is mostly provided for legacy use-cases. Nowadays you can conveniently control optimization levels on a per-package granularity via `cabal.project` files; see cabal's user-guide for more details.	Disabled

force-o2

Force building regex-tdfa with "ghc-options: -O2".

NOTE: This flag is mostly provided for legacy use-cases. Nowadays you can conveniently control optimization levels on a per-package granularity via cabal.project files; see cabal's user-guide for more details.

Disabled

Use -f <flag> to enable a flag, or -f -<flag> to disable that flag. More info

Downloads

regex-tdfa-1.3.1.3.tar.gz [browse] (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

AndreasAbel, AudreyTang, ChrisKuklewicz, HerbertValerioRiedel, phadej, topos

For package maintainers and hackage trustees

edit package information

Readme for regex-tdfa-1.3.1.3

[back to package description]

regex-tdfa

This is regex-tdfa which is a pure Haskell regular expression library (for POSIX extended regular expressions) originally written by Christopher Kuklewicz.

The name "tdfa" stands for Tagged-DFA.

Getting started

Importing and using

Declare a dependency on the regex-tdfa library in your .cabal file:

build-depends: regex-tdfa ^>= 1.3.1

In Haskell modules where you need to use regexes import the respective regex-tdfa module:

import Text.Regex.TDFA

Basics

λ> emailRegex = "[a-zA-Z0-9+._-]+@[a-zA-Z-]+\\.[a-z]+"
λ> "my email is email@email.com" =~ emailRegex :: Bool
>>> True

-- non-monadic
<to-match-against> =~ <regex>

-- monadic, uses 'fail' on lack of match
<to-match-against> =~~ <regex>

(=~) and (=~~) are polymorphic in their return type. This is so that regex-tdfa can pick the most efficient way to give you your result based on what you need. For instance, if all you want is to check whether the regex matched or not, there's no need to allocate a result string. If you only want the first match, rather than all the matches, then the matching engine can stop after finding a single hit.

This does mean, though, that you may sometimes have to explicitly specify the type you want, especially if you're trying things out at the REPL.

Common use cases

Get the first match

-- returns empty string if no match
a =~ b :: String  -- or ByteString, or Text...

λ> "alexis-de-tocqueville" =~ "[a-z]+" :: String
>>> "alexis"

λ> "alexis-de-tocqueville" =~ "[[:digit:]]+" :: String
>>> ""

Check if it matched at all

a =~ b :: Bool

λ> "alexis-de-tocqueville" =~ "[a-z]+" :: Bool
>>> True

Get first match + text before/after

-- if no match, will just return whole
-- string in the first element of the tuple
a =~ b :: (String, String, String)

λ> "alexis-de-tocqueville" =~ "de" :: (String, String, String)
>>> ("alexis-", "de", "-tocqueville")

λ> "alexis-de-tocqueville" =~ "kant" :: (String, String, String)
>>> ("alexis-de-tocqueville", "", "")

Get first match + submatches

-- same as above, but also returns a list of /just/ submatches
-- submatch list is empty if regex doesn't match at all
a =~ b :: (String, String, String, [String])

λ> "div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]"
     :: (String, String, String, [String])
>>> ("", "div[attr=1234]", "", ["attr","1234"])

Get all non-overlapping matches

-- can also return Data.Array instead of List
getAllTextMatches (a =~ b) :: [String]

λ> getAllTextMatches ("john anne yifan" =~ "[a-z]+") :: [String]
>>> ["john","anne","yifan"]

λ> getAllTextMatches ("0a0b0" =~ "0[[:lower:]]0") :: [String]
>>> ["0a0"]

Note that "0b0" is not included in the result since it overlaps with "0a0".

Special characters

regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.

\` — Match start of entire text (similar to ^ in other regex engines)
\' — Match end of entire text (similar to $ in other regex engines)
\< — Match beginning of word
\> — Match end of word
\b — Match beginning or end of word
\B — Match neither beginning nor end of word

While shorthands like \d (for digit) are not recognized, one can use the respective POSIX character class inside [...]. E.g., [[:digit:][:lower:]_] is short for [0-9a-z_]. The supported character classes are listed on Wikipedia and defined in module TNFA.

Please also consult a variant of this documentation which is part of the Text.Regex.TDFA haddock, and the original documentation at the Haskell wiki.

Less common stuff

Get match indices

-- can also return Data.Array instead of List
getAllMatches (a =~ b) :: [(Int, Int)]  -- (index, length)

λ> getAllMatches ("john anne yifan" =~ "[a-z]+") :: [(Int, Int)]
>>> [(0,4), (5,4), (10,5)]

Get submatch indices

-- match of __entire__ regex is first element, not first capture
-- can also return Data.Array instead of List
getAllSubmatches (a =~ b) :: [(Int, Int)]  -- (index, length)

λ> getAllSubmatches ("div[attr=1234]" =~ "div\\[([a-z]+)=([^]]+)\\]")
     :: [(Int, Int)]
>>> [(0,14), (4,4), (9,4)]

Replacement

regex-tdfa does not provide find-and-replace.

Avoiding backslashes

If you find yourself writing a lot of regexes, take a look at raw-strings-qq. It'll let you write regexes without needing to escape all your backslashes.

{-# LANGUAGE QuasiQuotes #-}

import Text.RawString.QQ
import Text.Regex.TDFA

λ> "2 * (3 + 1) / 4" =~ [r|\([^)]+\)|] :: String
>>> "(3 + 1)"

Known bugs and infelicities

Regexes with large character classes combined with {m,n} are very slow and memory-hungry (#3).

An example of such a regex is ^[\x0020-\xD7FF]{1,255}$.
POSIX submatch semantics are broken in some rare cases (#2).

About this package

This was inspired by the algorithm (and Master's thesis) behind the regular expression library known as TRE or libtre. This was created by Ville Laurikari and tackled the difficult issue of efficient sub-match capture for POSIX regular expressions.

By building on this thesis and adding a few more optimizations, regex-tdfa matching input text of length N should have O(N) runtime, and should have a maximum memory bounded by the pattern size that does not scale with N. It should do this while returning well defined (and correct) values for the parenthesized sub-matches.

Regardless of performance, nearly every single OS and Libra for POSIX regular expressions has bugs in sub-matches. This was detailed on the Regex POSIX Haskell wiki page, and can be demonstrated with the regex-posix-unittest suite of checks. Test regex-tdfa-unittest should show regex-tdfa passing these same checks. I owe my understanding of the correct behvior and many of these unit tests to Glenn Fowler at AT&T ("An Interpretation of the POSIX regex Standard").

Maintainance history

The original Darcs repository was at code.haskell.org. For a while a fork was maintained by Roman Cheplyaka as regex-tdfa-rc.

Then the repository moved to https://github.com/ChrisKuklewicz/regex-tdfa, which was primarily maintained by Artyom (neongreen).

Finally, maintainership was passed on again and the repository moved to its current location at https://github.com/haskell-hvr/regex-tdfa.

Searching for "tdfa" on hackage finds some related packages (unmaintained as of 2022-07-14).

Document notes

This README was originally written 2016-04-30.

regex-tdfa: Pure Haskell Tagged DFA Backend for "Text.Regex" (regex-base)

Properties

Modules

Flags

Manual Flags

Downloads

Maintainer's Corner

Readme for regex-tdfa-1.3.1.3

regex-tdfa

Getting started

Importing and using

Basics

Common use cases

Get the first match

Check if it matched at all

Get first match + text before/after

Get first match + submatches

Get all non-overlapping matches

Special characters

Less common stuff

Get match indices

Get submatch indices

Replacement

Avoiding backslashes

Known bugs and infelicities

About this package

Maintainance history

Other related packages

Document notes