html-parse: A high-performance HTML tokenizer
This package provides a fast and reasonably robust HTML5 tokenizer built
upon the attoparsec library. The parsing strategy is based upon the HTML5
parsing specification with few deviations.
For instance,
>>> parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
[TagOpen "div" [], TagOpen "h1" [Attr "class" "widget"], ContentText "Hello World", TagClose "h1", TagSelfClose "br" []]
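The token list is an ordinary Haskell list, so it can be consumed with plain pattern matching. The following is a minimal sketch, assuming `parseTokens`, `Token(..)` and `Attr(..)` are imported from `Text.HTML.Parser` and that tag and attribute names are strict `Text`. The helpers `textContent` and `attrValues` are hypothetical names introduced for illustration and are not part of the package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import           Data.Text        (Text)
import qualified Data.Text        as T
import           Text.HTML.Parser (Attr (..), Token (..), parseTokens)

-- | Concatenate all character data in a token stream, ignoring tags.
-- (Hypothetical helper, not provided by html-parse.)
textContent :: [Token] -> Text
textContent = T.concat . concatMap chunk
  where
    chunk (ContentText t) = [t]
    chunk (ContentChar c) = [T.singleton c]
    chunk _               = []

-- | Collect the values of attribute @name@ on every open or
-- self-closing tag called @tag@. (Hypothetical helper.)
attrValues :: Text -> Text -> [Token] -> [Text]
attrValues tag name toks =
  [ v
  | tok        <- toks
  , Just attrs <- [tagAttrs tok]
  , Attr k v   <- attrs
  , k == name
  ]
  where
    tagAttrs (TagOpen n as)      | n == tag = Just as
    tagAttrs (TagSelfClose n as) | n == tag = Just as
    tagAttrs _                              = Nothing

main :: IO ()
main = do
  let toks = parseTokens "<div><h1 class=widget>Hello World</h1><br/>"
  print (textContent toks)
  print (attrValues "h1" "class" toks)
```

Working on the flat token list keeps the example independent of any tree-building step, so it also copes with unbalanced markup.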
The package targets similar use-cases to the venerable tagsoup library,
but is significantly more efficient, achieving parsing speeds of over 80
megabytes per second on modern hardware and typical web documents.
Here are some typical performance numbers taken from parsing a Wikipedia
article of moderate length:
benchmarking Forced/tagsoup fast Text
time                 186.1 ms   (175.3 ms .. 194.6 ms)
                     0.999 R²   (0.995 R² .. 1.000 R²)
mean                 191.7 ms   (188.9 ms .. 198.3 ms)
std dev              5.053 ms   (1.092 ms .. 6.809 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/tagsoup normal Text
time                 189.7 ms   (182.8 ms .. 197.7 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 196.5 ms   (193.1 ms .. 202.1 ms)
std dev              5.481 ms   (2.141 ms .. 7.383 ms)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Forced/html-parser
time                 15.81 ms   (15.75 ms .. 15.89 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 15.72 ms   (15.66 ms .. 15.77 ms)
std dev              140.9 μs   (113.6 μs .. 174.5 μs)
Downloads
- html-parse-0.2.1.0.tar.gz (Cabal source package)
- Package description (as included in the package)
Versions | 0.1.0.0, 0.2.0.0, 0.2.0.1, 0.2.0.2, 0.2.1.0 |
---|---|
Change log | changelog.md |
Dependencies | attoparsec (>=0.13 && <0.15), base (>=4.7 && <4.20), containers (>=0.5 && <0.8), deepseq (>=1.3 && <1.6), html-parse, text (>=1.2 && <2.2) |
Tested with | ghc >=8.4 && <8.5, ghc >=8.6 && <8.7, ghc >=8.8 && <8.9, ghc >=8.10 && <8.11, ghc >=9.0 && <9.1, ghc >=9.2 && <9.3, ghc >=9.4 && <9.5 |
License | BSD-3-Clause |
Copyright | (c) 2016 Ben Gamari |
Author | Ben Gamari |
Maintainer | ben@smart-cactus.org |
Category | Text |
Home page | http://github.com/bgamari/html-parse |
Source repo | head: git clone git://github.com/bgamari/html-parse |
Uploaded | by BenGamari at 2023-12-10T15:13:09Z |
Distributions | Arch:0.2.1.0, LTSHaskell:0.2.1.0, NixOS:0.2.1.0 |
Reverse Dependencies | 3 direct, 0 indirect |
Executables | html-parse-length |
Downloads | 3898 total (22 in the last 30 days) |
Status | Docs available. Last success reported on 2023-12-10 |