haskseg: Simple unsupervised segmentation model

[ bsd3, library, nlp, program ]

Implementation of the non-parametric segmentation model described in "Type-based MCMC" (Liang, Jordan, and Klein, 2010) and "A Bayesian framework for word segmentation: Exploring the effects of context" (Goldwater, Griffiths, and Johnson, 2009).

Modules

Text
- HaskSeg

Downloads

haskseg-0.1.0.3.tar.gz (Cabal source package)
Package description (as included in the package)

Maintainer's Corner

Package maintainers

TomLippincott

Versions	0.1.0.0, 0.1.0.1, 0.1.0.2, 0.1.0.3
Dependencies	ansi-terminal (>=0.8.0.4), array, base (>=4.7 && <5), bytestring (>=0.10.8.1), containers (>=0.5.10.2), exact-combinatorics (>=0.2.0.8), haskseg, logging-effect (>=1.3.2), monad-loops (>=0.4.3), MonadRandom (>=0.5.1.1), mtl (>=2.2.2), optparse-generic (>=1.2.2), random (>=1.1), random-shuffle (>=0.0.4), text (>=1.2.2), vector (>=0.12.0.1), zlib (>=0.6.1)
License	BSD-3-Clause
Copyright	2018 Tom Lippincott
Author	Tom Lippincott
Maintainer	tom@cs.jhu.edu
Category	NLP
Home page	https://github.com/TomLippincott/haskseg#README.md
Uploaded	by TomLippincott at 2019-09-26T21:30:02Z
Distributions	NixOS:0.1.0.3
Executables	haskseg
Downloads	1883 total (2 in the last 30 days)
Status	Docs available; last success reported on 2019-09-26

Readme for haskseg-0.1.0.3

haskseg

Compiling

First, install Stack somewhere on your PATH. For example, to install it into ~/.local/bin:

mkdir -p ~/.local/bin
wget https://get.haskellstack.org/stable/linux-x86_64.tar.gz -O -|tar xpfz - -C /tmp
cp /tmp/stack-*/stack ~/.local/bin
rm -rf /tmp/stack-*

Then, while in the directory of this README file, run:

stack build

The first build takes a while (10 to 15 minutes) because it sets up an entire Haskell environment from scratch. Subsequent compilations are very fast.

Running

Invoke the program using Stack. To see available sub-commands, run:

stack exec -- haskseg -h

To see detailed help for a particular sub-command, e.g. train, run:

stack exec -- haskseg train -h

To train on the included data set from Goldwater, Griffiths, and Johnson 2009, run:

time zcat data/br-old-gold.txt.gz | perl -pe '$_=~s/ /\<GOLD\>/g;' | stack exec -- haskseg train --stateFile model.gz --iterations 3 --goldString "<GOLD>"
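
To see what the substitution does to the text before it reaches haskseg, here is a minimal sketch that applies the same replacement to a made-up line. The helper name mark_gold and the sample text are illustrative assumptions, not part of the package, and sed stands in for Perl:

```shell
# Hypothetical helper (not part of haskseg): replace every space with the
# gold-boundary marker, mirroring the perl one-liner in the training command.
mark_gold() {
  sed 's/ /<GOLD>/g'
}

# Made-up sample line, not taken from the data set.
printf 'you want to see the book\n' | mark_gold
```

Each space becomes the literal string <GOLD>, which is the same value passed to --goldString.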

Note that in the training command the spaces are replaced by a special string that marks the gold boundaries on which the F-score is calculated (this is not necessary, but it is a nice way to track progress). The model seems to converge after three iterations on this data set. Training takes about three minutes and achieves an F-score of 0.61, somewhat higher than reported in Liang, Jordan, and Klein (2010); why is unclear.

To use the trained model to segment text, such as the training data set, run:

time zcat data/br-old-gold.txt.gz | perl -pe '$_=~s/ //g;' | stack exec -- haskseg segment --stateFile model.gz

Note that for this stage the spaces are simply removed, because otherwise the model treats whitespace as a static boundary (which may be what you want in other circumstances!). The model takes about 4 seconds to segment the 9790 "words", about 2500 per second, though this stage is embarrassingly parallel. The output is in BPE format, i.e. with "@@" indicating the end of an internal morph.
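
As a rough illustration of the output format, the sketch below post-processes BPE-style text with sed. It assumes each "@@" marker is followed by a single space, as in the common BPE convention; the helper names and the sample line are hypothetical, and the actual output of haskseg may differ in detail:

```shell
# Hypothetical helpers (not part of haskseg) for BPE-style output in which
# "@@" marks a morph that continues into the next token.

# Glue morphs back together: "walk@@ ed" -> "walked".
join_morphs() {
  sed 's/@@ //g'
}

# Keep the segmentation but drop the markers: "walk@@ ed" -> "walk ed".
strip_markers() {
  sed 's/@@ / /g'
}

# Made-up sample line, not real haskseg output.
printf 'the dog walk@@ ed home\n' | join_morphs
printf 'the dog walk@@ ed home\n' | strip_markers
```

The first helper recovers the unsegmented tokens, while the second leaves every morph as its own space-separated token, which can be convenient for downstream tools that do not understand the "@@" convention.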