# NGrams

This is a code base for experimenting with various approaches to n-gram-based text modeling. To get started, run:

```bash
stack build
stack install
```

This will build and install the library and binary commands. Generally, the commands expect data to be text files where each line has the tab-separated format:

```
${id}<TAB>${label}<TAB>${text}
```

When a model is applied to data, the output will generally have a header with the format:

```
ID<TAB>GOLD<TAB>${label_1_name}<TAB>${label_2_name}<TAB>...
```

and lines with the corresponding format:

```
${doc_id}<TAB>${gold_label_name}<TAB>${label_1_prob}<TAB>${label_2_prob}<TAB>...
```

where probabilities are represented as natural logarithms. The remainder of this document describes the implemented models, most of which have a corresponding command that *stack* will have installed. The library aims to be parametric over the sequence type, and most commands allow users to specify whether to consider bytes, Unicode characters, or whitespace-delimited tokens.

## Prediction by Partial Matching

PPM is essentially an n-gram model with a particular backoff logic that can't quite be reduced to more widespread approaches to smoothing, but that empirically tends to outperform them on short documents (a rough sketch of the backoff appears at the end of this section). To create a PPM model, run:

```bash
sh> ppm train --train train.txt --dev dev.txt --n 4 --modelFile model.gz
Dev accuracy: 0.8566666666666667
```

The model can then be applied to new data:

```bash
sh> ppm apply --test test.txt --modelFile model.gz --n 4 --scoresFile scores.txt
```

The value of `--n` can also be less than the order the model was trained with; this runs a bit faster and may be less closely tuned to the original training data.
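As an illustration of how a scores file might be consumed downstream, here is a minimal Haskell sketch that picks the most probable label for each document. It assumes the tab-separated layout described above; the file name, `splitOn`, and `predict` are purely illustrative and not part of the library.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Read a (hypothetical) scores file and print the argmax label per document.
main :: IO ()
main = do
  contents <- readFile "scores.txt"
  let (header : rows) = lines contents
      labels          = drop 2 (splitOn '\t' header)   -- skip the ID and GOLD columns
  mapM_ (putStrLn . predict labels . splitOn '\t') rows

-- Split a line on a delimiter character.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (x, [])       -> [x]
  (x, _ : rest) -> x : splitOn c rest

-- Report "<doc id> <gold label> <predicted label>"; since the scores are
-- natural logs, the argmax is unchanged (apply exp to recover probabilities).
predict :: [String] -> [String] -> String
predict labels (docId : gold : probs) =
  let scored = zip labels (map read probs :: [Double])
      best   = fst (maximumBy (comparing snd) scored)
  in unwords [docId, gold, best]
predict _ _ = error "malformed scores row"
```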
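Finally, for intuition about the backoff logic mentioned at the start of this section, the following sketch computes a PPM-style probability with escape to successively shorter contexts (escape method C). The `Counts` type, `ppmProb`, and the uniform byte-alphabet fallback are illustrative assumptions, not this library's actual representation, which may differ in detail.

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical counts table: context (oldest symbol first) -> next symbol -> count.
type Counts a = M.Map [a] (M.Map a Int)

-- PPM-style probability of `sym` following `context`, escaping to shorter
-- contexts (escape method C) when the symbol is unseen.
ppmProb :: Ord a => Counts a -> [a] -> a -> Double
ppmProb counts context sym = go context
  where
    go ctx = case M.lookup ctx counts of
      Nothing
        | null ctx  -> uniform                   -- nothing observed at all
        | otherwise -> go (tail ctx)              -- unseen context: drop the oldest symbol
      Just seen ->
        let total    = fromIntegral (sum (M.elems seen))
            distinct = fromIntegral (M.size seen)
            escape   = distinct / (total + distinct)
        in case M.lookup sym seen of
             Just c -> fromIntegral c / (total + distinct)
             Nothing
               | null ctx  -> escape * uniform
               | otherwise -> escape * go (tail ctx)
    uniform = 1 / 256                              -- e.g. a byte alphabet; purely illustrative
```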