{-# OPTIONS_GHC -O2 -Wall #-}
{-# OPTIONS_GHC -fno-warn-unused-imports #-}

-- | This library provides a way to train a model
-- that predicts the "randomness" of an input @'ByteString'@,
-- and two datatypes to facilitate this:
--
-- @'FreqTrain'@ is a datatype that can be constructed via
-- training functions that take @'ByteString'@s as input, and
-- can be used with the @'measure'@ function to gather an
-- estimate of that "randomness".
--
-- @'Freq'@ is a datatype that is constructed by calling the @'tabulate'@
-- function on a @'FreqTrain'@. A @'Freq'@ is meant solely for using the
-- trained model in practice (reading its "randomness" values): it trades
-- extensibility for a significant increase in speed. You can neither
-- modify a @'Freq'@ nor convert it back to a @'FreqTrain'@. In practice
-- this is not a problem, because training usually happens only once.
--
-- Laws:
--
-- @'measure' (f :: 'FreqTrain') b ≡ 'measure' ('tabulate' f) b@
--
--
-- Below is a simple illustration of how to use this library.
-- We are going to write a small command-line application that
-- trains on some data, and scores @'ByteString'@s according to how
-- random they are. We will say that a @'ByteString'@ is 'random'
-- if it scores less than 0.05 (on a scale of 0 to 1), and not random
-- otherwise.
--
-- First, a language extension (needed for the bang patterns
-- used in @main@ below) and some imports:
--
-- @
-- {-# LANGUAGE BangPatterns #-}
--
-- import Freq
-- import Control.Monad (forever)
--
-- import qualified Data.ByteString.Char8 as BC
-- @
--
-- Next, a list of @'FilePath'@s containing training data.
-- The training data here is the same as is provided in
-- the sample executable of this library. It consists solely
-- of books in the public domain.
--
-- @
-- trainTexts :: [FilePath]
-- trainTexts
--   = fmap (\x -> "txtdocs/" ++ x ++ ".txt")
--       -- ^
--       -- | this line just tells us that all
--       --   of the training data is in the 'txtdocs'
--       --   directory, and has a '.txt' file extension.
--
--       -- | These are the text files from which we wish to train.
--       -- v
--       [ "2000010"
--       , "2city10"
--       , "80day10"
--       , "alcott-little-261"
--       , "byron-don-315"
--       , "carol10"
--       , "center_earth"
--       , "defoe-robinson-103"
--       , "dracula"
--       , "freck10"
--       , "invisman"
--       , "kipling-jungle-148"
--       , "lesms10"
--       , "london-call-203"
--       , "london-sea-206"
--       , "longfellow-paul-210"
--       , "madambov"
--       , "monroe-d"
--       , "moon10"
--       , "ozland10"
--       , "plgrm10"
--       , "sawy210"
--       , "speckldb"
--       , "swift-modest-171"
--       , "time_machine"
--       , "war_peace"
--       , "white_fang"
--       , "zenda10"
--       ]
-- @
--
-- We are going to use a function provided by this library
-- called @'trainWithMany'@. Its type signature is:
--
-- @
-- trainWithMany
--   :: Foldable t
--   => t FilePath -- ^ FilePaths containing training data
--   -> IO FreqTrain -- ^ Frequency table generated as a result of training, inside of 'IO'
-- @
--
-- In other words, @'trainWithMany'@ takes a bunch of files,
-- trains a model with all of the training data contained therein,
-- and returns a @'FreqTrain'@ inside of @'IO'@.
--
-- And now, we get freaky:
--
-- @
-- -- | "passes" returns a message letting the user know whether
-- --   or not their input 'ByteString' was most likely random.
-- --   Recall that our threshold is 0.05 on a scale of 0 to 1.
-- passes :: Double -> String
-- passes x
--   | x < 0.05 = "Too random!"
--   | otherwise = "Looks good to me!"
--
-- main :: IO ()
-- main = do
--   !freak <- trainWithMany trainTexts
--      -- ^
--      -- | create the trained model
--      -- | Note that we do this strictly,
--      -- | so that the model is ready to
--      -- | go when we intuitively expect it
--      -- | to be.
--
--   let !freakTable = tabulate freak
--      -- ^
--      -- | optimise the trained model for
--      --   read access
--
--   putStrLn "Done loading frequencies."
--      -- ^
--      -- | let the user know that our model
--      --   is done training and has finished
--      --   optimising into a 'Freq'
--
--   forever $ do
--      -- ^
--      -- | make the following code loop forever
--
--     putStrLn "Enter text:"
--        -- ^
--        -- | ask the user for some text
--
--     !bs <- BC.getLine
--        -- ^
--        -- | bs is the input 'ByteString' to score
--
--     let !score = measure freakTable bs
--        -- ^
--        -- | score of the 'ByteString'!
--
--     putStrLn $ "Score: " ++ show score ++ "\n"
--       ++ passes score
--        -- ^
--        -- | print out what the score of the 'ByteString' was,
--        --   along with its 'passing status'.
-- @
--
-- This results in the following interactions, split up for readability:
--
-- >>> Done loading frequencies.
-- >>> Enter text:
-- >>> freq
-- >>> Score: 0.10314131395591991
-- >>> Looks good to me!
--
-- >>> Enter text:
-- >>> kjdslfkajdslkfjsd
-- >>> Score: 6.693203041828383e-3
-- >>> Too random!
--
-- >>> Enter text:
-- >>> William
-- >>> Score: 7.086442245879888e-2
-- >>> Looks good to me!
--
-- >>> Enter text:
-- >>> 8op3u92jf
-- >>> Score: 6.687182330334067e-3
-- >>> Too random!
--
-- As we can see, it rejects the keysmashed text as being too random,
-- while the human-readable text is A-OK. I actually made the threshold
-- of 0.05 too high - it should be somewhere between 0.01 and 0.03, but
-- even then the outcomes would have still been the same. The digram-based
-- approach that 'freq' uses may seem ridiculously naive, but it still
-- maintains a high degree of accuracy.
--
-- As an example of a real-world use case, I wrote 'freq' to use at my
-- workplace (I work at a Network Security company) as a way to score
-- domain names according to how random they are. Malicious users
-- frequently spin up fake domains named with strings of random characters.
-- This can also be used to score Windows executable names, since
-- those follow the same pattern of malicious naming.
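--
-- A minimal sketch of that use case, assuming the @freakTable@ built in
-- the example above and the same 0.05 threshold. The name @suspicious@
-- is hypothetical and is not part of this library:
--
-- @
-- -- | Hypothetical helper: keep only the names that score below the threshold.
-- suspicious :: Freq -> [BC.ByteString] -> [BC.ByteString]
-- suspicious table = filter ((< 0.05) . measure table)
-- @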
--
-- An obvious weakness of this library is that it suffers from what can
-- be referred to as the "xkcd problem". It can score things such as 'xkcd'
-- poorly, even though they are perfectly legitimate domains. The fix I use
-- is something like the Alexa top 1 million list of domains, along with
-- one or more HashMaps for whitelisting/blacklisting.
--
-- As a wise man once told me - "And then I freaked it."

module Freq
  ( -- * Frequency table builder (trainer)
    type FreqTrain
    -- * Construction
  , empty
  , singleton
    -- * Training
  , train
  , trainWith
  , trainWithMany
    -- * Using a trained model
  , tabulate
  , Freq
  , measure
  , prob
    -- * Pretty Printing
  , prettyFreqTrain
  ) where

import Data.ByteString (ByteString)
import Freq.Internal