Safe Haskell | None |
---|---|
Language | Haskell2010 |
WARC (or Web ARChive) is an archival file format widely used to distribute corpora of crawled web content (see, for instance, the Common Crawl corpus). A WARC file consists of a set of records, each of which describes a web request or response.
This module provides a streaming parser and encoder for WARC archives for use with the pipes package.
- type Warc m a = FreeT (Record m) m (Producer ByteString m a)
- data Record m r = Record { recHeader :: RecordHeader, recContent :: Producer ByteString m r }
- parseWarc :: (Functor m, Monad m) => Producer ByteString m a -> Warc m a
- iterRecords :: forall m a. Monad m => (forall b. Record m b -> m b) -> Warc m a -> m (Producer ByteString m a)
- produceRecords :: forall m o a. Monad m => (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) -> Warc m a -> Producer o m (Producer ByteString m a)
- encodeRecord :: Monad m => Record m a -> Producer ByteString m a
- module Data.Warc.Header
Documentation
type Warc m a = FreeT (Record m) m (Producer ByteString m a)
A WARC archive.
This represents a sequence of records followed by whatever data was left over from the parse.
data Record m r
A WARC record.
This represents a single record of a WARC file, consisting of a set of headers and a means of producing the record's body.
Record | |
---|---|
recHeader :: RecordHeader | |
recContent :: Producer ByteString m r | |
Parsing
parseWarc | |
---|---|
:: (Functor m, Monad m) | |
=> Producer ByteString m a | a producer of a stream of WARC content |
-> Warc m a | the parsed WARC archive |
Parse a WARC archive.
Note that this function does not actually do any parsing itself; it merely returns a Warc value which can then be run to parse individual records.
iterRecords | |
---|---|
:: Monad m | |
=> (forall b. Record m b -> m b) | the action to run on each record |
-> Warc m a | the WARC archive to iterate over |
-> m (Producer ByteString m a) | returns any leftover data |
Iterate over the Records in a WARC archive.
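As a rough sketch of how parseWarc and iterRecords fit together, the program below counts the records in an archive while discarding their bodies. The file name `example.warc` and the helpers `drainBody` and `countAndSkip` are hypothetical names introduced for illustration. The sketch also assumes the pipes-bytestring package for `Pipes.ByteString.fromHandle`. Each record's body is drained completely because the callback must return the body producer's final result.

```haskell
import Data.ByteString (ByteString)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Pipes (Producer, runEffect, (>->))
import qualified Pipes.ByteString as PB
import qualified Pipes.Prelude as P
import System.IO (IOMode (ReadMode), withBinaryFile)

import Data.Warc (Record (..), iterRecords, parseWarc)

-- Discard every chunk of a record body and return the producer's final result.
drainBody :: Monad m => Producer ByteString m b -> m b
drainBody body = runEffect (body >-> P.drain)

-- Per-record action (hypothetical helper): bump the counter, then skip the body.
countAndSkip :: IORef Int -> Record IO b -> IO b
countAndSkip counter r = do
    modifyIORef' counter (+ 1)
    drainBody (recContent r)

main :: IO ()
main = withBinaryFile "example.warc" ReadMode $ \h -> do  -- hypothetical path
    counter <- newIORef 0
    -- parseWarc only builds the Warc value; iterRecords drives the actual parse.
    _leftover <- iterRecords (countAndSkip counter) (parseWarc (PB.fromHandle h))
    n <- readIORef counter
    putStrLn ("records: " ++ show n)
```

The value bound to `_leftover` is the producer of any input that remained after the last record; this sketch ignores it.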
produceRecords | |
---|---|
:: Monad m | |
=> (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b) | consume the record, producing some output |
-> Warc m a | a WARC archive |
-> Producer o m (Producer ByteString m a) | returns any leftover data |
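One possible use of produceRecords, sketched under assumptions: the callback below ignores the header and emits the size of each body chunk, and the resulting stream is summed. The names `chunkSizes` and `totalBodyBytes` are hypothetical and not part of this module. The sketch also assumes that `RecordHeader` can be imported from Data.Warc through the re-export of Data.Warc.Header.

```haskell
import Control.Monad (void)
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Pipes (Producer, for, yield)
import qualified Pipes.Prelude as P

import Data.Warc (RecordHeader, Warc, produceRecords)

-- Hypothetical callback: ignore the header and emit each body chunk's size.
-- The body producer's final result is passed through unchanged.
chunkSizes :: Monad m => RecordHeader -> Producer ByteString m b -> Producer Int m b
chunkSizes _hdr body = for body (yield . BS.length)

-- Total number of body bytes across all records; leftover input is discarded.
totalBodyBytes :: Monad m => Warc m a -> m Int
totalBodyBytes warc = P.sum (void (produceRecords chunkSizes warc))
```

Given a handle `h` opened for reading, `totalBodyBytes (parseWarc (Pipes.ByteString.fromHandle h))` would run the whole parse in IO.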
Encoding
encodeRecord :: Monad m => Record m a -> Producer ByteString m a
Encode a Record in WARC format.
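A minimal sketch of combining encodeRecord with produceRecords to copy an archive record by record. The file names and the `reencode` helper are hypothetical, and pipes-bytestring is assumed for the handle functions. Nothing here guarantees that the output is byte-identical to the input.

```haskell
import Data.ByteString (ByteString)
import Pipes (Producer, runEffect, (>->))
import qualified Pipes.ByteString as PB
import System.IO (IOMode (ReadMode, WriteMode), withBinaryFile)

import Data.Warc (Record (..), RecordHeader, Warc, encodeRecord, parseWarc, produceRecords)

-- Hypothetical helper: re-encode every record of an archive as a byte stream.
reencode :: Monad m => Warc m a -> Producer ByteString m (Producer ByteString m a)
reencode warc = produceRecords step warc
  where
    -- Rebuild a Record from its header and body, then encode it.
    step :: Monad n => RecordHeader -> Producer ByteString n b -> Producer ByteString n b
    step hdr body = encodeRecord (Record hdr body)

main :: IO ()
main =
    withBinaryFile "in.warc" ReadMode $ \hIn ->      -- hypothetical input path
    withBinaryFile "out.warc" WriteMode $ \hOut -> do -- hypothetical output path
        -- The result is the producer of unparsed leftover input, ignored here.
        _leftover <- runEffect (reencode (parseWarc (PB.fromHandle hIn)) >-> PB.toHandle hOut)
        return ()
```

Replacing `step` with a function that inspects or rewrites the header before encoding would turn the same skeleton into a filter.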
Headers
module Data.Warc.Header