warc-0.3.1: A parser for the Web Archive (WARC) format

Safe Haskell: None
Language: Haskell2010

Data.Warc

Description

WARC (Web ARChive) is an archival file format widely used to distribute corpora of crawled web content (see, for instance, the Common Crawl corpus). A WARC file consists of a sequence of records, each of which describes a web request or response.

This module provides a streaming parser and encoder for WARC archives for use with the pipes package.

Documentation

type Warc m a = FreeT (Record m) m (Producer ByteString m a)

A WARC archive.

This represents a sequence of records followed by whatever data was left over from the parse.

data Record m r

A WARC record.

This represents a single record of a WARC file, consisting of a set of headers and a means of producing the record's body.

Constructors

Record 

Fields

Instances

Monad m => Functor (Record m)

Methods

fmap :: (a -> b) -> Record m a -> Record m b

(<$) :: a -> Record m b -> Record m a

Parsing

parseWarc

Arguments

:: (Functor m, Monad m) 
=> Producer ByteString m a

a producer of a stream of WARC content

-> Warc m a

the parsed WARC archive

Parse a WARC archive.

Note that this function does not actually do any parsing itself; it merely returns a Warc value which can then be run to parse individual records.
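As a rough illustration of this laziness, the sketch below steps a Warc value by hand with runFreeT to obtain only the first record's header. It rests on several assumptions not confirmed on this page: that FreeT is the one from Control.Monad.Trans.Free in the free package, that the Record constructor takes the header first and the body producer second, and that RecordHeader is re-exported from Data.Warc. The name firstHeader is hypothetical and not part of this module.

```haskell
import Control.Monad.Trans.Free (FreeF (..), runFreeT)
import Data.ByteString (ByteString)
import Data.Warc (Record (..), RecordHeader, parseWarc)
import Pipes (Producer)

-- | Hypothetical helper: return the header of the first record, if any.
-- Assumes the Record constructor is applied as @Record header body@.
firstHeader :: Monad m => Producer ByteString m a -> m (Maybe RecordHeader)
firstHeader input = do
  -- Building the Warc value consumes no input by itself.
  let warc = parseWarc input
  -- Running one layer of the FreeT is what triggers the first parse.
  step <- runFreeT warc
  case step of
    Pure _leftover      -> return Nothing    -- no record could be read
    Free (Record hdr _) -> return (Just hdr) -- the body is never run here
```

With Pipes.ByteString from the pipes-bytestring package, the input could be something like PB.fromHandle h for a handle opened in binary mode.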

iterRecords

Arguments

:: Monad m 
=> (forall b. Record m b -> m b)

the action to run on each Record

-> Warc m a

the Warc file

-> m (Producer ByteString m a)

returns any leftover data

Iterate over the Records in a WARC archive.
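A minimal sketch of iterRecords, counting the records in an archive. Because the callback has type Record m b -> m b, the only way it can produce the required b is to run the record's body to completion, which is why the body is drained here. The sketch assumes that the Record constructor takes the header first and the body producer second; countRecords and visit are hypothetical names.

```haskell
import Data.ByteString (ByteString)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import Data.Warc (Record (..), iterRecords, parseWarc)
import Pipes (Producer, runEffect, (>->))
import qualified Pipes.Prelude as P

-- | Hypothetical helper: count the records in a stream of WARC content.
countRecords :: Producer ByteString IO a -> IO Int
countRecords input = do
  counter <- newIORef 0
  -- Any data left over after the last record is ignored in this sketch.
  _leftover <- iterRecords (visit counter) (parseWarc input)
  readIORef counter

-- | Bump the counter, then discard the body. Draining the body yields the
-- value of type @b@ that iterRecords needs in order to continue.
-- Assumes the Record constructor is applied as @Record header body@.
visit :: IORef Int -> Record IO b -> IO b
visit counter (Record _hdr body) = do
  modifyIORef' counter (+ 1)
  runEffect (body >-> P.drain)
```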

produceRecords

Arguments

:: Monad m 
=> (forall b. RecordHeader -> Producer ByteString m b -> Producer o m b)

consume the record, producing some output

-> Warc m a

a WARC archive (see parseWarc)

-> Producer o m (Producer ByteString m a)

returns any leftover data
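One way produceRecords might be used is to turn each record into a single output value. The sketch below yields the body size in bytes of every record, using fold' from Pipes.Prelude so that the body's return value is preserved. It assumes that RecordHeader is re-exported from Data.Warc; bodySize and bodySizes are hypothetical names.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Warc (RecordHeader, parseWarc, produceRecords)
import Pipes (Producer, lift, yield)
import qualified Pipes.Prelude as P

-- | Hypothetical callback: consume one record body and emit its length.
bodySize :: Monad m => RecordHeader -> Producer ByteString m b -> Producer Int m b
bodySize _hdr body = do
  -- fold' sums the chunk lengths and also returns the body's final value.
  (n, r) <- lift (P.fold' (\acc chunk -> acc + BS.length chunk) 0 id body)
  yield n
  return r

-- | Hypothetical helper: stream the body size of every record.
-- The result of the producer is whatever input was left unparsed.
bodySizes :: Monad m => Producer ByteString m a -> Producer Int m (Producer ByteString m a)
bodySizes = produceRecords bodySize . parseWarc
```

The sizes could then be collected with, for example, P.toListM after discarding the leftover producer.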

Encoding

encodeRecord :: Monad m => Record m a -> Producer ByteString m a

Encode a Record in WARC format.
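As a sketch of how encoding can be combined with parsing, the following re-encodes every record of an archive and writes the result to a second file. It assumes that the Record constructor takes the header first and the body producer second, and it uses fromHandle and toHandle from the pipes-bytestring package; reencode and copyWarc are hypothetical names. Whether the output is byte-for-byte identical to the input is not something this page states.

```haskell
import Data.ByteString (ByteString)
import Data.Functor (void)
import Data.Warc (Record (..), encodeRecord, parseWarc, produceRecords)
import Pipes (Producer, runEffect, (>->))
import qualified Pipes.ByteString as PB
import System.IO (IOMode (..), withBinaryFile)

-- | Hypothetical helper: parse a stream and re-encode each record as it is read.
-- Assumes the Record constructor is applied as @Record header body@.
reencode :: Monad m => Producer ByteString m a -> Producer ByteString m (Producer ByteString m a)
reencode = produceRecords (\hdr body -> encodeRecord (Record hdr body)) . parseWarc

-- | Hypothetical helper: copy a WARC file record by record.
copyWarc :: FilePath -> FilePath -> IO ()
copyWarc src dst =
  withBinaryFile src ReadMode $ \hIn ->
    withBinaryFile dst WriteMode $ \hOut ->
      -- void discards any leftover input that followed the last record.
      runEffect (void (reencode (PB.fromHandle hIn)) >-> PB.toHandle hOut)
```

A filtering or rewriting step could be placed inside the callback, between receiving the header and body and calling encodeRecord.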

Headers