# streamly-archive
[![Hackage](https://img.shields.io/hackage/v/streamly-archive.svg?style=flat)](https://hackage.haskell.org/package/streamly-archive)
![CI](https://github.com/shlok/streamly-archive/workflows/CI/badge.svg?branch=master)
Stream data from archives (tar, tar.gz, zip, or any other format [supported by libarchive](https://github.com/libarchive/libarchive/wiki/LibarchiveFormats)) using the Haskell [streamly](https://hackage.haskell.org/package/streamly) library.
## Requirements
Install libarchive on your system.
* Debian Linux: `sudo apt-get install libarchive-dev`.
* macOS: `brew install libarchive`.
## Quick start
```haskell
{-# LANGUAGE ScopedTypeVariables, TypeApplications #-}
module Main where
import Crypto.Hash (hashFinalize, hashInit, hashUpdate)
import Crypto.Hash.Algorithms (SHA256)
import Data.ByteString (ByteString)
import Data.Either (isRight)
import Data.Function ((&))
import Data.Maybe (fromJust, fromMaybe)
import Data.Void (Void)
import Streamly.External.Archive (Header, headerPathName, readArchive)
import Streamly.Internal.Data.Fold.Types (Fold (..))
import Streamly.Internal.Data.Unfold.Types (Unfold)
import qualified Streamly.Prelude as S
main :: IO ()
main = do
    -- Obtain an unfold for the archive.
    -- For each entry in the archive, we will get a Header followed
    -- by zero or more ByteStrings containing chunks of file data.
    let unf :: Unfold IO Void (Either Header ByteString)
            = readArchive "/path/to/archive.tar.gz"

    -- Create a fold for converting each entry (which, as we saw
    -- above, is a Left followed by zero or more Rights) into a
    -- path and corresponding SHA-256 hash (Nothing for no data).
    -- The three arguments to Fold are the step function, the
    -- initial accumulator, and the final extraction function.
    let entryFold :: Fold IO (Either Header ByteString) (String, Maybe String)
            = Fold
                (\(mpath, mctx) e ->
                    case e of
                        Left h -> do
                            mpath' <- headerPathName h
                            return (mpath', mctx)
                        Right bs ->
                            return (mpath,
                                Just . (`hashUpdate` bs) $
                                    fromMaybe (hashInit @SHA256) mctx))
                (return (Nothing, Nothing))
                (\(mpath, mctx) ->
                    -- Note: fromJust fails if the header had no path name.
                    return (show $ fromJust mpath,
                            show . hashFinalize <$> mctx))

    -- Execute the stream, grouping at the headers (the Lefts) using the
    -- above fold, and output the paths and SHA-256 hashes along the way.
    -- The unfold's seed has type Void, so we pass undefined for it.
    S.unfold unf undefined
        & S.groupsBy (\e _ -> isRight e) entryFold
        & S.mapM_ print
```
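The grouping step above can be easier to follow on plain lists. The following is a minimal sketch that only models the shape of the stream (a `Left` header followed by zero or more `Right` chunks); it does not use streamly or this library, and `groupEntries` is a hypothetical helper name introduced here for illustration.

```haskell
import Data.Either (isRight)

-- | Split a flat list of events into per-entry groups. Each group starts
-- at a Left (the header) and collects the Rights (the data chunks) that
-- immediately follow it, up to the next Left.
groupEntries :: [Either h c] -> [(h, [c])]
groupEntries [] = []
groupEntries (Left h : rest) =
    let (chunks, rest') = span isRight rest
     in (h, [c | Right c <- chunks]) : groupEntries rest'
-- A chunk with no preceding header is skipped in this sketch.
groupEntries (Right _ : rest) = groupEntries rest
```

For example, in GHCi, `groupEntries [Left "a.txt", Right "he", Right "llo", Left "empty", Left "b.txt", Right "x"]` evaluates to `[("a.txt",["he","llo"]),("empty",[]),("b.txt",["x"])]`. An entry without data yields an empty chunk list, which corresponds to the `Nothing` hash in the quick start code.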
## Benchmarks
See `./bench/README.md`. We find on our machine† that (1) reading an archive using this library is just as fast as using plain Haskell `IO` code; and that (2) both are somewhere between 1.7x (large files) and 2.5x (many 1-byte files) slower than C.
The first result fulfills the promise of [streamly](https://hackage.haskell.org/package/streamly) and stream fusion. The gap to C is presumably explained by the marshalling of data into the Haskell world and is currently small enough for our purposes.
† [Linode](https://linode.com); Debian 10, Dedicated 32GB: 16 CPU, 640GB Storage, 32GB RAM.