algebraic-graphs-io-0.2: I/O utilities and datasets for algebraic-graphs
Safe Haskell: None
Language: Haskell2010

Algebra.Graph.IO.Datasets.LINQS.Citeseer

Description

Citeseer document classification dataset, from:

Qing Lu and Lise Getoor. "Link-based classification." ICML, 2003.

https://linqs.soe.ucsc.edu/data

1. Download the dataset

stash

Arguments

:: FilePath

directory where the data files will be saved

-> IO () 

Download, parse, serialize and save the dataset to local storage

2. Reconstruct the citation graph

citeseerGraph

Arguments

:: FilePath

directory where the data files were saved

-> IO (Graph ContentRow) 

Reconstruct the citation graph

NB: this relies on the user having stashed the dataset to local disk first (see stash).
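The two steps above might be combined as in the following minimal sketch. It assumes the module exports stash and citeseerGraph with the signatures documented here, and uses vertexCount and edgeCount from Algebra.Graph in algebraic-graphs. The helper names graphSize and fetchAndSummarise are hypothetical. Whether stash creates the target directory is not stated on this page, so the sketch creates it beforehand.

```haskell
import qualified Algebra.Graph as G
import Algebra.Graph.IO.Datasets.LINQS.Citeseer (citeseerGraph, stash)
import System.Directory (createDirectoryIfMissing)

-- | Number of distinct vertices and edges of a graph.
graphSize :: Ord a => G.Graph a -> (Int, Int)
graphSize g = (G.vertexCount g, G.edgeCount g)

-- | Download the dataset once, then rebuild the citation graph from disk.
-- The directory is chosen by the caller; creating it first is a precaution,
-- not something this page says is required.
fetchAndSummarise :: FilePath -> IO (Int, Int)
fetchAndSummarise dir = do
  createDirectoryIfMissing True dir
  stash dir
  graphSize <$> citeseerGraph dir
```

Once the data files are on disk, later runs can skip the stash call and use citeseerGraph alone.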

citeseerGraphEdges

Arguments

:: (MonadResource m, MonadThrow m) 
=> FilePath

directory of data files

-> Map String (Seq Int16, DocClass)

content data

-> ConduitT i (Maybe (Graph ContentRow)) m () 

Stream out the edges of the citation graph, in which the nodes are decorated with the document metadata.

The full citation graph can be reconstructed by folding over this stream and overlaying the graph edges as they arrive.

This way the graph can be partitioned into training, test and validation subsets at the usage site.
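The fold described above can be sketched as follows, assuming the conduit package's Conduit module (runConduitRes, foldlC, .|) and overlay and empty from Algebra.Graph. The names overlayMaybe and fullGraph are hypothetical. This page does not say what a Nothing element in the stream means, so the sketch simply skips those elements.

```haskell
import qualified Algebra.Graph as G
import Algebra.Graph.IO.Datasets.LINQS.Citeseer
  (ContentRow, citeseerGraphEdges, restoreContent)
import Conduit (foldlC, runConduitRes, (.|))

-- | Overlay one stream element onto the accumulated graph, skipping 'Nothing'.
overlayMaybe :: G.Graph a -> Maybe (G.Graph a) -> G.Graph a
overlayMaybe acc = maybe acc (G.overlay acc)

-- | Rebuild the full citation graph by folding over the edge stream.
-- Assumes the dataset was stashed in the given directory beforehand.
fullGraph :: FilePath -> IO (G.Graph ContentRow)
fullGraph dir = do
  content <- restoreContent dir
  runConduitRes $ citeseerGraphEdges dir content .| foldlC overlayMaybe G.empty
```

To split the data, the fold step could route each incoming edge to one of several accumulators instead of a single graph.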

restoreContent

Arguments

:: FilePath

directory where the data files are saved

-> IO (Map String (Seq Int16, DocClass)) 

Load the graph node data from local storage
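As an illustration of working with the returned map, the sketch below tabulates how many documents carry each class label. It assumes the dataset was stashed beforehand and relies on the Ord DocClass instance listed on this page. The names classCounts and classDistribution are hypothetical.

```haskell
import Algebra.Graph.IO.Datasets.LINQS.Citeseer (DocClass, restoreContent)
import qualified Data.Map.Strict as M

-- | Count how many entries carry each label (the second tuple component).
classCounts :: Ord c => M.Map k (f, c) -> M.Map c Int
classCounts = M.fromListWith (+) . map (\(_, c) -> (c, 1)) . M.elems

-- | Load the stashed node data and tabulate the class distribution.
classDistribution :: FilePath -> IO (M.Map DocClass Int)
classDistribution dir = classCounts <$> restoreContent dir
```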

Types

data ContentRow

Dataset row of the .content file

The .content file contains descriptions of the papers in the following format:

<paper_id> <word_attributes> <class_label>

The first entry in each line contains the unique string ID of the paper, followed by binary values indicating whether each word in the vocabulary is present (1) or absent (0) in the paper. The vocabulary contains 3703 unique words. The last entry in the line contains the class label of the paper.
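The row format can be illustrated with a small standalone parser. This is a sketch only, not the parser used by the library: it splits on any whitespace, keeps the class label as a plain string, and does not check that there are exactly 3703 word attributes. The name parseContentLine is hypothetical.

```haskell
-- | Parse one line of a .content file into (paper ID, word bits, class label).
-- Returns 'Nothing' if the line has fewer than two entries or if a word
-- attribute is anything other than "0" or "1".
parseContentLine :: String -> Maybe (String, [Int], String)
parseContentLine line =
  case words line of
    (pid : rest@(_ : _)) -> do
      bits <- traverse readBit (init rest)
      pure (pid, bits, last rest)
    _ -> Nothing
  where
    readBit :: String -> Maybe Int
    readBit "0" = Just 0
    readBit "1" = Just 1
    readBit _   = Nothing
```

For example, a made-up line such as "paper42 0 1 0 ML" would parse to the ID "paper42", the bits [0,1,0] and the label "ML".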

Constructors

CRow

Fields

crId :: String

crFeatures :: Seq Int16

crClass :: DocClass

Instances

Eq ContentRow

Defined in Algebra.Graph.IO.Datasets.LINQS.Citeseer

Ord ContentRow

Show ContentRow

Generic ContentRow

Associated Types

type Rep ContentRow :: Type -> Type

Binary ContentRow

type Rep ContentRow

type Rep ContentRow = D1 ('MetaData "ContentRow" "Algebra.Graph.IO.Datasets.LINQS.Citeseer" "algebraic-graphs-io-0.2-HM3hsaOKtKl5JJVlqGeywp" 'False) (C1 ('MetaCons "CRow" 'PrefixI 'True) (S1 ('MetaSel ('Just "crId") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 String) :*: (S1 ('MetaSel ('Just "crFeatures") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 (Seq Int16)) :*: S1 ('MetaSel ('Just "crClass") 'NoSourceUnpackedness 'NoSourceStrictness 'DecidedLazy) (Rec0 DocClass))))

data DocClass

Document classes of the Citeseer dataset

Constructors

Agents 
AI 
DB 
IR 
ML 
HCI 
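Because DocClass has Enum and Eq instances, class labels can be turned into numeric targets for a classifier. The sketch below assumes the Enum instance is derived, so that enumFrom Agents lists the six constructors in declaration order; this page does not confirm that. The names allClasses and oneHot are hypothetical.

```haskell
import Algebra.Graph.IO.Datasets.LINQS.Citeseer (DocClass (..))

-- | All document classes in constructor order.
-- Assumption: the 'Enum' instance is derived, so 'enumFrom' stops at 'HCI'.
allClasses :: [DocClass]
allClasses = enumFrom Agents

-- | One-hot encoding of a class label, one slot per class.
oneHot :: DocClass -> [Int]
oneHot c = [if k == c then 1 else 0 | k <- allClasses]
```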

Instances

Enum DocClass

Defined in Algebra.Graph.IO.Datasets.LINQS.Citeseer

Eq DocClass

Ord DocClass

Show DocClass

Generic DocClass

Associated Types

type Rep DocClass :: Type -> Type

Methods

from :: DocClass -> Rep DocClass x

to :: Rep DocClass x -> DocClass

Binary DocClass

Methods

put :: DocClass -> Put

get :: Get DocClass

putList :: [DocClass] -> Put

type Rep DocClass

type Rep DocClass = D1 ('MetaData "DocClass" "Algebra.Graph.IO.Datasets.LINQS.Citeseer" "algebraic-graphs-io-0.2-HM3hsaOKtKl5JJVlqGeywp" 'False) ((C1 ('MetaCons "Agents" 'PrefixI 'False) (U1 :: Type -> Type) :+: (C1 ('MetaCons "AI" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "DB" 'PrefixI 'False) (U1 :: Type -> Type))) :+: (C1 ('MetaCons "IR" 'PrefixI 'False) (U1 :: Type -> Type) :+: (C1 ('MetaCons "ML" 'PrefixI 'False) (U1 :: Type -> Type) :+: C1 ('MetaCons "HCI" 'PrefixI 'False) (U1 :: Type -> Type))))