krapsh-0.1.6.1: Haskell bindings for Spark Dataframes and Datasets

Safe Haskell: None
Language: Haskell2010

Spark.Core.Functions

Documentation

dataset :: (ToSQL a, SQLTypeable a) => [a] -> Dataset a
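
A minimal sketch, assuming the Int instances of ToSQL and SQLTypeable shipped with the library; everything unqualified comes from Spark.Core.Functions:

    -- Builds a distributed dataset from a local list.
    ds :: Dataset Int
    ds = dataset [1, 2, 3, 4]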

dataframe :: DataType -> [Cell] -> DataFrame

Creates a dataframe from a list of cells and a datatype.

Will fail if the content of the cells is not compatible with the data type.
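
A sketch of the call shape only: dt and cells below are placeholders for a DataType and matching Cell values built elsewhere, and no particular constructors of those types are assumed here.

    df :: DataFrame
    df = dataframe dt cells
      where
        dt    = undefined :: DataType  -- the expected type of every cell
        cells = undefined :: [Cell]    -- one Cell per row, each compatible with dt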

collect :: forall ref a. SQLTypeable a => Column ref a -> LocalData [a]

Collects all the elements of a column into a list.

NOTE: The list is sorted in the canonical ordering of the data type: no matter how the data is stored by Spark, the result is always returned in the same order. This is a departure from Spark, which does not guarantee an ordering on the returned data.
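
A sketch, assuming asCol (from Spark.Core.Column in this package) turns a dataset into a column over itself; treat that name as an assumption rather than a confirmed part of the API:

    ds :: Dataset Int
    ds = dataset [1, 2, 3]

    xs :: LocalData [Int]
    xs = collect (asCol ds)   -- a local observable holding all elements, in canonical order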

count :: forall a. SQLTypeable a => Dataset a -> LocalData Int

The number of elements in a dataset.
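
A minimal sketch; the result is a local observable that is only evaluated when the computation graph is executed (for instance with exec1Def from Spark.Core.Context, a name assumed here from the project's examples):

    ds :: Dataset Int
    ds = dataset [1, 2, 3, 4]

    c :: LocalData Int
    c = count ds   -- evaluates to 4 once executed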

identity :: ComputeNode loc a -> ComputeNode loc a

The identity function.

Returns a compute node with the same datatype and the same content as the input node. If the operation of the input has a side effect, that side effect is *not* reevaluated.

This operation is typically used when establishing an ordering between some operations such as caching or side effects, along with logicalDependencies.
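
A minimal sketch of the first half of that pattern (attaching the extra dependencies with logicalDependencies is not shown, since its exact signature is not documented here):

    ds :: Dataset Int
    ds = dataset [1, 2, 3]

    ds' :: Dataset Int
    ds' = identity ds   -- same datatype and content as ds; the node itself is distinct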

autocache :: Dataset a -> Dataset a

Automatically caches the dataset on an as-needed basis, and performs the deallocation when the dataset is no longer required.

This function marks a dataset as eligible for the default caching level in Spark. The current implementation performs caching only if it can be established that the dataset is going to be involved in more than one shuffling or aggregation operation.

If the dataset has no observable child, no uncaching operation is added: the autocache operation is equivalent to unconditional caching.
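
A sketch of a graph in which ds is a caching candidate, because it feeds more than one aggregation:

    ds :: Dataset Int
    ds = autocache (dataset [1, 2, 3, 4])

    c1, c2 :: LocalData Int
    c1 = count ds   -- first aggregation over ds
    c2 = count ds   -- second aggregation: ds may now actually be cached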

cache :: Dataset a -> Dataset a

Caches the dataset.

This function instructs Spark to cache the dataset with Spark's default persistence level (MEMORY_AND_DISK).

Note that the dataset will have to be evaluated first for the caching to take effect, so it is usual to call count or other aggregators to force the caching to occur.
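
A minimal sketch of that idiom: the count forces an evaluation of the cached dataset, which makes the caching effective:

    ds :: Dataset Int
    ds = cache (dataset [1, 2, 3, 4])

    c :: LocalData Int
    c = count ds   -- evaluating this observable materializes the cache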

uncache :: ComputeNode loc a -> ComputeNode loc a

Uncaches the dataset.

This function instructs Spark to unmark the dataset as cached, so that the disk and memory it uses can be reclaimed by Spark in the future.

Unlike Spark, Krapsh is stricter with the uncaching operation:

- the argument of uncache must be a cached dataset
- once a dataset is uncached, its cached version cannot be used again (i.e. it must be recomputed)

Krapsh performs escape analysis and will refuse to run programs with caching issues.
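
A sketch of the cache/uncache pairing these rules describe: the aggregation reads the cached dataset, and the cached value is not referenced again after the uncache node.

    cached :: Dataset Int
    cached = cache (dataset [1, 2, 3, 4])

    c :: LocalData Int
    c = count cached    -- uses the cached dataset

    released :: Dataset Int
    released = uncache cached   -- cached must not be consumed again after this node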

(@@) :: CanRename a txt => a -> txt -> a
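
There is no prose documentation for this operator; the sketch below follows the project's examples and assumes it assigns a user-visible name to a node or column (the string literals may need OverloadedStrings, depending on the txt instance):

    ds :: Dataset Int
    ds = dataset [1, 2, 3, 4] @@ "initial_set"

    c :: LocalData Int
    c = count ds @@ "my_count"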