Copyright	(c) 2010 Bryan O'Sullivan
License	BSD-style
Maintainer	bos@serpentine.com
Stability	experimental
Portability	GHC
Safe Haskell	None
Language	Haskell98

Data.Text.ICU.Break

Contents

Types
Breaking functions
Iteration functions
Iterator status
Locales

Description

String breaking functions for Unicode, implemented as bindings to the International Components for Unicode (ICU) libraries.

The text boundary positions are found according to the rules described in Unicode Standard Annex #29, Unicode Text Segmentation, and Unicode Standard Annex #14, Line Breaking Properties. These are available at http://www.unicode.org/reports/tr29/ and http://www.unicode.org/reports/tr14/, respectively.

Synopsis

data BreakIterator a
data Line
    = Soft
    | Hard
data Word
    = Uncategorized
    | Number
    | Letter
    | Kana
    | Ideograph
breakCharacter :: LocaleName -> Text -> IO (BreakIterator ())
breakLine :: LocaleName -> Text -> IO (BreakIterator Line)
breakSentence :: LocaleName -> Text -> IO (BreakIterator ())
breakWord :: LocaleName -> Text -> IO (BreakIterator Word)
clone :: BreakIterator a -> IO (BreakIterator a)
setText :: BreakIterator a -> Text -> IO ()
current :: BreakIterator a -> IO (Maybe I16)
first :: BreakIterator a -> IO I16
last :: BreakIterator a -> IO I16
next :: BreakIterator a -> IO (Maybe I16)
previous :: BreakIterator a -> IO (Maybe I16)
preceding :: BreakIterator a -> Int -> IO (Maybe I16)
following :: BreakIterator a -> Int -> IO (Maybe I16)
isBoundary :: BreakIterator a -> Int -> IO Bool
getStatus :: BreakIterator a -> IO a
getStatuses :: BreakIterator a -> IO [a]
available :: [LocaleName]

Types

data BreakIterator a Source #

data Line Source #

Line break status.

Constructors

Soft	A soft line break is a position at which a line break is acceptable, but not required.
Hard	A hard line break is a position at which a line break is required.

Instances


Enum Line Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
succ :: Line -> Line #
pred :: Line -> Line #
toEnum :: Int -> Line #
fromEnum :: Line -> Int #
enumFrom :: Line -> [Line] #
enumFromThen :: Line -> Line -> [Line] #
enumFromTo :: Line -> Line -> [Line] #
enumFromThenTo :: Line -> Line -> Line -> [Line] #

Eq Line Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
(==) :: Line -> Line -> Bool #
(/=) :: Line -> Line -> Bool #

Show Line Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
showsPrec :: Int -> Line -> ShowS #
show :: Line -> String #
showList :: [Line] -> ShowS #

NFData Line Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
rnf :: Line -> () #

data Word Source #

Word break status.

Constructors

Uncategorized	A "word" that does not fit into another category. Includes spaces and most punctuation.
Number	A word that appears to be a number.
Letter	A word containing letters, excluding hiragana, katakana or ideographic characters.
Kana	A word containing kana characters.
Ideograph	A word containing ideographic characters.

Instances


Enum Word Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
succ :: Word -> Word #
pred :: Word -> Word #
toEnum :: Int -> Word #
fromEnum :: Word -> Int #
enumFrom :: Word -> [Word] #
enumFromThen :: Word -> Word -> [Word] #
enumFromTo :: Word -> Word -> [Word] #
enumFromThenTo :: Word -> Word -> Word -> [Word] #

Eq Word Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
(==) :: Word -> Word -> Bool #
(/=) :: Word -> Word -> Bool #

Show Word Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
showsPrec :: Int -> Word -> ShowS #
show :: Word -> String #
showList :: [Word] -> ShowS #

NFData Word Source #
Instance details
Defined in Data.Text.ICU.Break
Methods
rnf :: Word -> () #

Breaking functions

breakCharacter :: LocaleName -> Text -> IO (BreakIterator ()) Source #

Break a string on character boundaries.

Character boundary analysis identifies the boundaries of "Extended Grapheme Clusters", which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.
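
As a rough illustration of grapheme cluster boundaries, the sketch below lists every boundary offset in a text. The helper names collect and graphemeBoundaries are hypothetical and not part of this module, and the sketch assumes that LocaleName and its Root constructor can be imported from Data.Text.ICU.Types.

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (breakCharacter, first, next)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: run an action until it yields Nothing,
-- collecting the results in order.
collect :: IO (Maybe a) -> IO [a]
collect step = step >>= maybe (pure []) (\x -> (x :) <$> collect step)

-- | Hypothetical helper: offsets of every grapheme cluster boundary,
-- including the start and the end of the text.
graphemeBoundaries :: T.Text -> IO [Int]
graphemeBoundaries t = do
  bi <- breakCharacter Root t
  start <- first bi
  rest <- collect (next bi)
  -- fromIntegral converts the index type returned by the iterator to Int.
  pure (map fromIntegral (start : rest))
```

For a text such as T.pack ['e', '\x0301', 'x'], the letter and the combining accent that follows it should fall inside a single cluster, so no boundary is expected between them.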

breakLine :: LocaleName -> Text -> IO (BreakIterator Line) Source #

Break a string on line boundaries.

Line boundary analysis determines where a text string can be broken when line wrapping. The mechanism correctly handles punctuation and hyphenated words.
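
The following sketch shows one way to list line break opportunities together with their Line status. The function name lineBreaks is hypothetical, and the import of Root from Data.Text.ICU.Types is an assumption about where LocaleName is defined.

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (Line (..), breakLine, first, getStatus, next)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: every break opportunity after the start of the
-- text, paired with the status reported for that position.
lineBreaks :: T.Text -> IO [(Int, Line)]
lineBreaks t = do
  bi <- breakLine Root t
  _ <- first bi
  let go = do
        mpos <- next bi
        case mpos of
          Nothing -> pure []
          Just pos -> do
            -- getStatus describes the boundary that next just returned.
            status <- getStatus bi
            ((fromIntegral pos, status) :) <$> go
  go
```

A wrapping routine could treat positions reported as Hard as mandatory breaks and choose among the Soft positions according to the available width.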

breakSentence :: LocaleName -> Text -> IO (BreakIterator ()) Source #

Break a string on sentence boundaries.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

breakWord :: LocaleName -> Text -> IO (BreakIterator Word) Source #

Break a string on word boundaries.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word breaks on both sides.
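
A minimal sketch of word segmentation follows. It pairs each segment with the Word status reported at the boundary that ends it. The name wordSegments is hypothetical, Root is assumed to come from Data.Text.ICU.Types, and the slicing with T.take and T.drop is only valid for ASCII input, where offsets and character counts coincide.

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (Word (..), breakWord, first, getStatus, next)
import Data.Text.ICU.Types (LocaleName (Root))
import Prelude hiding (Word) -- avoid the clash with Prelude's Word type

-- | Hypothetical helper: split an ASCII text into segments, each paired
-- with the status of the boundary that closes it.
wordSegments :: T.Text -> IO [(T.Text, Word)]
wordSegments t = do
  bi <- breakWord Root t
  start <- first bi
  let go from = do
        mto <- next bi
        case mto of
          Nothing -> pure []
          Just to -> do
            status <- getStatus bi
            let a = fromIntegral from
                b = fromIntegral to
            -- ASCII-only assumption: offsets are used as character counts.
            ((T.take (b - a) (T.drop a t), status) :) <$> go to
  go start
```

Filtering the result on a status other than Uncategorized is one way to discard the spaces and punctuation that appear as segments of their own.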

clone :: BreakIterator a -> IO (BreakIterator a) Source #

Thread-safe cloning operation. This is substantially faster than creating a new BreakIterator from scratch.
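
One possible use of clone is to give each thread its own iterator. The sketch below assumes that a single BreakIterator should not be used from two threads at once, which is how ICU iterators generally behave. The names countBoundaries and countBoth are hypothetical, and Root is assumed to come from Data.Text.ICU.Types.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import qualified Data.Text as T
import Data.Text.ICU.Break (BreakIterator, breakWord, clone, first, next, setText)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: count the boundaries an iterator reports,
-- including the one at the start of the text.
countBoundaries :: BreakIterator a -> IO Int
countBoundaries bi = do
  _ <- first bi
  let go n = next bi >>= maybe (pure n) (const (go (n + 1)))
  go 1

-- | Hypothetical helper: scan two texts concurrently, each thread using
-- its own iterator.
countBoth :: T.Text -> T.Text -> IO (Int, Int)
countBoth a b = do
  original <- breakWord Root a
  copy <- clone original -- cheaper than a second call to breakWord
  setText copy b
  result <- newEmptyMVar
  _ <- forkIO (countBoundaries copy >>= putMVar result)
  na <- countBoundaries original
  nb <- takeMVar result
  pure (na, nb)
```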

setText :: BreakIterator a -> Text -> IO () Source #

Point an existing BreakIterator at a new piece of text.
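
When many texts are processed in the same locale, a single iterator can be reused rather than opened again for each one. The sketch below counts the word-like segments in each text of a list. The name wordCounts is hypothetical, and Root is assumed to come from Data.Text.ICU.Types.

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (Word (..), breakWord, first, getStatus, next, setText)
import Data.Text.ICU.Types (LocaleName (Root))
import Prelude hiding (Word) -- avoid the clash with Prelude's Word type

-- | Hypothetical helper: count the segments whose status is not
-- Uncategorized in each text, reusing one iterator throughout.
wordCounts :: [T.Text] -> IO [Int]
wordCounts [] = pure []
wordCounts texts@(t0 : _) = do
  bi <- breakWord Root t0
  let countOne t = do
        setText bi t -- retarget the existing iterator
        _ <- first bi
        let go n = do
              mpos <- next bi
              case mpos of
                Nothing -> pure n
                Just _ -> do
                  status <- getStatus bi
                  go (if status == Uncategorized then n else n + 1)
        go (0 :: Int)
  mapM countOne texts
```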

Iteration functions

Important note: All of the indices accepted and returned by functions in this module are offsets into the raw UTF-16 text array, not a count of code points.
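
Because the offsets count UTF-16 code units, slicing with T.take and T.drop goes wrong as soon as the text contains a character outside the Basic Multilingual Plane. The sketch below slices on code units instead. It assumes a version of the text package whose Data.Text.Foreign module exports takeWord16 and dropWord16 (versions before 2.0), and the name graphemes is hypothetical.

```haskell
import qualified Data.Text as T
import Data.Text.Foreign (dropWord16, takeWord16)
import Data.Text.ICU.Break (breakCharacter, first, next)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: split a text into grapheme clusters by slicing
-- on the UTF-16 offsets the iterator returns.
graphemes :: T.Text -> IO [T.Text]
graphemes t = do
  bi <- breakCharacter Root t
  start <- first bi
  let go from = do
        mto <- next bi
        case mto of
          Nothing -> pure []
          -- Offsets are code units, so slice with the Word16 functions.
          Just to -> (takeWord16 (to - from) (dropWord16 from t) :) <$> go to
  go start
```

For T.pack ['\x1D11E', 'a'], T.length reports 2 characters, while the boundaries lie at code unit offsets 0, 2 and 3, because U+1D11E occupies a surrogate pair.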

current :: BreakIterator a -> IO (Maybe I16) Source #

Return the character index most recently returned by next, previous, first, or last.

first :: BreakIterator a -> IO I16 Source #

Reset the breaker to the beginning of the text to be scanned.

last :: BreakIterator a -> IO I16 Source #

Reset the breaker to the end of the text to be scanned.

next :: BreakIterator a -> IO (Maybe I16) Source #

Advance the iterator and break at the text boundary that follows the current text boundary.

previous :: BreakIterator a -> IO (Maybe I16) Source #

Move the iterator backward and break at the text boundary that precedes the current text boundary.
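
Backward iteration mirrors forward iteration: position the iterator with last, then call previous until it returns Nothing. In the sketch below the module is imported qualified because last would otherwise clash with the Prelude. The names collect and boundariesFromEnd are hypothetical, and Root is assumed to come from Data.Text.ICU.Types.

```haskell
import qualified Data.Text as T
import qualified Data.Text.ICU.Break as Break
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: run an action until it yields Nothing,
-- collecting the results in order.
collect :: IO (Maybe a) -> IO [a]
collect step = step >>= maybe (pure []) (\x -> (x :) <$> collect step)

-- | Hypothetical helper: word boundaries listed from the end of the text
-- back to the start.
boundariesFromEnd :: T.Text -> IO [Int]
boundariesFromEnd t = do
  bi <- Break.breakWord Root t
  end <- Break.last bi
  rest <- collect (Break.previous bi)
  pure (map fromIntegral (end : rest))
```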

preceding :: BreakIterator a -> Int -> IO (Maybe I16) Source #

Determine the text boundary preceding the specified offset.

following :: BreakIterator a -> Int -> IO (Maybe I16) Source #

Determine the text boundary following the specified offset.
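
Taken together, following and preceding can locate the segment that contains a given offset, for instance to select the word under a cursor. This is a sketch under the assumption that following returns the first boundary strictly after the offset and preceding the last boundary strictly before it. The name enclosingWord is hypothetical, and Root is assumed to come from Data.Text.ICU.Types.

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (breakWord, following, preceding)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: start and end offsets of the word segment that
-- contains the given offset, if there is one.
enclosingWord :: T.Text -> Int -> IO (Maybe (Int, Int))
enclosingWord t off = do
  bi <- breakWord Root t
  mend <- following bi off
  case mend of
    Nothing -> pure Nothing
    Just end -> do
      -- The boundary just before the end is the start of the segment.
      mstart <- preceding bi (fromIntegral end)
      pure (fmap (\s -> (fromIntegral s, fromIntegral end)) mstart)
```

An offset at or beyond the end of the text has no following boundary, so the function returns Nothing in that case.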

isBoundary :: BreakIterator a -> Int -> IO Bool Source #

Determine whether the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after the given offset.
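
The side effect can be used to snap an arbitrary offset forward to a boundary: after isBoundary returns, current reports where the iterator was left. A minimal sketch, with the hypothetical name snapToWordBoundary and Root assumed to come from Data.Text.ICU.Types:

```haskell
import qualified Data.Text as T
import Data.Text.ICU.Break (breakWord, current, isBoundary)
import Data.Text.ICU.Types (LocaleName (Root))

-- | Hypothetical helper: report whether the offset is a word boundary,
-- together with the position the iterator is left at afterwards.
snapToWordBoundary :: T.Text -> Int -> IO (Bool, Maybe Int)
snapToWordBoundary t off = do
  bi <- breakWord Root t
  onBoundary <- isBoundary bi off
  pos <- current bi -- first boundary at or after the offset
  pure (onBoundary, fromIntegral <$> pos)
```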

Iterator status

getStatus :: BreakIterator a -> IO a Source #

Return the status from the break rule that determined the most recently returned break position. For rules that do not specify a status, a default value of () is returned.

getStatuses :: BreakIterator a -> IO [a] Source #

Return statuses from all of the break rules that determined the most recently returned break position.

Locales

available :: [LocaleName] Source #

Locales for which text breaking information is available. A BreakIterator in a locale in this list will perform the correct text breaking for the locale.
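
A caller might check this list before choosing a locale. The sketch below assumes that LocaleName has a constructor Locale wrapping a String identifier and that it is exported from Data.Text.ICU.Types. The name hasBreakData and the identifier format (for example "en_US") are illustrative assumptions.

```haskell
import Data.Text.ICU.Break (available)
import Data.Text.ICU.Types (LocaleName (..))

-- | Hypothetical helper: whether the list contains a locale with exactly
-- this identifier.
hasBreakData :: String -> Bool
hasBreakData name = any matches available
  where
    matches (Locale n) = n == name
    matches _ = False
```

The comparison is an exact string match, so it makes no attempt to fall back from a regional identifier to its base language.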