Copyright | (c) 2015 Ben Hamilton |
---|---|
License | BSD-style |
Maintainer | bgertzfield@gmail.com |
Stability | experimental |
Portability | GHC |
Safe Haskell | Safe-Inferred |
Language | Haskell98 |
Synopsis
- data MSpoof
- data OpenFromSourceParseError = OpenFromSourceParseError {
- errFile :: OpenFromSourceParseErrorFile
- parseError :: ParseError
- data SpoofCheck
- data SpoofCheckResult
- data RestrictionLevel
- data SkeletonTypeOverride
- open :: IO MSpoof
- openFromSerialized :: ByteString -> IO MSpoof
- openFromSource :: (ByteString, ByteString) -> IO MSpoof
- getSkeleton :: MSpoof -> Maybe SkeletonTypeOverride -> Text -> IO Text
- getChecks :: MSpoof -> IO [SpoofCheck]
- setChecks :: MSpoof -> [SpoofCheck] -> IO ()
- getRestrictionLevel :: MSpoof -> IO (Maybe RestrictionLevel)
- setRestrictionLevel :: MSpoof -> RestrictionLevel -> IO ()
- getAllowedLocales :: MSpoof -> IO [String]
- setAllowedLocales :: MSpoof -> [String] -> IO ()
- areConfusable :: MSpoof -> Text -> Text -> IO SpoofCheckResult
- spoofCheck :: MSpoof -> Text -> IO SpoofCheckResult
- serialize :: MSpoof -> IO ByteString
Unicode spoof checking API
The spoofCheck
, areConfusable
, and getSkeleton
functions analyze
Unicode text for visually confusable (or "spoof") characters.
For example, Latin, Cyrillic, and Greek all contain unique Unicode values which appear nearly identical on-screen:
A 0041 LATIN CAPITAL LETTER A Α 0391 GREEK CAPITAL LETTER ALPHA А 0410 CYRILLIC CAPITAL LETTER A Ꭺ 13AA CHEROKEE LETTER GO ᴀ 1D00 LATIN LETTER SMALL CAPITAL A ᗅ 15C5 CANADIAN SYLLABICS CARRIER GHO A FF21 FULLWIDTH LATIN CAPITAL LETTER A 𐊠 102A0 CARIAN LETTER A 𝐀 1D400 MATHEMATICAL BOLD CAPITAL A
and so on. To check a string for visually confusable characters:
open
anMSpoof
- optionally configure it with
setChecks
,setRestrictionLevel
, and/orsetAllowedLocales
, then spoofCheck
a single string, useareConfusable
to check if two strings could be confused for each other, or usegetSkeleton
to precompute a "skeleton" string (similar to a hash code) which can be cached and re-used to quickly check (using Unicode string comparison) if two strings are confusable.
By default, these methods will use ICU's bundled copy of
confusables.txt
and confusablesWholeScript.txt,
which could be out of date. To provide your own confusables databases, use
openFromSource
. (To avoid repeatedly parsing these databases, you
can then serialize
your configured MSpoof
and later
openFromSerialized
to load the pre-parsed databases.)
Configurable spoof checker wrapping an opaque handle and optionally wrapping a previously serialized instance.
data OpenFromSourceParseError Source #
Exception thrown with openFromSource
fails to parse one of the input files.
OpenFromSourceParseError | |
|
Instances
Exception OpenFromSourceParseError Source # | |
Show OpenFromSourceParseError Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> OpenFromSourceParseError -> ShowS # show :: OpenFromSourceParseError -> String # showList :: [OpenFromSourceParseError] -> ShowS # | |
NFData OpenFromSourceParseError Source # | |
Defined in Data.Text.ICU.Spoof rnf :: OpenFromSourceParseError -> () # |
data SpoofCheck Source #
SingleScriptConfusable | Makes |
MixedScriptConfusable | Makes Makes |
WholeScriptConfusable | Makes |
AnyCase | By default, spoof checks assume the strings have been processed
through |
RestrictionLevel | Checks that identifiers are no looser than the specified
level passed to |
Invisible | Checks the identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. |
CharLimit | Checks whether the identifier contains only characters from a
specified set (for example, via |
MixedNumbers | Checks that the identifier contains numbers from only a single script. |
AllChecks | Enables all checks. |
AuxInfo | Enables returning a |
Instances
Bounded SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof minBound :: SpoofCheck # maxBound :: SpoofCheck # | |
Enum SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof succ :: SpoofCheck -> SpoofCheck # pred :: SpoofCheck -> SpoofCheck # toEnum :: Int -> SpoofCheck # fromEnum :: SpoofCheck -> Int # enumFrom :: SpoofCheck -> [SpoofCheck] # enumFromThen :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromTo :: SpoofCheck -> SpoofCheck -> [SpoofCheck] # enumFromThenTo :: SpoofCheck -> SpoofCheck -> SpoofCheck -> [SpoofCheck] # | |
Show SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> SpoofCheck -> ShowS # show :: SpoofCheck -> String # showList :: [SpoofCheck] -> ShowS # | |
Eq SpoofCheck Source # | |
Defined in Data.Text.ICU.Spoof (==) :: SpoofCheck -> SpoofCheck -> Bool # (/=) :: SpoofCheck -> SpoofCheck -> Bool # |
data SpoofCheckResult Source #
CheckOK | The string passed all configured spoof checks. |
CheckFailed [SpoofCheck] | The string failed one or more spoof checks. |
CheckFailedWithRestrictionLevel | The string failed one or more spoof checks, and failed to pass the configured restriction level. |
|
Instances
Show SpoofCheckResult Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> SpoofCheckResult -> ShowS # show :: SpoofCheckResult -> String # showList :: [SpoofCheckResult] -> ShowS # | |
Eq SpoofCheckResult Source # | |
Defined in Data.Text.ICU.Spoof (==) :: SpoofCheckResult -> SpoofCheckResult -> Bool # (/=) :: SpoofCheckResult -> SpoofCheckResult -> Bool # |
data RestrictionLevel Source #
ASCII | Checks that the string contains only Unicode values in the range ߝ inclusive. |
SingleScriptRestrictive | Checks that the string contains only characters from a single script. |
HighlyRestrictive | Checks that the string contains only characters from a single script, or from the combinations (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
ModeratelyRestrictive | Checks that the string contains only characters from the combinations (Latin + Cyrillic + Greek + Cherokee), (Latin + Han + Hiragana + Katakana), (Latin + Han + Bopomofo), or (Latin + Han + Hangul). |
MinimallyRestrictive | Allows arbitrary mixtures of scripts. |
Unrestrictive | Allows any valid identifiers, including characters outside of the Identifier Profile. |
Instances
Bounded RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof | |
Enum RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof succ :: RestrictionLevel -> RestrictionLevel # pred :: RestrictionLevel -> RestrictionLevel # toEnum :: Int -> RestrictionLevel # fromEnum :: RestrictionLevel -> Int # enumFrom :: RestrictionLevel -> [RestrictionLevel] # enumFromThen :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromTo :: RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # enumFromThenTo :: RestrictionLevel -> RestrictionLevel -> RestrictionLevel -> [RestrictionLevel] # | |
Show RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof showsPrec :: Int -> RestrictionLevel -> ShowS # show :: RestrictionLevel -> String # showList :: [RestrictionLevel] -> ShowS # | |
Eq RestrictionLevel Source # | |
Defined in Data.Text.ICU.Spoof (==) :: RestrictionLevel -> RestrictionLevel -> Bool # (/=) :: RestrictionLevel -> RestrictionLevel -> Bool # |
data SkeletonTypeOverride Source #
SkeletonSingleScript | By default, |
SkeletonAnyCase | By default, |
Instances
Functions
Open a spoof checker for checking Unicode strings for lookalike
security issues with default options (all SpoofCheck
s except
CharLimit
).
openFromSerialized :: ByteString -> IO MSpoof Source #
openFromSource :: (ByteString, ByteString) -> IO MSpoof Source #
Open a spoof checker with custom rules given the UTF-8 encoded
contents of the confusables.txt
and confusablesWholeScript.txt
files as described in Unicode UAX #39.
getSkeleton :: MSpoof -> Maybe SkeletonTypeOverride -> Text -> IO Text Source #
Generates re-usable "skeleton" strings which can be used (via Unicode equality) to check if an identifier is confusable with some large set of existing identifiers.
If you cache the returned strings in storage, you must invalidate your cache any time the underlying confusables database changes (i.e., on ICU upgrade).
By default, assumes all input strings have been passed through
toCaseFold
and are lower-case. To change this, pass
SkeletonAnyCase
.
By default, builds skeletons which catch visually confusable
characters across multiple scripts. Pass SkeletonSingleScript
to
override that behavior and build skeletons which catch visually
confusable characters across single scripts.
setChecks :: MSpoof -> [SpoofCheck] -> IO () Source #
Configure the checks performed by a spoof checker.
getRestrictionLevel :: MSpoof -> IO (Maybe RestrictionLevel) Source #
Get the restriction level of a spoof checker.
setRestrictionLevel :: MSpoof -> RestrictionLevel -> IO () Source #
Configure the restriction level of a spoof checker.
getAllowedLocales :: MSpoof -> IO [String] Source #
Get the list of locale names allowed to be used with a spoof checker.
(We don't use LocaleName
since the root and default locales have no
meaning here.)
setAllowedLocales :: MSpoof -> [String] -> IO () Source #
Get the list of locale names allowed to be used with a spoof checker.
(We don't use LocaleName
since the root and default locales have no
meaning here.)
areConfusable :: MSpoof -> Text -> Text -> IO SpoofCheckResult Source #
Check if two strings could be confused with each other.
spoofCheck :: MSpoof -> Text -> IO SpoofCheckResult Source #
Checks if a string could be confused with any other.
serialize :: MSpoof -> IO ByteString Source #
Serializes the rules in this spoof checker to a byte array,
suitable for re-use by openFromSerialized
.
Only includes any data provided to openFromSource
. Does not
include any other state or configuration.