A mirror of the late Dmitry Zuikov's hbs2 peer-to-peer storage. https://hbs2.net

## About

### Data.Text.Fuzzy.Tokenize

A lightweight and multi-functional text tokenizer that supports different kinds of tokenization depending on its settings.

It may be used in different situations: for DSLs, for text markup, or for parsing simple grammars more easily, and sometimes faster, than with mainstream parser combinators or parser generators.

The primary goal of this package is to parse unstructured text data, but it may also be used to parse data formats such as CSV with ease.

Currently it supports the following types of entities: atoms, string literals (for now with a minimal set of escaped characters), punctuation characters, and delimiters.

## Examples

### Simple CSV-like tokenization

```haskell
tokenize (delims ":") "aaa : bebeb : qqq ::::" :: [Text]
-- ["aaa "," bebeb "," qqq "]

tokenize (delims ":" <> sq <> emptyFields) "aaa : bebeb : qqq ::::" :: [Text]
-- ["aaa "," bebeb "," qqq ","","","",""]

tokenize (delims ":" <> sq <> emptyFields) "aaa : bebeb : qqq ::::" :: [Maybe Text]
-- [Just "aaa ",Just " bebeb ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]

tokenize (delims ":" <> sq <> emptyFields) "aaa : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
-- [Just "aaa ",Just " ",Just "bebeb:colon inside",Just " ",Just " qqq ",Nothing,Nothing,Nothing,Nothing]

let spec = sl <> delims ":" <> sq <> emptyFields <> noslits
tokenize spec "   aaa :   'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
-- [Just "aaa ",Just "bebeb:colon inside ",Just "qqq ",Nothing,Nothing,Nothing,Nothing]

let spec = delims ":" <> sq <> emptyFields <> uw <> noslits
tokenize spec "  a  b  c  : 'bebeb:colon inside' : qqq ::::" :: [Maybe Text]
-- [Just "a b c",Just "bebeb:colon inside",Just "qqq",Nothing,Nothing,Nothing,Nothing]
```

### Primitive lisp-like language

```haskell
{-# LANGUAGE QuasiQuotes, ExtendedDefaultRules #-}

import Data.Text (Text)
import Text.InterpolatedString.Perl6 (q)
import Data.Text.Fuzzy.Tokenize

data TTok = TChar Char
          | TSChar Char
          | TPunct Char
          | TText Text
          | TStrLit Text
          | TKeyword Text
          | TEmpty
          deriving (Eq, Ord, Show)

instance IsToken TTok where
  mkChar    = TChar
  mkSChar   = TSChar
  mkPunct   = TPunct
  mkText    = TText
  mkStrLit  = TStrLit
  mkKeyword = TKeyword
  mkEmpty   = TEmpty

main :: IO ()
main = do

  let spec = delims " \n\t" <> comment ";"
                            <> punct "{}()[]<>"
                            <> sq <> sqq
                            <> uw
                            <> keywords ["define","apply","+"]

  let code = [q|
    (define add (a b) ; define simple function
      (+ a b) )
    (define r (add 10 20))
|]

  let toks = tokenize spec code :: [TTok]

  print toks
```

## Notes

### About the delimiter tokens

This type of token appears while processing "delimited" formats and disappears from the results. Currently you will never see it unless normalization is turned off with the 'nn' option.

Delimiters make sense when processing CSV-like formats, but in that case you probably need only the values in the results.

This behavior may change later, but right now delimiters seem pointless in results. If you process some sort of grammar where the delimiter character is important, you may use punctuation instead, i.e.:

```haskell
let spec = delims " \t" <> punct ",;()" <> emptyFields <> sq
tokenize spec "( delimiters , are , important, 'spaces are not');" :: [Text]
-- ["(","delimiters",",","are",",","important",",","spaces are not",")",";"]
```

### Other

For CSV-like formats it makes sense to split the text into lines first; otherwise newline characters may lead to weird results.
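The line-splitting step above can be sketched as follows, a minimal example that reuses a spec from the CSV examples; the `parseCsvLike` helper name is hypothetical and not part of the library:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.Fuzzy.Tokenize

-- Hypothetical helper: split the input into lines first, then
-- tokenize each line separately, so newline characters never
-- leak into the fields of a row.
parseCsvLike :: Text -> [[Maybe Text]]
parseCsvLike = map (tokenize spec) . T.lines
  where
    spec = delims ":" <> sq <> emptyFields
```

Applied to a multi-line input, this yields one list of fields per line, each behaving exactly like the single-line examples above.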

## Authors

This library was written and maintained by Dmitry Zuikov, dzuikov@gmail.com.