Cheshire3 Object Model - Tokenizer
API
- class cheshire3.baseObjects.Tokenizer(session, config, parent=None)[source]
A Tokenizer takes a string and returns an ordered list of tokens.
A Tokenizer processes a string of natural language to produce an ordered list of tokens.
Example Tokenizers might extract keywords by splitting on whitespace, or identify common word forms using a regular expression.
The incoming string is often wrapped in a data structure (dictionary / hash / associative array), as output by an Extractor.
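The shape of that interface can be sketched in plain Python. This is an illustrative stand-in, not the Cheshire3 implementation: the function name, the word-matching regex, and the hash layout are all assumptions for demonstration.

```python
import re

# Illustrative sketch only: mimic how a Tokenizer might turn the string
# keys of an Extractor-style hash into ordered token lists.
def tokenize_hash(data):
    # data maps extracted strings to occurrence info, e.g. {"Some Text": {...}}
    return {text: re.findall(r"\w+", text) for text in data}

print(tokenize_hash({"Hello tokenizer world": {}}))
```

In the real library the tokenizer is configured in XML and invoked via the session object; the sketch only shows the string-in, token-list-out contract.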
Implementations
The following implementations are included in the distribution by default:
- class cheshire3.tokenizer.RegexpSubTokenizer(session, config, parent)[source]
Substitute regex matches with a character, then split on whitespace.
A Tokenizer that replaces regular expression matches in the data with a configurable character (defaults to whitespace), then splits the result at whitespace.
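The two-step behaviour can be sketched as follows. The pattern and substitute character below are illustrative assumptions; the real class reads both from its configuration (the substitute defaults to whitespace).

```python
import re

# Sketch of RegexpSubTokenizer's behaviour (not the actual implementation):
def regexp_sub_tokenize(data, pattern=r"[.,;:!?]", char=" "):
    # Replace every regex match with the substitute character...
    cleaned = re.sub(pattern, char, data)
    # ...then split the result at whitespace.
    return cleaned.split()

print(regexp_sub_tokenize("red, green; blue."))  # ['red', 'green', 'blue']
```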
- class cheshire3.tokenizer.RegexpSplitTokenizer(session, config, parent)[source]
A Tokenizer that simply splits at the regex matches.
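In contrast to RegexpSubTokenizer, here the regex matches themselves are the split points. A minimal sketch, with an assumed comma-delimiter pattern standing in for the configured one:

```python
import re

# Sketch of RegexpSplitTokenizer: split directly at the regex matches.
def regexp_split_tokenize(data, pattern=r"\s*,\s*"):
    return re.split(pattern, data)

print(regexp_split_tokenize("red, green ,blue"))  # ['red', 'green', 'blue']
```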
- class cheshire3.tokenizer.RegexpFindTokenizer(session, config, parent)[source]
A Tokenizer that returns all words that match the regex.
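This inverts the previous approach: instead of splitting at matches, only the matches are kept. A sketch, with an assumed letters-only word pattern in place of the configured regex:

```python
import re

# Sketch of RegexpFindTokenizer: keep only the substrings that match
# the configured regex (the pattern here is an assumption).
def regexp_find_tokenize(data, pattern=r"[A-Za-z]+"):
    return re.findall(pattern, data)

print(regexp_find_tokenize("2 cats, 3 dogs"))  # ['cats', 'dogs']
```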
- class cheshire3.tokenizer.RegexpFindOffsetTokenizer(session, config, parent)[source]
Find tokens that match regex with character offsets.
A Tokenizer that returns all words that match the regex, and also the character offset at which each word occurs.
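Pairing each token with its starting character offset can be sketched with `re.finditer`. The function name and word pattern are illustrative; the real class returns offsets so that later stages (e.g. proximity searching) can locate tokens in the source text.

```python
import re

# Sketch of RegexpFindOffsetTokenizer: return each matching word together
# with the character offset at which it starts.
def regexp_find_offset_tokenize(data, pattern=r"\w+"):
    tokens, offsets = [], []
    for m in re.finditer(pattern, data):
        tokens.append(m.group())
        offsets.append(m.start())
    return tokens, offsets

print(regexp_find_offset_tokenize("to be"))  # (['to', 'be'], [0, 3])
```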
- class cheshire3.tokenizer.LineTokenizer(session, config, parent)[source]
Trivial but potentially useful Tokenizer to split data on whitespace.
- class cheshire3.tokenizer.DateTokenizer(session, config, parent)[source]
Tokenizer to identify date tokens, and return only these.
Capable of extracting multiple dates, but does so more slowly and less reliably than for single dates.
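The filtering idea — scan the data and return only the date-like tokens — can be sketched as below. The ISO-style pattern is a deliberate simplification; the real class recognises many more date formats.

```python
import re

# Rough sketch of DateTokenizer's idea (not the actual implementation):
# find date-like tokens in the data and discard everything else.
def date_tokenize(data):
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", data)

print(date_tokenize("signed 1902-04-01, ratified 1903-06-15"))
```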
- class cheshire3.tokenizer.DateRangeTokenizer(session, config, parent)[source]
Tokenizer to identify ranges of date tokens, and return only these.
For example:
>>> self.process_string(session, '2003/2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003-2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003 2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
>>> self.process_string(session, '2003 to 2004')
['2003-01-01T00:00:00', '2004-12-31T23:59:59.999999']
For a single date, attempts to expand it into the largest possible range that the data could specify, e.g. 1902-04 means the whole of April 1902.
>>> self.process_string(session, "1902-04")
['1902-04-01T00:00:00', '1902-04-30T23:59:59.999999']