taxonomy_matcher package

Submodules

taxonomy_matcher.data_saver module

A module to output the selected values in the chosen format

class taxonomy_matcher.data_saver.DataSaver(output_file)

Bases: object

DataSaver: - open/create the target output file - save the selected values to the file in the corresponding format

close_stream()

Close the file

store(row)

Store one row at a time
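A minimal illustrative sketch of the behavior described above (open the target file on construction, store one row at a time, close the stream at the end). This is not the package's own implementation; the CSV backend and the `CsvDataSaver` name are assumptions for illustration only.

```python
import csv

# Illustrative DataSaver-like class (assumption: a CSV output format).
class CsvDataSaver:
    def __init__(self, output_file):
        # open/create the target output file
        self.stream = open(output_file, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.stream)

    def store(self, row):
        # store one row at a time
        self.writer.writerow(row)

    def close_stream(self):
        # close the file
        self.stream.close()

saver = CsvDataSaver("matches.csv")
saver.store(["python", 10, 16, "S1.1"])
saver.close_stream()
```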

taxonomy_matcher.data_utils module

taxonomy_matcher.matched_phrase module

MatchedPhrase:

Matched phrases found in text can be saved as MatchedPhrase objects. Apart from the matched phrase itself, the object also adds information such as position, matched code, and context info.

class taxonomy_matcher.matched_phrase.MatchedPhrase(matched_pattern=None, surface_form=None, start_pos=None, end_pos=None, code_id=None, code_description=None, code_category=None, left_context=None, right_context=None, skill_likelihood=None)

Bases: object

Each matched phrase must have the following info: - (normalized) matching pattern - surface form - start position in the original text - end position in the original text

Depending on the input, it may also have: - code_id - code_description - category
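A sketch of the fields described above as a plain dataclass, not the package's own class: the required matching info plus the optional code fields. The `PhraseSketch` name, the sample values, and the inclusive end position are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative dataclass mirroring the documented MatchedPhrase fields
# (assumption: end_pos is inclusive).
@dataclass
class PhraseSketch:
    matched_pattern: str        # (normalized) matching pattern
    surface_form: str           # surface form as it appears in the text
    start_pos: int              # start position in the original text
    end_pos: int                # end position in the original text
    code_id: Optional[str] = None
    code_description: Optional[str] = None
    code_category: Optional[str] = None

text = "experience with machine learning models"
match = PhraseSketch(
    matched_pattern="machine learning",
    surface_form="machine learning",
    start_pos=16,
    end_pos=31,
    code_id="KS120",
)
# The positions recover the surface form from the original text.
assert text[match.start_pos:match.end_pos + 1] == "machine learning"
```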

taxonomy_matcher.matcher module

taxonomy_matcher.token_position module

Token Classes:

Various classes to help store tokens with extra information
  • TokenizedPattern
  • TokenizedMatch
class taxonomy_matcher.token_position.TokenizedMatch(tokens, surface_form='', code_id=None)

Bases: taxonomy_matcher.token_position.TokenizedPattern

TokenizedMatch: tokenized patterns with a code id

Attributes: - tokens: tokens from the pattern - code_id: the code id this pattern belongs to

Differences from TokenizedPattern: - each token in TokenizedMatch.tokens is an instance of TokenWithPos - tokens in TokenizedPattern.tokens are simply the token list without position information

text_range()
class taxonomy_matcher.token_position.TokenizedPattern(tokens, surface_form='', code_id=None)

Bases: object

TokenizedPattern: tokenized patterns with a code id
Attributes: - tokens: tokens from the pattern - code_id: the code id this pattern belongs to - skill_likelihood: the likelihood that this pattern is a skill
pattern_form()
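A sketch of the distinction described above, not the package's own classes: a pattern holds bare tokens, while a match holds tokens that carry their positions in the original text, so a text range can be derived from them. The `TokenWithPosSketch` name, the sample positions, and the range computation are assumptions for illustration.

```python
from dataclasses import dataclass

# Illustrative token-with-position record (assumption: inclusive positions).
@dataclass
class TokenWithPosSketch:
    token: str
    start: int
    end: int

# TokenizedPattern-style: simply the token list, no positions.
pattern_tokens = ["machine", "learning"]

# TokenizedMatch-style: each token is an instance carrying positions.
match_tokens = [
    TokenWithPosSketch("machine", 16, 22),
    TokenWithPosSketch("learning", 24, 31),
]

# A text_range() analogue: span from the first token's start
# to the last token's end in the original text.
text_range = (match_tokens[0].start, match_tokens[-1].end)
```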

taxonomy_matcher.token_trie module

TokenTrie:

Token-based trie structure. The trie is built from a sequence of tokens, and each token may have position information from the original text.

class taxonomy_matcher.token_trie.TokenTrie(patterns=None)

Bases: object

A basic trie class for tokens, used for quick token sequence searching

Parameters:
  • patterns: a list of tokenized pattern objects
  • end_token: a special string to mark the end of a pattern sequence
longest_match_at_position(sub_trie, tokens)

Get the last (naturally the longest) matched phrase starting at a fixed position

params:
  • sub_trie: token_trie or part of the token_trie
  • tokens: tokens from the input text
output:
  • all matched sequences
match_at_position(sub_trie, tokens)

Search all matched phrases starting at a fixed position

params:
  • sub_trie: token_trie or part of the token_trie
  • tokens: tokens from the input text
output:
  • all matched sequences
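The trie search above can be sketched with plain dicts. This is an illustrative re-implementation, not the package's code: patterns are token sequences, a special end token marks pattern ends, and longest-match walks the trie from a fixed start position, keeping the last pattern it completes.

```python
# Special end-of-pattern marker, as described above (name is an assumption).
END = "__end__"

def build_trie(patterns):
    """Build a nested-dict trie from (token_sequence, code_id) pairs."""
    trie = {}
    for tokens, code_id in patterns:
        node = trie
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = code_id  # mark the end of this pattern sequence
    return trie

def longest_match_at_position(trie, tokens):
    """Return the last (naturally the longest) matched prefix of tokens, or None."""
    node, best = trie, None
    for i, token in enumerate(tokens):
        if token not in node:
            break
        node = node[token]
        if END in node:  # a complete pattern ends here; keep the latest one
            best = (tokens[:i + 1], node[END])
    return best

patterns = [(["machine", "learning"], "KS120"),
            (["machine", "learning", "engineer"], "KS121")]
trie = build_trie(patterns)
result = longest_match_at_position(trie, ["machine", "learning", "engineer", "role"])
```

The longer pattern wins because the search keeps walking while tokens still extend a known pattern, overwriting the shorter match.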

taxonomy_matcher.tokenizer module

Module contents

Top-level package for taxonomy-matcher

taxonomy_matcher.define_logger(mod_name)

Set the default logging configuration

taxonomy_matcher.set_logging_level(level=30)

Change the logging level
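A sketch of the two helpers above using the standard logging module, not the package's own code: a per-module logger with a default configuration, and a function to change the level afterwards. The handler format is an assumption; the default level 30 corresponds to logging.WARNING.

```python
import logging

def define_logger(mod_name):
    # Set a default logging configuration for the given module name
    # (format string is an illustrative assumption).
    logger = logging.getLogger(mod_name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)  # default level 30
    return logger

def set_logging_level(level=30):
    # Change the logging level (30 == logging.WARNING)
    logging.getLogger().setLevel(level)

logger = define_logger("taxonomy_matcher")
```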