taxonomy_matcher package¶
Subpackages¶
Submodules¶
taxonomy_matcher.data_saver module¶
A module to output the selected values to the chosen format.
taxonomy_matcher.data_utils module¶
taxonomy_matcher.matched_phrase module¶
MatchedPhrase:
Matched phrases found in text can be saved as MatchedPhrase objects. Apart from the matched phrase itself, the object also adds information such as position, matched code, and context.
class taxonomy_matcher.matched_phrase.MatchedPhrase(matched_pattern=None, surface_form=None, start_pos=None, end_pos=None, code_id=None, code_description=None, code_category=None, left_context=None, right_context=None, skill_likelihood=None)¶
Bases: object
Each matched phrase must have the following information:

- (normalized) matching pattern
- surface form
- start position in the original text
- end position in the original text

Depending on the input, it may also have:

- code_id
- code_description
- category
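As a minimal sketch of how such an object might be constructed, the class below is a hypothetical, simplified stand-in that mirrors the documented signature (the real class lives in taxonomy_matcher.matched_phrase); the sample values are invented for illustration:

```python
# Hypothetical mirror of the documented MatchedPhrase signature;
# not the real taxonomy_matcher.matched_phrase.MatchedPhrase.
class MatchedPhrase:
    def __init__(self, matched_pattern=None, surface_form=None,
                 start_pos=None, end_pos=None, code_id=None,
                 code_description=None, code_category=None,
                 left_context=None, right_context=None,
                 skill_likelihood=None):
        self.matched_pattern = matched_pattern    # (normalized) matching pattern
        self.surface_form = surface_form          # text as it appeared in the input
        self.start_pos = start_pos                # start position in the original text
        self.end_pos = end_pos                    # end position in the original text
        self.code_id = code_id
        self.code_description = code_description
        self.code_category = code_category
        self.left_context = left_context
        self.right_context = right_context
        self.skill_likelihood = skill_likelihood

# "Machine Learning" matched at characters 10-26 of some input text
mp = MatchedPhrase(
    matched_pattern="machine learning",
    surface_form="Machine Learning",
    start_pos=10,
    end_pos=26,
    code_id="KS1234",
    left_context="skilled in ",
    right_context=" and statistics",
)
```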
taxonomy_matcher.matcher module¶
taxonomy_matcher.token_position module¶
Token Classes:
- various classes to help store tokens with extra information
- TokenizedPattern
- TokenizedMatch
class taxonomy_matcher.token_position.TokenizedMatch(tokens, surface_form='', code_id=None)¶
Bases: taxonomy_matcher.token_position.TokenizedPattern
TokenizedMatch: tokenized patterns with a code id.

Attributes:

- tokens: tokens from the pattern
- code_id: the code id this pattern belongs to

Differences from TokenizedPattern:

- each token in TokenizedMatch.tokens is an instance of TokenWithPos
- each token in TokenizedPattern.tokens is simply the token, without position information
text_range()¶
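A minimal sketch of the idea, using hypothetical stand-ins for TokenWithPos and TokenizedMatch (the real classes live in taxonomy_matcher.token_position); text_range is assumed here to return the character span of the whole match in the original text:

```python
from collections import namedtuple

# Hypothetical stand-in: a token plus its character offsets in the source text.
TokenWithPos = namedtuple("TokenWithPos", ["token", "start", "end"])

class TokenizedMatch:
    def __init__(self, tokens, surface_form="", code_id=None):
        self.tokens = tokens              # each token is a TokenWithPos
        self.surface_form = surface_form
        self.code_id = code_id

    def text_range(self):
        # span of the whole match: from the first token's start
        # to the last token's end
        return self.tokens[0].start, self.tokens[-1].end

match = TokenizedMatch(
    [TokenWithPos("machine", 10, 17), TokenWithPos("learning", 18, 26)],
    surface_form="machine learning",
    code_id="KS1234",
)
match.text_range()  # (10, 26)
```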
class taxonomy_matcher.token_position.TokenizedPattern(tokens, surface_form='', code_id=None)¶
Bases: object
TokenizedPattern: tokenized patterns with a code id.

Attributes:

- tokens: tokens from the pattern
- code_id: the code id this pattern belongs to
- skill_likelihood: the likelihood that this pattern is a skill
pattern_form()¶
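A hypothetical sketch of the class, mirroring the documented attributes; here tokens are plain strings (no positions), unlike in TokenizedMatch, and pattern_form is assumed to return a normalized, space-joined form of the tokens:

```python
# Hypothetical stand-in for taxonomy_matcher.token_position.TokenizedPattern;
# attribute names follow the documentation, behavior is assumed.
class TokenizedPattern:
    def __init__(self, tokens, surface_form="", code_id=None,
                 skill_likelihood=None):
        self.tokens = tokens                      # plain token strings
        self.surface_form = surface_form
        self.code_id = code_id                    # code id this pattern belongs to
        self.skill_likelihood = skill_likelihood  # likelihood it is a skill

    def pattern_form(self):
        # normalized, space-joined form of the pattern tokens (assumed)
        return " ".join(self.tokens)

pattern = TokenizedPattern(["machine", "learning"], code_id="KS1234")
pattern.pattern_form()  # 'machine learning'
```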
taxonomy_matcher.token_trie module¶
TokenTrie:
Token-based trie structure. The trie is built from a sequence of tokens, and each token may have position information from the original text.
class taxonomy_matcher.token_trie.TokenTrie(patterns=None)¶
Bases: object
A basic trie class for tokens, used for quick token sequence searching.

Parameters:

- patterns: a list of tokenized pattern objects
- end_token: a special string to mark the end of a pattern sequence
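The build step can be sketched as nested dictionaries, assuming the documented shape: the trie is built from tokenized patterns, and a special end token marks where a complete pattern (and its code id) ends. The END_TOKEN value and the (tokens, code_id) pair format are assumptions for illustration:

```python
# Minimal token-trie sketch; the real implementation is
# taxonomy_matcher.token_trie.TokenTrie.
END_TOKEN = "###END###"  # assumed sentinel marking the end of a pattern

def build_token_trie(patterns):
    """patterns: iterable of (token_list, code_id) pairs (assumed format)."""
    trie = {}
    for tokens, code_id in patterns:
        node = trie
        for token in tokens:
            # descend, creating child nodes as needed
            node = node.setdefault(token, {})
        # mark the end of a complete pattern with its code id
        node[END_TOKEN] = code_id
    return trie

trie = build_token_trie([
    (["machine", "learning"], "KS1234"),
    (["machine", "learning", "engineer"], "KS5678"),
])
```

Storing the code id at the end marker lets one pattern be a prefix of another, as in the two patterns above.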
longest_match_at_position(sub_trie, tokens)¶
Get the last (naturally the longest) matched phrase starting at a fixed position.
Params:

- sub_trie: token_trie or part of the token_trie
- tokens: tokens from the input text

Output:

- all matched sequences
match_at_position(sub_trie, tokens)¶
Search all matched phrases starting at a fixed position.
Params:

- sub_trie: token_trie or part of the token_trie
- tokens: tokens from the input text

Output:

- all matched sequences
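The two search methods can be sketched as follows, assuming a nested-dict trie with a sentinel end token; the trie layout, END_TOKEN value, and return format are assumptions, not the library's actual API:

```python
END_TOKEN = "###END###"  # assumed sentinel marking the end of a pattern

def match_at_position(sub_trie, tokens):
    """Collect all matched phrases starting at the first token."""
    matches = []
    node = sub_trie
    for i, token in enumerate(tokens):
        if END_TOKEN in node:
            # a complete pattern ends just before this token
            matches.append((tokens[:i], node[END_TOKEN]))
        if token not in node:
            return matches
        node = node[token]
    if END_TOKEN in node:
        matches.append((list(tokens), node[END_TOKEN]))
    return matches

def longest_match_at_position(sub_trie, tokens):
    """The last match found is naturally the longest."""
    matches = match_at_position(sub_trie, tokens)
    return matches[-1] if matches else None

# small trie covering "machine learning" and "machine learning engineer"
trie = {"machine": {"learning": {END_TOKEN: "KS1234",
                                 "engineer": {END_TOKEN: "KS5678"}}}}
result = longest_match_at_position(trie, ["machine", "learning", "engineer", "role"])
# result == (['machine', 'learning', 'engineer'], 'KS5678')
```

Because shorter matches are appended before longer ones as the walk descends, taking the last element yields the longest match without a separate comparison.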