taxonomy_matcher package

Submodules

taxonomy_matcher.data_saver module

A module to output the selected values in the chosen format

class taxonomy_matcher.data_saver.DataSaver(output_file)

Bases: object

DataSaver: - open/create the target output file - save the selected values to the file in the corresponding format

close_stream()

Close the file

store(row)

Store one row at a time
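A minimal illustrative sketch of the behavior described above (open the target file on construction, store one row at a time, close the stream at the end). This is not the package's own implementation; the CSV backend and the `CsvDataSaver` name are assumptions for illustration only.

```python
import csv

# Illustrative DataSaver-like class (assumption: a CSV output format).
class CsvDataSaver:
    def __init__(self, output_file):
        # open/create the target output file
        self.stream = open(output_file, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.stream)

    def store(self, row):
        # store one row at a time
        self.writer.writerow(row)

    def close_stream(self):
        # close the file
        self.stream.close()

saver = CsvDataSaver("matches.csv")
saver.store(["python", 10, 16, "S1.1"])
saver.close_stream()
```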

taxonomy_matcher.data_utils module

taxonomy_matcher.matched_phrase module

MatchedPhrase:

Matched phrases found in text can be saved as MatchedPhrase objects. Apart from the matched phrase itself, the object also adds information such as position, matched code, and context info.

class taxonomy_matcher.matched_phrase.MatchedPhrase(matched_pattern=None, surface_form=None, start_pos=None, end_pos=None, code_id=None, code_description=None, code_category=None, left_context=None, right_context=None, skill_likelihood=None)

Bases: object

Each matched phrase must have the following info: - (normalized) matching pattern - surface form - start position in the original text - end position in the original text

Depending on the input, it may also have: - code_id - code_description - category
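A sketch of the fields described above as a plain dataclass, not the package's own class: the required matching info plus the optional code fields. The `PhraseSketch` name, the sample values, and the inclusive end position are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative dataclass mirroring the documented MatchedPhrase fields
# (assumption: end_pos is inclusive).
@dataclass
class PhraseSketch:
    matched_pattern: str        # (normalized) matching pattern
    surface_form: str           # surface form as it appears in the text
    start_pos: int              # start position in the original text
    end_pos: int                # end position in the original text
    code_id: Optional[str] = None
    code_description: Optional[str] = None
    code_category: Optional[str] = None

text = "experience with machine learning models"
match = PhraseSketch(
    matched_pattern="machine learning",
    surface_form="machine learning",
    start_pos=16,
    end_pos=31,
    code_id="KS120",
)
# The positions recover the surface form from the original text.
assert text[match.start_pos:match.end_pos + 1] == "machine learning"
```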

taxonomy_matcher.matcher module

taxonomy_matcher.token_position module

Token Classes:

Various classes to help store tokens with extra information
  • TokenizedPattern
  • TokenizedMatch
class taxonomy_matcher.token_position.TokenizedMatch(tokens, surface_form='', code_id=None)

Bases: taxonomy_matcher.token_position.TokenizedPattern

TokenizedMatch: tokenized patterns with a code id

Attributes: - tokens: tokens from the pattern - code_id: the code id this pattern belongs to

Differences from TokenizedPattern: - each token in TokenizedMatch.tokens is an instance of TokenWithPos - tokens in TokenizedPattern.tokens are simply the token list without position information

text_range()
class taxonomy_matcher.token_position.TokenizedPattern(tokens, surface_form='', code_id=None)

Bases: object

TokenizedPattern: tokenized patterns with a code id
Attributes: - tokens: tokens from the pattern - code_id: the code id this pattern belongs to - skill_likelihood: the likelihood that this pattern is a skill
pattern_form()
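A sketch of the distinction described above, not the package's own classes: a pattern holds bare tokens, while a match holds tokens that carry their positions in the original text, so a text range can be derived from them. The `TokenWithPosSketch` name, the sample positions, and the range computation are assumptions for illustration.

```python
from dataclasses import dataclass

# Illustrative token-with-position record (assumption: inclusive positions).
@dataclass
class TokenWithPosSketch:
    token: str
    start: int
    end: int

# TokenizedPattern-style: simply the token list, no positions.
pattern_tokens = ["machine", "learning"]

# TokenizedMatch-style: each token is an instance carrying positions.
match_tokens = [
    TokenWithPosSketch("machine", 16, 22),
    TokenWithPosSketch("learning", 24, 31),
]

# A text_range() analogue: span from the first token's start
# to the last token's end in the original text.
text_range = (match_tokens[0].start, match_tokens[-1].end)
```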

taxonomy_matcher.token_trie module

TokenTrie:

Token-based trie structure. The trie is built from a sequence of tokens, and each token may have position information from the original text.

class taxonomy_matcher.token_trie.TokenTrie(patterns=None)

Bases: object

A basic trie class for tokens, used for quick token sequence searching

Parameters:
  • patterns: a list of tokenized pattern objects
  • end_token: a special string to mark the end of a pattern sequence
longest_match_at_position(sub_trie, tokens)

Get the last (naturally the longest) matched phrase starting at a fixed position

params:
  • sub_trie: token_trie or part of the token_trie
  • tokens: tokens from the input text
output:
  • all matched sequences
match_at_position(sub_trie, tokens)

Search all matched phrases starting at a fixed position

params:
  • sub_trie: token_trie or part of the token_trie
  • tokens: tokens from the input text
output:
  • all matched sequences
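The trie search above can be sketched with plain dicts. This is an illustrative re-implementation, not the package's code: patterns are token sequences, a special end token marks pattern ends, and longest-match walks the trie from a fixed start position, keeping the last pattern it completes.

```python
# Special end-of-pattern marker, as described above (name is an assumption).
END = "__end__"

def build_trie(patterns):
    """Build a nested-dict trie from (token_sequence, code_id) pairs."""
    trie = {}
    for tokens, code_id in patterns:
        node = trie
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = code_id  # mark the end of this pattern sequence
    return trie

def longest_match_at_position(trie, tokens):
    """Return the last (naturally the longest) matched prefix of tokens, or None."""
    node, best = trie, None
    for i, token in enumerate(tokens):
        if token not in node:
            break
        node = node[token]
        if END in node:  # a complete pattern ends here; keep the latest one
            best = (tokens[:i + 1], node[END])
    return best

patterns = [(["machine", "learning"], "KS120"),
            (["machine", "learning", "engineer"], "KS121")]
trie = build_trie(patterns)
result = longest_match_at_position(trie, ["machine", "learning", "engineer", "role"])
```

The longer pattern wins because the search keeps walking while tokens still extend a known pattern, overwriting the shorter match.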

taxonomy_matcher.tokenizer module

Module contents

Top-level package for taxonomy-matcher

taxonomy_matcher.define_logger(mod_name)

Set the default logging configuration

taxonomy_matcher.set_logging_level(level=30)

Change the logging level
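A sketch of the two helpers above using the standard logging module, not the package's own code: a per-module logger with a default configuration, and a function to change the level afterwards. The handler format is an assumption; the default level 30 corresponds to logging.WARNING.

```python
import logging

def define_logger(mod_name):
    # Set a default logging configuration for the given module name
    # (format string is an illustrative assumption).
    logger = logging.getLogger(mod_name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.WARNING)  # default level 30
    return logger

def set_logging_level(level=30):
    # Change the logging level (30 == logging.WARNING)
    logging.getLogger().setLevel(level)

logger = define_logger("taxonomy_matcher")
```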