cnlpt.cnlp_data module
- class cnlpt.cnlp_data.Split
Bases: Enum
Enum representing the three data splits for model development.
- class cnlpt.cnlp_data.InputFeatures
Bases: object
A single set of features of data. Property names are the same names as the corresponding inputs to a model.
- Parameters
input_ids – Indices of input sequence tokens in the vocabulary.
attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
event_tokens – (Optional)
label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
- to_json_string()
Serializes this instance to a JSON string.
- __init__(input_ids, attention_mask=None, token_type_ids=None, event_tokens=None, label=None)
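The parameters above can be illustrated with a minimal sketch. It does not import cnlpt: `FeaturesSketch` and `pad_with_mask` are hypothetical stand-ins that mirror only the documented signature and the documented mask convention (1 for real tokens, 0 for padding). The token IDs and the `pad_id=0` default are illustrative assumptions.

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class FeaturesSketch:
    """Hypothetical stand-in mirroring the documented InputFeatures signature."""

    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    event_tokens: Optional[List[int]] = None
    label: Optional[Union[int, float]] = None

    def to_json_string(self) -> str:
        # Serialize every field of this instance to a JSON string.
        return json.dumps(dataclasses.asdict(self))


def pad_with_mask(
    token_ids: List[int], max_length: int, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Truncate/pad token_ids to max_length and build the matching mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Illustrative token IDs only; real IDs come from the tokenizer.
ids, mask = pad_with_mask([101, 7592, 2088, 102], max_length=6)
features = FeaturesSketch(input_ids=ids, attention_mask=mask, label=1)
print(features.to_json_string())
```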
- class cnlpt.cnlp_data.HierarchicalInputFeatures
Bases: object
A single set of features of data for the hierarchical model. Property names are the same names as the corresponding inputs to a model.
- Parameters
input_ids – Indices of input sequence tokens in the vocabulary.
attention_mask – Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]: usually 1 for tokens that are NOT MASKED, 0 for MASKED (padded) tokens.
token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.
event_tokens – (Optional)
label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.
- to_json_string()
Serializes this instance to a JSON string.
- __init__(input_ids, attention_mask=None, token_type_ids=None, event_tokens=None, label=None)
- cnlpt.cnlp_data.cnlp_convert_features_to_hierarchical(features, chunk_len, num_chunks, cls_id, sep_id, pad_id, insert_empty_chunk_at_beginning=False)
Chunk an instance of InputFeatures into an instance of HierarchicalInputFeatures for the hierarchical model.
- Parameters
features (InputFeatures) – the old instance
chunk_len (int) – the maximum length of a chunk
num_chunks (int) – the maximum number of chunks in the instance
cls_id (int) – the tokenizer’s ID representing the CLS token
sep_id (int) – the tokenizer’s ID representing the SEP token
pad_id (int) – the tokenizer’s ID representing the PAD token
insert_empty_chunk_at_beginning (bool) – whether to insert an empty chunk at the beginning of the instance
- Return type
HierarchicalInputFeatures
- Returns
an instance of HierarchicalInputFeatures containing the chunked instance
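As a rough illustration of the chunking idea, the sketch below splits a flat list of content token IDs into fixed-length chunks. It is not the cnlpt implementation: `chunk_token_ids` is a hypothetical helper, and wrapping each chunk in CLS/SEP, padding short chunks with PAD, and filling unused chunk slots with all-PAD chunks are all assumptions made here for illustration. The real function may treat special tokens and overflow differently.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int],
    chunk_len: int,
    num_chunks: int,
    cls_id: int,
    sep_id: int,
    pad_id: int,
    insert_empty_chunk_at_beginning: bool = False,
) -> List[List[int]]:
    """Hypothetical sketch: split content tokens into num_chunks chunks."""
    if chunk_len <= 2:
        raise ValueError("chunk_len must leave room for CLS and SEP")
    body_len = chunk_len - 2  # room left after CLS and SEP (assumption)

    chunks: List[List[int]] = []
    if insert_empty_chunk_at_beginning:
        # An "empty" chunk: just CLS and SEP, padded to chunk_len.
        chunks.append([cls_id, sep_id] + [pad_id] * body_len)

    for start in range(0, len(token_ids), body_len):
        if len(chunks) == num_chunks:
            break  # tokens beyond num_chunks chunks are dropped
        chunk = [cls_id] + token_ids[start:start + body_len] + [sep_id]
        chunk += [pad_id] * (chunk_len - len(chunk))
        chunks.append(chunk)

    # Fill any remaining chunk slots with all-PAD chunks.
    while len(chunks) < num_chunks:
        chunks.append([pad_id] * chunk_len)
    return chunks


# Illustrative IDs only; real IDs come from the tokenizer.
example_chunks = chunk_token_ids(
    [1, 2, 3, 4, 5], chunk_len=4, num_chunks=4,
    cls_id=101, sep_id=102, pad_id=0,
)
for chunk in example_chunks:
    print(chunk)
```

Every chunk has exactly `chunk_len` entries and the result always has `num_chunks` chunks, so the output can be stacked into a fixed-shape tensor.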
- cnlpt.cnlp_data.cnlp_convert_examples_to_features(examples, tokenizer, max_length=None, task=None, label_list=None, output_mode=None, token_classify=False, inference=False, hierarchical=False, chunk_len=-1, num_chunks=-1, insert_empty_chunk_at_beginning=False, truncate_examples=False)
Processes the list of transformers.InputExample generated by the processor defined in cnlpt.cnlp_processors.cnlp_processors and converts the examples into a list of InputFeatures or HierarchicalInputFeatures, depending on the model.
- Parameters
examples (List[transformers.data.processors.utils.InputExample]) – the list of examples to convert
tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – the tokenizer
max_length (Optional[int]) – the maximum sequence length at which to truncate examples
task (str) – the task name
label_list (Optional[List[str]]) – the list of labels for this task. If not provided explicitly, it will be retrieved from the processor with transformers.DataProcessor.get_labels().
output_mode (Optional[str]) – the output mode for this task. If not provided explicitly, it will be retrieved from cnlpt.cnlp_processors.cnlp_output_modes.
token_classify (bool) – TODO define
inference (bool) – TODO define
hierarchical (bool) – whether to structure the data for the hierarchical model (cnlpt.HierarchicalTransformer.HierarchicalModel)
chunk_len (int) – for the hierarchical model, the length of each chunk in tokens
num_chunks (int) – for the hierarchical model, the number of chunks
insert_empty_chunk_at_beginning (bool) – for the hierarchical model, whether to insert an empty chunk at the beginning of the list of chunks (equivalent in theory to a CLS chunk).
truncate_examples (bool) – whether to truncate the string representation of the example instances printed to the log
- Return type
- Returns
the list of converted input features
- cnlpt.cnlp_data.truncate_features(feature)
Method to produce a truncated string representation of a feature.
- Parameters
feature (Union[InputFeatures, HierarchicalInputFeatures]) – the feature to represent
- Return type
str
- Returns
the truncated representation of the feature
- class cnlpt.cnlp_data.DataTrainingArguments
Bases: object
Arguments pertaining to the data we are going to input to our model for training and evaluation.
Using transformers.HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
- __init__(data_dir, task_name=<factory>, max_seq_length=128, overwrite_cache=False, weight_classes=False, chunk_len=None, num_chunks=None, insert_empty_chunk_at_beginning=False, truncate_examples=False)
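To show the general idea of turning a dataclass into command-line arguments, here is a standard-library-only sketch. It does not use transformers.HfArgumentParser, which handles many more field types. `DataArgsSketch` is a hypothetical, simplified stand-in that copies only three of the documented fields and their defaults, and `build_parser` is an illustrative helper, not part of cnlpt or transformers.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class DataArgsSketch:
    """Hypothetical, simplified stand-in for DataTrainingArguments."""

    data_dir: str
    max_seq_length: int = 128
    overwrite_cache: bool = False


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field (str, int and bool only)."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        flag = f"--{f.name}"
        if f.type is bool:
            # Boolean fields become switches that default to the field default.
            parser.add_argument(flag, action="store_true", default=f.default)
        elif f.default is MISSING:
            # Fields without a default must be given on the command line.
            parser.add_argument(flag, type=f.type, required=True)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return parser


# Parse an explicit argument list instead of sys.argv for the demonstration.
namespace = build_parser(DataArgsSketch).parse_args(
    ["--data_dir", "data/", "--max_seq_length", "256"]
)
data_args = DataArgsSketch(**vars(namespace))
print(data_args)
```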
- class cnlpt.cnlp_data.ClinicalNlpDataset
Bases: Dataset
Adapted from GlueDataset, with the GLUE task-specific code changed, and moved here so that it is self-contained.
- Parameters
args (DataTrainingArguments) – the data training args for this experiment
tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – the tokenizer
limit_length (Optional[int]) – if provided, the number of examples to include in the dataset
mode (Union[str, Split]) – the data split mode of this dataset ("train", "dev", "test")
cache_dir (Optional[str]) – if provided, the directory to save/load a cache of this dataset
hierarchical (bool) – whether to structure the data for the hierarchical model (cnlpt.HierarchicalTransformer.HierarchicalModel)
- __init__(args, tokenizer, limit_length=None, mode=Split.train, cache_dir=None, hierarchical=False)
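Because mode accepts either a string or a Split member, a dataset of this kind needs to normalize the value at some point. The sketch below shows one way that could look. `SplitSketch` and `resolve_split` are hypothetical names, and the member values are assumed to match the documented mode strings; the actual cnlpt code may handle this differently.

```python
from enum import Enum
from typing import Union


class SplitSketch(Enum):
    """Hypothetical stand-in for Split; member values are assumptions."""

    train = "train"
    dev = "dev"
    test = "test"


def resolve_split(mode: Union[str, SplitSketch]) -> SplitSketch:
    """Accept either a split name or an enum member and return the member."""
    if isinstance(mode, SplitSketch):
        return mode
    try:
        # Look the string up by member name, e.g. "dev" -> SplitSketch.dev.
        return SplitSketch[mode]
    except KeyError:
        raise KeyError(f"mode {mode!r} is not a valid split name") from None


print(resolve_split("dev"))
print(resolve_split(SplitSketch.train))
```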