cnlpt.cnlp_data module

class cnlpt.cnlp_data.Split

Bases: Enum

Enum representing the three data splits for model development.

class cnlpt.cnlp_data.InputFeatures

Bases: object

A single set of features of data. Property names are the same names as the corresponding inputs to a model.

Parameters
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values are either 0 or 1: usually 1 for tokens that are not masked (attended to) and 0 for masked (padding) tokens.

  • token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.

  • event_tokens – (Optional)

  • label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.

to_json_string()

Serializes this instance to a JSON string.
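As an illustration of the fields and the JSON serialization described above, here is a minimal sketch. ToyInputFeatures is a hypothetical stand-in that only mirrors the documented parameter names; it is not the cnlpt class, and the token IDs and the pad ID of 0 are assumed values.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Union


@dataclass
class ToyInputFeatures:
    """Hypothetical stand-in mirroring the documented fields; not the cnlpt class."""

    input_ids: List[int]
    attention_mask: Optional[List[int]] = None
    token_type_ids: Optional[List[int]] = None
    event_tokens: Optional[List[int]] = None
    label: Optional[Union[int, float]] = None

    def to_json_string(self) -> str:
        # Serialize every field, including unset (None) ones, as one JSON object.
        return json.dumps(asdict(self)) + "\n"


# Three real tokens followed by two padding positions (pad ID 0 is an assumption).
features = ToyInputFeatures(
    input_ids=[101, 7592, 102, 0, 0],
    attention_mask=[1, 1, 1, 0, 0],  # 1 = attend, 0 = padding
    label=1,  # int, as for a classification task
)
serialized = features.to_json_string()
restored = json.loads(serialized)
```

Optional fields that were never set come back as null in the JSON, so a round trip preserves which inputs the model will actually receive.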

__init__(input_ids, attention_mask=None, token_type_ids=None, event_tokens=None, label=None)

class cnlpt.cnlp_data.HierarchicalInputFeatures

Bases: object

A single set of features of data for the hierarchical model. Property names are the same names as the corresponding inputs to a model.

Parameters
  • input_ids – Indices of input sequence tokens in the vocabulary.

  • attention_mask – Mask to avoid performing attention on padding token indices. Mask values are either 0 or 1: usually 1 for tokens that are not masked (attended to) and 0 for masked (padding) tokens.

  • token_type_ids – (Optional) Segment token indices to indicate first and second portions of the inputs. Only some models use them.

  • event_tokens – (Optional)

  • label – (Optional) Label corresponding to the input. Int for classification problems, float for regression problems.

to_json_string()

Serializes this instance to a JSON string.

__init__(input_ids, attention_mask=None, token_type_ids=None, event_tokens=None, label=None)

cnlpt.cnlp_data.cnlp_convert_features_to_hierarchical(features, chunk_len, num_chunks, cls_id, sep_id, pad_id, insert_empty_chunk_at_beginning=False)

Chunk an instance of InputFeatures into an instance of HierarchicalInputFeatures for the hierarchical model.

Parameters
  • features (InputFeatures) – the old instance

  • chunk_len (int) – the maximum length of a chunk

  • num_chunks (int) – the maximum number of chunks in the instance

  • cls_id (int) – the tokenizer’s ID representing the CLS token

  • sep_id (int) – the tokenizer’s ID representing the SEP token

  • pad_id (int) – the tokenizer’s ID representing the PAD token

  • insert_empty_chunk_at_beginning (bool) – whether to insert an empty chunk at the beginning of the instance

Return type

HierarchicalInputFeatures

Returns

an instance of HierarchicalInputFeatures containing the chunked instance
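To make the roles of chunk_len, num_chunks, and the special-token IDs concrete, the following is a rough sketch of one plausible chunking scheme. It is not the cnlpt implementation: the helper name chunk_input_ids is hypothetical, it operates on a plain list of token IDs rather than an InputFeatures instance, and the choices to wrap each chunk in CLS and SEP and to pad missing chunks with pad_id are assumptions.

```python
from typing import List


def chunk_input_ids(
    input_ids: List[int],
    chunk_len: int,
    num_chunks: int,
    cls_id: int,
    sep_id: int,
    pad_id: int,
    insert_empty_chunk_at_beginning: bool = False,
) -> List[List[int]]:
    """Illustrative chunking only; the real function may differ in detail."""
    body_len = chunk_len - 2  # room left in a chunk after CLS and SEP
    if body_len < 1:
        raise ValueError("chunk_len must be at least 3")
    # Drop special and padding tokens from the flat sequence before re-chunking.
    tokens = [t for t in input_ids if t not in (cls_id, sep_id, pad_id)]
    chunks: List[List[int]] = []
    if insert_empty_chunk_at_beginning:
        chunks.append([cls_id, sep_id])
    for start in range(0, len(tokens), body_len):
        chunks.append([cls_id] + tokens[start:start + body_len] + [sep_id])
    # Keep at most num_chunks, pad each chunk to chunk_len,
    # then pad the chunk list itself with all-padding chunks.
    chunks = chunks[:num_chunks]
    chunks = [c + [pad_id] * (chunk_len - len(c)) for c in chunks]
    while len(chunks) < num_chunks:
        chunks.append([pad_id] * chunk_len)
    return chunks


flat = [101, 11, 12, 13, 14, 15, 102, 0, 0]
chunks = chunk_input_ids(
    flat, chunk_len=5, num_chunks=3, cls_id=101, sep_id=102, pad_id=0
)
with_empty = chunk_input_ids(
    flat, chunk_len=5, num_chunks=3, cls_id=101, sep_id=102, pad_id=0,
    insert_empty_chunk_at_beginning=True,
)
```

In this sketch the output always has exactly num_chunks rows of chunk_len IDs each, so a batch of instances can be stacked into a fixed-shape tensor.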

cnlpt.cnlp_data.cnlp_convert_examples_to_features(examples, tokenizer, max_length=None, task=None, label_list=None, output_mode=None, token_classify=False, inference=False, hierarchical=False, chunk_len=-1, num_chunks=-1, insert_empty_chunk_at_beginning=False, truncate_examples=False)

Processes the list of transformers.InputExample generated by the processor defined in cnlpt.cnlp_processors.cnlp_processors and converts the examples into a list of InputFeatures or HierarchicalInputFeatures, depending on the model.

Parameters
  • examples (List[transformers.data.processors.utils.InputExample]) – the list of examples to convert

  • tokenizer (transformers.tokenization_utils.PreTrainedTokenizer) – the tokenizer

  • max_length (Optional[int]) – the maximum sequence length at which to truncate examples

  • task (str) – the task name

  • label_list (Optional[List[str]]) – the list of labels for this task. If not provided explicitly, it will be retrieved from the processor with transformers.DataProcessor.get_labels().

  • output_mode (Optional[str]) – the output mode for this task. If not provided explicitly, it will be retrieved from cnlpt.cnlp_processors.cnlp_output_modes.

  • token_classify (bool) – TODO define

  • inference (bool) – TODO define

  • hierarchical (bool) – whether to structure the data for the hierarchical model (cnlpt.HierarchicalTransformer.HierarchicalModel)

  • chunk_len (int) – for the hierarchical model, the length of each chunk in tokens

  • num_chunks (int) – for the hierarchical model, the number of chunks

  • insert_empty_chunk_at_beginning (bool) – for the hierarchical model, whether to insert an empty chunk at the beginning of the list of chunks (equivalent in theory to a CLS chunk).

  • truncate_examples (bool) – whether to truncate the string representation of the example instances printed to the log

Return type

Union[List[InputFeatures], List[HierarchicalInputFeatures]]

Returns

the list of converted input features
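The interaction between label_list and output_mode can be sketched as follows. This is an assumption-laden illustration in the style of GLUE-like converters, not the cnlpt code: the helper convert_label, the example label names, and the mode strings "classification" and "regression" are all hypothetical.

```python
from typing import Dict, List, Optional, Union


def convert_label(
    raw_label: Optional[str],
    label_map: Dict[str, int],
    output_mode: str,
) -> Optional[Union[int, float]]:
    """Map a raw string label to an int (classification) or float (regression)."""
    if raw_label is None:
        # Unlabeled example, for instance at inference time.
        return None
    if output_mode == "classification":
        return label_map[raw_label]
    if output_mode == "regression":
        return float(raw_label)
    raise ValueError(f"unsupported output mode: {output_mode}")


# Hypothetical label list; the position in the list becomes the class index.
label_list: List[str] = ["negative", "positive"]
label_map = {label: index for index, label in enumerate(label_list)}

class_label = convert_label("positive", label_map, "classification")
score_label = convert_label("2.5", label_map, "regression")
missing_label = convert_label(None, label_map, "classification")
```

Because the class index depends on the order of label_list, the same list has to be used at training and prediction time for the indices to mean the same thing.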

cnlpt.cnlp_data.truncate_features(feature)

Produces a truncated string representation of a feature.

Parameters

feature (Union[InputFeatures, HierarchicalInputFeatures]) – the feature to represent

Return type

str

Returns

the truncated representation of the feature
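The exact output format of truncate_features is not specified here. Purely as an illustration of the idea, the sketch below shortens long ID lists before logging; the helper truncate_repr, its max_items parameter, and the output layout are all hypothetical.

```python
from typing import Any, Dict


def truncate_repr(fields: Dict[str, Any], max_items: int = 3) -> str:
    """Render fields as text, showing only the first max_items of long lists."""
    parts = []
    for name, value in fields.items():
        if isinstance(value, list) and len(value) > max_items:
            shown = ", ".join(str(v) for v in value[:max_items])
            parts.append(f"{name}=[{shown}, ... ({len(value)} total)]")
        else:
            parts.append(f"{name}={value!r}")
    return ", ".join(parts)


text = truncate_repr({"input_ids": [101, 7592, 2088, 102, 0, 0], "label": None})
```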

class cnlpt.cnlp_data.DataTrainingArguments

Bases: object

Arguments describing the data that is fed to the model for training and evaluation.

Using transformers.HfArgumentParser we can turn this class into argparse arguments to be able to specify them on the command line.
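The idea of deriving command-line arguments from a dataclass can be sketched with the standard library alone. The code below is a simplified, hand-rolled stand-in for what transformers.HfArgumentParser automates, not its implementation; ToyDataArguments is a hypothetical subset of the documented fields, and build_parser is an illustrative helper.

```python
import argparse
from dataclasses import MISSING, dataclass, fields


@dataclass
class ToyDataArguments:
    """Hypothetical subset of the documented fields."""

    data_dir: str
    max_seq_length: int = 128
    overwrite_cache: bool = False


def build_parser(cls) -> argparse.ArgumentParser:
    """Create one --flag per dataclass field, using its type and default."""
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        flag = f"--{f.name}"
        if f.type is bool:
            # Boolean fields become switches that are False unless given.
            parser.add_argument(flag, action="store_true", default=f.default)
        elif f.default is MISSING:
            # Fields without a default must be supplied on the command line.
            parser.add_argument(flag, type=f.type, required=True)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return parser


namespace = build_parser(ToyDataArguments).parse_args(
    ["--data_dir", "/tmp/data", "--max_seq_length", "256"]
)
args = ToyDataArguments(**vars(namespace))
```

Keeping the defaults in the dataclass means the command-line help, the parsed values, and the programmatic constructor all share a single definition.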

__init__(data_dir, task_name=<factory>, max_seq_length=128, overwrite_cache=False, weight_classes=False, chunk_len=None, num_chunks=None, insert_empty_chunk_at_beginning=False, truncate_examples=False)

class cnlpt.cnlp_data.ClinicalNlpDataset

Bases: Dataset

Adapted from GlueDataset, with the GLUE task-specific code changed; it is included here so that this module is self-contained.

__init__(args, tokenizer, limit_length=None, mode=Split.train, cache_dir=None, hierarchical=False)

get_labels()

Retrieve the label lists for all the tasks for the dataset.

Return type

List[List[str]]

Returns

the list of label lists