cnlpt.cnlp_processors module¶

Module containing processor classes, evaluation metrics, and output modes for tasks defined in the library.

Add custom classes here to add new tasks to the library with the following steps:

Create a unique task_name for your task.
cnlp_output_modes – Add a mapping from a task name to a task type. Currently supported task types are sentence classification, tagging, relation extraction, and multi-task sentence classification.
Processor class – Create a subclass of transformers.DataProcessor for your data source. There are multiple examples to build on, including intermediate abstractions such as LabeledSentenceProcessor, RelationProcessor, and SequenceProcessor, that simplify the implementation.
cnlp_processors – Add a mapping from your task name to the “processor” class you created in the last step.
(Optional) – Modify cnlp_compute_metrics() to add your task. If your task is a classification task, a reasonable default will be used, so this step is optional.
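
The steps above can be sketched as follows. This is a minimal illustration rather than the library's own code: `CnlpProcessorStub`, `MyTaskProcessor`, the task name `my_task`, its labels, and the output-mode string `"classification"` are all hypothetical, and two plain dicts stand in for the module-level `cnlp_processors` and `cnlp_output_modes` so that the snippet runs without the library installed.

```python
from typing import Dict, List, Type


class CnlpProcessorStub:
    """Stand-in for a processor base class (hypothetical, for illustration only)."""

    def get_labels(self) -> List[str]:
        raise NotImplementedError


class MyTaskProcessor(CnlpProcessorStub):
    """A processor for the hypothetical task 'my_task'."""

    def get_labels(self) -> List[str]:
        return ["negative", "positive"]


# Stand-ins for the module-level registries described on this page.
cnlp_processors: Dict[str, Type[CnlpProcessorStub]] = {}
cnlp_output_modes: Dict[str, str] = {}

task_name = "my_task"  # a unique task name (assumed)
cnlp_output_modes[task_name] = "classification"  # assumed mode string
cnlp_processors[task_name] = MyTaskProcessor  # register the processor class

processor = cnlp_processors[task_name]()
print(processor.get_labels())
```

In the real module the two mappings already exist, so adding a task would presumably mean adding one entry to each rather than creating new dicts.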

cnlpt.cnlp_processors.cnlp_processors¶

Mapping from task names to processor classes

Type: Dict[str, transformers.DataProcessor]

cnlpt.cnlp_processors.cnlp_output_modes¶

Mapping from task names to output modes

Type: Dict[str, str]

cnlpt.cnlp_processors.tagging_metrics(task_name, preds, labels)¶

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for sequence tagging tasks.

Ignores tags for which the true label is -100.

The returned dict is structured as follows:

{
    'acc': accuracy
    'token_f1': token-wise F1 score
    'f1': seqeval F1 score
    'report': seqeval classification report
}

Parameters

task_name (str) – the task name used to index into cnlp_processors
preds (numpy.ndarray) – the predicted labels from the model
labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
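
The masking rule described above (positions whose true label is -100 are not scored) can be illustrated with a small pure-Python sketch. It is not the library's implementation, which works on numpy arrays and also reports seqeval scores; `masked_token_accuracy` and the sample data are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value that marks positions to skip


def masked_token_accuracy(preds: List[List[int]], labels: List[List[int]]) -> float:
    """Accuracy over positions whose true label is not IGNORE_INDEX."""
    correct = 0
    total = 0
    for pred_seq, label_seq in zip(preds, labels):
        for pred, label in zip(pred_seq, label_seq):
            if label == IGNORE_INDEX:
                continue  # masked position: not scored
            total += 1
            correct += int(pred == label)
    return correct / total if total else 0.0


preds = [[0, 1, 1, 0], [2, 2, 0, 0]]
labels = [[0, 1, 0, -100], [2, 1, -100, -100]]
accuracy = masked_token_accuracy(preds, labels)
print(accuracy)  # 3 of the 5 scored positions match
```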

cnlpt.cnlp_processors.relation_metrics(task_name, preds, labels)¶

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for relation extraction tasks.

Ignores tags for which the true label is -100.

The returned dict is structured as follows:

{
    'f1': F1 score
    'acc': accuracy
    'recall': recall
    'precision': precision
}

Parameters

task_name (str) – the task name used to index into cnlp_processors
preds (numpy.ndarray) – the predicted labels from the model
labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics

cnlpt.cnlp_processors.acc_and_f1(preds, labels)¶

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for generic tasks.

The returned dict is structured as follows:

{
    'acc': accuracy
    'f1': F1 score
    'acc_and_f1': mean of accuracy and F1 score
    'recall': recall
    'precision': precision
}

Parameters

preds (numpy.ndarray) – the predicted labels from the model
labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
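
To make the keys of the returned dict concrete, here is a minimal pure-Python sketch for the binary case. It is an illustration only: the averaging strategy the library uses for multi-class F1 is not stated on this page, and `binary_acc_and_f1` is a hypothetical name.

```python
from typing import Dict, List


def binary_acc_and_f1(preds: List[int], labels: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    acc = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,  # mean of accuracy and F1
        "recall": recall,
        "precision": precision,
    }


scores = binary_acc_and_f1([1, 0, 1, 1], [1, 0, 0, 1])
print(scores)
```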

cnlpt.cnlp_processors.cnlp_compute_metrics(task_name, preds, labels)¶

Function that defines and computes the metrics used for each task.

When adding a task definition to this file, add a branch to this function defining what its evaluation metric invocation should be. If the new task is a simple classification task, a sensible default is defined; falling back on this will trigger a warning.

Parameters

task_name (str) – the task name used to index into cnlp_processors
preds (numpy.ndarray) – the predicted labels from the model
labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
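
The branch-per-task pattern with a warning on fallback might look roughly like the sketch below. The task name `my_tagging_task`, the helper `simple_accuracy`, and the function `compute_metrics_sketch` are hypothetical; the real function's branches and default metric are defined in the module source.

```python
import warnings
from typing import Any, Dict, List


def simple_accuracy(preds: List[int], labels: List[int]) -> float:
    """Fraction of positions where the prediction equals the label."""
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def compute_metrics_sketch(task_name: str, preds: List[int], labels: List[int]) -> Dict[str, Any]:
    """Pick a metric by task name; unknown tasks fall back to a default."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if task_name == "my_tagging_task":  # hypothetical task-specific branch
        return {"token_acc": simple_accuracy(preds, labels)}
    # Fallback branch: warn, then use a generic classification default.
    warnings.warn(f"No metric defined for {task_name!r}; using default accuracy")
    return {"acc": simple_accuracy(preds, labels)}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = compute_metrics_sketch("brand_new_task", [1, 0, 1], [1, 1, 1])

print(result, len(caught))
```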

class cnlpt.cnlp_processors.CnlpProcessor¶

Bases: DataProcessor

Base class for single-task dataset processors

Parameters: downsampling (Optional[Dict[str, float]]) – downsampling values for class balance

__init__(downsampling=None)¶

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).
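
As an illustration of how such a score could drive model selection, the sketch below picks the best epoch from a list of per-epoch metric dicts. `F1SelectingProcessor` and the metric values are hypothetical, and which metric a given processor actually returns is not stated on this page.

```python
from typing import Any, Dict, List


class F1SelectingProcessor:
    """Hypothetical processor that ranks epochs by the 'f1' metric."""

    def get_one_score(self, results: Dict[str, Any]) -> float:
        return float(results["f1"])


# Made-up evaluation results for three epochs.
epoch_results: List[Dict[str, Any]] = [
    {"acc": 0.81, "f1": 0.62},
    {"acc": 0.80, "f1": 0.70},
    {"acc": 0.83, "f1": 0.66},
]

processor = F1SelectingProcessor()
# The returned values only need to be orderable, so max() can compare them.
best_epoch = max(
    range(len(epoch_results)),
    key=lambda i: processor.get_one_score(epoch_results[i]),
)
print(best_epoch)
```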

get_example_from_tensor_dict(tensor_dict)¶

Gets an example from a dict with tensorflow tensors.

Parameters: tensor_dict – Keys and values should match the corresponding Glue tensorflow_dataset examples.

get_train_examples(data_dir)¶: Gets a collection of [InputExample] for the train set.

get_dev_examples(data_dir)¶: Gets a collection of [InputExample] for the dev set.

get_test_examples(data_dir)¶: Gets a collection of [InputExample] for the test set.

_create_examples(lines, set_type, sequence=False, relations=False)¶

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev and test sets from a headerless TSV file with one of the following structures:

For sequence classification:
```
label       text
```
For sequence tagging:
```
tag1 tag2 ... tagN  text
```

For relation tagging:

<source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text

TODO: check that these formats are correct
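
For the sequence classification layout (one label and one text field per line), reading such a file could be sketched as below. Given the TODO above, the exact format should be treated as unverified; `read_label_text_rows` and the sample rows are hypothetical and do not reproduce the library's parsing code.

```python
import csv
import io
from typing import List, Tuple


def read_label_text_rows(tsv_text: str) -> List[Tuple[str, str]]:
    """Parse headerless, tab-separated 'label<TAB>text' lines."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows: List[Tuple[str, str]] = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {len(row)}")
        label, text = row
        rows.append((label, text))
    return rows


# Made-up sample data; the labels are placeholders.
sample = "1\tPatient denies chest pain .\n-1\tPatient reports chest pain .\n"
rows = read_label_text_rows(sample)
print(rows)
```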

class cnlpt.cnlp_processors.LabeledSentenceProcessor¶

Bases: CnlpProcessor

Base class for labeled sentence dataset processors

_create_examples(lines, set_type)¶

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev and test sets from a headerless TSV file with one of the following structures:

For sequence classification:
```
label       text
```
For sequence tagging:
```
tag1 tag2 ... tagN  text
```

For relation tagging:

<source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text

TODO: check that these formats are correct

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.NegationProcessor¶

Bases: LabeledSentenceProcessor

Processor for the negation datasets

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.UncertaintyProcessor¶

Bases: LabeledSentenceProcessor

Processor for the uncertainty datasets

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.HistoryProcessor¶

Bases: LabeledSentenceProcessor

Processor for the history datasets

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.DtrProcessor¶

Bases: LabeledSentenceProcessor

Processor for DocTimeRel datasets

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.AlinkxProcessor¶

Bases: LabeledSentenceProcessor

Processor for a THYME ALINK dataset (links that describe a change in the temporal status of an event). In the classifier version of the task, given an event known to have some aspectual status, the goal is to label that status.

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.AlinkProcessor¶

Bases: LabeledSentenceProcessor

Processor for a THYME ALINK dataset (links that describe a change in the temporal status of an event). In the classifier version of the task, given an event known to have some aspectual status, the goal is to label that status.

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.ContainsProcessor¶

Bases: LabeledSentenceProcessor

Processor for the narrative container relation (THYME). Describes the contains relation status between the two highlighted temporal entities (event or timex). NONE - no relation, CONTAINS - arg 1 contains arg 2, CONTAINS-1 - arg 2 contains arg 1.

get_labels()¶: Gets the list of labels for this data set.

class cnlpt.cnlp_processors.TlinkProcessor¶

Bases: LabeledSentenceProcessor

Processor for the narrative container relation (THYME). Describes the contains relation status between the two highlighted temporal entities (event or timex). NONE - no relation, CONTAINS - arg 1 contains arg 2, CONTAINS-1 - arg 2 contains arg 1.

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.TimeCatProcessor¶

Bases: LabeledSentenceProcessor

Processor for a THYME time expression dataset. In the classifier version of the task, given a time class, the goal is to label its time category (see labels below).

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.ContextualModalityProcessor¶

Bases: LabeledSentenceProcessor

Processor for a contextual modality dataset

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.UciDrugSentimentProcessor¶

Bases: LabeledSentenceProcessor

Processor for the UCI Drug Review sentiment classification dataset

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.Mimic_7_Processor¶

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.Mimic_3_Processor¶

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.CovidProcessor¶

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()¶: Gets the list of labels for this data set.

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.RelationProcessor¶

Bases: CnlpProcessor

Base class for relation extraction dataset processors

_create_examples(lines, set_type)¶

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev and test sets from a headerless TSV file with one of the following structures:

For sequence classification:
```
label       text
```
For sequence tagging:
```
tag1 tag2 ... tagN  text
```

For relation tagging:

<source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text

TODO: check that these formats are correct

class cnlpt.cnlp_processors.TlinkRelationProcessor¶

Bases: RelationProcessor

TODO: docstring

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()¶: Gets the list of labels for this data set.

class cnlpt.cnlp_processors.SequenceProcessor¶

Bases: CnlpProcessor

Base class for sequence tagging dataset processors

_create_examples(lines, set_type)¶

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev and test sets from a headerless TSV file with one of the following structures:

For sequence classification:
```
label       text
```
For sequence tagging:
```
tag1 tag2 ... tagN  text
```

For relation tagging:

<source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text

TODO: check that these formats are correct

class cnlpt.cnlp_processors.TimexProcessor¶

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()¶: Gets the list of labels for this data set.

class cnlpt.cnlp_processors.EventProcessor¶

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()¶: Gets the list of labels for this data set.

class cnlpt.cnlp_processors.DpheProcessor¶

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)¶

Return a single value to use as the score for selecting the best model epoch after training.

Parameters: results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch
Returns: a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()¶: Gets the list of labels for this data set.

class cnlpt.cnlp_processors.MTLClassifierProcessor¶

Bases: DataProcessor

Base class for multi-task learning classification dataset processors

get_classifiers()¶

Get the list of classification subtasks in this multi-task setting

Return type: List[str]
Returns: a list of task names

get_num_tasks()¶

Get the number of subtasks in this multi-task setting.

Equivalent to len(self.get_classifiers()).

Return type: int
Returns: the number of subtasks

get_classifier_id()¶

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type: str
Returns: the value of the classifier ID

get_default_label()¶

Get the default label to assign to unlabeled instances in the dataset.

Return type: str
Returns: the value of the default label
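
A subclass might implement these methods along the lines of the sketch below. The class `MtlProcessorSketch`, the subtask names, the label `"Unlabeled"`, and the helper `complete_labels` are all hypothetical; in particular, how the library applies the default label internally is not shown on this page, so the helper is only one plausible use.

```python
from typing import Dict, List


class MtlProcessorSketch:
    """Stand-in for a multi-task processor (illustration only)."""

    def get_classifiers(self) -> List[str]:
        return ["asthma", "obesity"]  # hypothetical subtask names

    def get_num_tasks(self) -> int:
        # Equivalent to len(self.get_classifiers()), as documented above.
        return len(self.get_classifiers())

    def get_default_label(self) -> str:
        return "Unlabeled"  # hypothetical default label

    def complete_labels(self, labels: Dict[str, str]) -> Dict[str, str]:
        """Hypothetical helper: give every subtask a label, using the default for gaps."""
        default = self.get_default_label()
        return {task: labels.get(task, default) for task in self.get_classifiers()}


processor = MtlProcessorSketch()
filled = processor.complete_labels({"asthma": "Yes"})
print(processor.get_num_tasks(), filled)
```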

get_example_from_tensor_dict(tensor_dict)¶: Not used.

get_train_examples(data_dir)¶: Gets a collection of [InputExample] for the train set.

get_dev_examples(data_dir)¶: Gets a collection of [InputExample] for the dev set.

get_test_examples(data_dir)¶: Gets a collection of [InputExample] for the test set.

_get_json_examples(fn, set_type)¶

This is an internal function, but it is included in the documentation to illustrate the input format for MTL datasets.

Creates examples for the training, dev and test sets from a JSON file with the following structure:

{
    "<guid_1>": {
        "text": "<text>",
        "labels": {
            "<task_1>": "<label>",
            ...
        }
    },
    ...
}

Parameters

fn (str) – the path to the dataset file to load
set_type (str) – the type of split the file contains (e.g. train, dev, test)

Return type

List[transformers.InputExample]

Returns

the examples loaded from the file
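
A file in the structure shown above can be read with the standard `json` module. The sketch below parses a JSON string into simple records; `MtlExample`, `parse_mtl_json`, and the sample document are hypothetical, and the real method returns transformers.InputExample instances built from a file path.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MtlExample:
    """Simplified stand-in for an input example (illustration only)."""

    guid: str
    text: str
    labels: Dict[str, str]


def parse_mtl_json(raw: str) -> List[MtlExample]:
    """Turn {guid: {"text": ..., "labels": {...}}} into a list of records."""
    data = json.loads(raw)
    return [
        MtlExample(guid=guid, text=entry["text"], labels=dict(entry["labels"]))
        for guid, entry in data.items()
    ]


# Made-up sample document following the structure above.
raw = json.dumps(
    {
        "doc_1": {"text": "No acute findings.", "labels": {"task_a": "negative"}},
        "doc_2": {"text": "Small effusion.", "labels": {"task_a": "positive"}},
    }
)
examples = parse_mtl_json(raw)
print(examples[0])
```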

class cnlpt.cnlp_processors.MimicRadiProcessor¶

Bases: MTLClassifierProcessor

TODO: docstring

get_classifiers()¶

Get the list of classification subtasks in this multi-task setting

Return type: List[str]
Returns: a list of task names

get_labels()¶: Gets the list of labels for this data set.

get_classifier_id()¶

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type: str
Returns: the value of the classifier ID

get_default_label()¶

Get the default label to assign to unlabeled instances in the dataset.

Return type: str
Returns: the value of the default label

class cnlpt.cnlp_processors.i2b22008Processor¶

Bases: MTLClassifierProcessor

Processor for the i2b2-2008 disease classification dataset

get_classifiers()¶

Get the list of classification subtasks in this multi-task setting

Return type: List[str]
Returns: a list of task names

get_labels()¶: Gets the list of labels for this data set.

get_default_label()¶

Get the default label to assign to unlabeled instances in the dataset.

Return type: str
Returns: the value of the default label

get_classifier_id()¶

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type: str
Returns: the value of the classifier ID

cnlpt.cnlp_processors.relex = 'relations'¶: Add output modes for new tasks here.