cnlpt.cnlp_processors module

Module containing processor classes, evaluation metrics, and output modes for tasks defined in the library.

To add a new task to the library, add custom classes here by following these steps:

  1. Create a unique task_name for your task.

  2. cnlp_output_modes – Add a mapping from a task name to a task type. Currently supported task types are sentence classification, tagging, relation extraction, and multi-task sentence classification.

  3. Processor class – Create a subclass of transformers.DataProcessor for your data source. Several existing processors can serve as templates, and intermediate abstractions such as LabeledSentenceProcessor, RelationProcessor, and SequenceProcessor simplify the implementation.

  4. cnlp_processors – Add a mapping from your task name to the processor class you created in the previous step.

  5. (Optional) – Modify cnlp_compute_metrics() to add your task. If your task is a classification task, a reasonable default is used, so this step is optional.
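
The steps above might be sketched as follows. This is a self-contained illustration rather than the library's code: DataProcessor here is a minimal stand-in for transformers.DataProcessor, the two dictionaries are local stand-ins for cnlp_processors and cnlp_output_modes, and the task name my_task, the class MyTaskProcessor, its labels, and the output-mode string "classification" are all hypothetical.

```python
from typing import Dict, List, Type


class DataProcessor:
    """Stand-in for transformers.DataProcessor so the sketch runs on its own."""

    def get_labels(self) -> List[str]:
        raise NotImplementedError


# Local stand-ins for the module-level registries described above.
cnlp_processors: Dict[str, Type[DataProcessor]] = {}
cnlp_output_modes: Dict[str, str] = {}

# Step 1: choose a unique task name (hypothetical).
TASK_NAME = "my_task"


# Step 3: define a processor for the data source (hypothetical labels).
class MyTaskProcessor(DataProcessor):
    def get_labels(self) -> List[str]:
        return ["negative", "positive"]


# Step 2: map the task name to a task type (the string value is an assumption).
cnlp_output_modes[TASK_NAME] = "classification"

# Step 4: map the task name to the processor class.
cnlp_processors[TASK_NAME] = MyTaskProcessor

# Look the task up the way a training script might.
processor = cnlp_processors[TASK_NAME]()
print(cnlp_output_modes[TASK_NAME], processor.get_labels())
```

In the real module the registries already exist, so only the new entries and the new class would be added.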

cnlpt.cnlp_processors.cnlp_processors

Mapping from task names to processor classes

Type

Dict[str, transformers.DataProcessor]

cnlpt.cnlp_processors.cnlp_output_modes

Mapping from task names to output modes

Type

Dict[str, str]

cnlpt.cnlp_processors.tagging_metrics(task_name, preds, labels)

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for sequence tagging tasks.

Ignores tags for which the true label is -100.

The returned dict is structured as follows:

{
    'acc': accuracy
    'token_f1': token-wise F1 score
    'f1': seqeval F1 score
    'report': seqeval classification report
}

Parameters
  • task_name (str) – the task name used to index into cnlp_processors

  • preds (numpy.ndarray) – the predicted labels from the model

  • labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
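
To illustrate what ignoring positions labeled -100 means in practice, here is a minimal standard-library sketch that computes only token accuracy. The function name masked_token_accuracy is hypothetical, and the sketch does not reproduce the seqeval-based scores that tagging_metrics also returns.

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # label value that marks positions to skip


def masked_token_accuracy(
    preds: Sequence[Sequence[int]], labels: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Token accuracy over positions whose true label is not -100."""
    kept_preds: List[int] = []
    kept_labels: List[int] = []
    for pred_row, label_row in zip(preds, labels):
        for pred, label in zip(pred_row, label_row):
            if label != IGNORE_INDEX:
                kept_preds.append(pred)
                kept_labels.append(label)
    correct = sum(p == l for p, l in zip(kept_preds, kept_labels))
    return {"acc": correct / len(kept_labels) if kept_labels else 0.0}


# Two sequences; the -100 positions (e.g. padding) are excluded from scoring.
preds = [[0, 1, 1, 0], [2, 2, 0, 0]]
labels = [[0, 1, 0, -100], [2, 1, -100, -100]]
result = masked_token_accuracy(preds, labels)
print(result)  # 3 of the 5 scored positions match
```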

cnlpt.cnlp_processors.relation_metrics(task_name, preds, labels)

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for relation extraction tasks.

Ignores tags for which the true label is -100.

The returned dict is structured as follows:

{
    'f1': F1 score
    'acc': accuracy
    'recall': recall
    'precision': precision
}

Parameters
  • task_name (str) – the task name used to index into cnlp_processors

  • preds (numpy.ndarray) – the predicted labels from the model

  • labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics

cnlpt.cnlp_processors.acc_and_f1(preds, labels)

One of the metrics functions for use in cnlp_compute_metrics().

Generates evaluation metrics for generic tasks.

The returned dict is structured as follows:

{
    'acc': accuracy
    'f1': F1 score
    'acc_and_f1': mean of accuracy and F1 score
    'recall': recall
    'precision': precision
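}

Parameters
  • preds (numpy.ndarray) – the predicted labels from the model

  • labels (numpy.ndarray) – the true labels

Return type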

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
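
The structure of the returned dictionary can be illustrated with a small standard-library sketch for binary labels. The function name acc_and_f1_sketch is hypothetical, and the library's own implementation may differ, for instance in how it averages scores over more than two classes.

```python
from typing import Dict, Sequence


def acc_and_f1_sketch(preds: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Binary-label illustration of the documented result structure."""
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and l == 1 for p, l in pairs)  # true positives
    fp = sum(p == 1 and l == 0 for p, l in pairs)  # false positives
    fn = sum(p == 0 and l == 1 for p, l in pairs)  # false negatives
    acc = sum(p == l for p, l in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "acc": acc,
        "f1": f1,
        "acc_and_f1": (acc + f1) / 2,
        "recall": recall,
        "precision": precision,
    }


scores = acc_and_f1_sketch(preds=[1, 0, 1, 1], labels=[1, 0, 0, 1])
print(scores)
```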

cnlpt.cnlp_processors.cnlp_compute_metrics(task_name, preds, labels)

Function that defines and computes the metrics used for each task.

When adding a task definition to this file, add a branch to this function that specifies how the task's evaluation metrics are computed. If the new task is a simple classification task, a sensible default is available; falling back on it triggers a warning.

Parameters
  • task_name (str) – the task name used to index into cnlp_processors

  • preds (numpy.ndarray) – the predicted labels from the model

  • labels (numpy.ndarray) – the true labels

Return type

Dict[str, Any]

Returns

a dictionary containing evaluation metrics
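
The branch-plus-fallback pattern described above could look roughly like the sketch below. It is an assumption about the general shape only: compute_metrics_sketch, default_metrics, and the task name my_task are hypothetical, and the default here reports accuracy alone.

```python
import warnings
from typing import Any, Dict, Sequence


def default_metrics(preds: Sequence[int], labels: Sequence[int]) -> Dict[str, Any]:
    """Fallback used for tasks without a dedicated branch (accuracy only here)."""
    correct = sum(p == l for p, l in zip(preds, labels))
    return {"acc": correct / len(labels)}


def compute_metrics_sketch(
    task_name: str, preds: Sequence[int], labels: Sequence[int]
) -> Dict[str, Any]:
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    if task_name == "my_task":  # hypothetical task with its own branch
        metrics = default_metrics(preds, labels)
        metrics["n"] = len(labels)
        return metrics
    # No dedicated branch: warn, then fall back on the default.
    warnings.warn(f"No metric defined for {task_name!r}; using the default.")
    return default_metrics(preds, labels)


# Capture the warning raised for a task that has no branch of its own.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = compute_metrics_sketch("unregistered_task", [1, 0, 1], [1, 1, 1])
print(fallback, len(caught))
```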

class cnlpt.cnlp_processors.CnlpProcessor

Bases: DataProcessor

Base class for single-task dataset processors

Parameters

downsampling (Optional[Dict[str, float]]) – downsampling values for class balance

__init__(downsampling=None)
get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).
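
As a rough illustration of how such a score could drive model selection, the sketch below picks the best epoch from a list of per-epoch metric dictionaries. F1ScoreProcessor and best_epoch are hypothetical names, and choosing 'f1' as the score is an assumption; each real processor defines its own get_one_score().

```python
from typing import Any, Dict, List


class F1ScoreProcessor:
    """Hypothetical processor that scores an epoch by its 'f1' entry."""

    def get_one_score(self, results: Dict[str, Any]) -> float:
        return results["f1"]


def best_epoch(processor: F1ScoreProcessor, history: List[Dict[str, Any]]) -> int:
    """Index of the epoch with the highest score (the earliest wins ties)."""
    scores = [processor.get_one_score(results) for results in history]
    return max(range(len(scores)), key=scores.__getitem__)


history = [
    {"acc": 0.80, "f1": 0.70},
    {"acc": 0.78, "f1": 0.75},
    {"acc": 0.82, "f1": 0.73},
]
chosen = best_epoch(F1ScoreProcessor(), history)
print(chosen)
```

Because only ordering matters, any comparable return type works with max(), which is why the score need not be a float.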

get_example_from_tensor_dict(tensor_dict)

Gets an example from a dict with tensorflow tensors.

Parameters

tensor_dict – Keys and values should match the corresponding Glue tensorflow_dataset examples.

get_train_examples(data_dir)

Gets a collection of [InputExample] for the train set.

get_dev_examples(data_dir)

Gets a collection of [InputExample] for the dev set.

get_test_examples(data_dir)

Gets a collection of [InputExample] for the test set.

_create_examples(lines, set_type, sequence=False, relations=False)

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev, and test sets from a headerless TSV file with one of the following structures:

  • For sequence classification:

    label       text
    
  • For sequence tagging:

    tag1 tag2 ... tagN  text
    
  • For relation tagging:

    <source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text
    

TODO: check that these formats are correct
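
Taking the formats listed above at face value (the TODO notes they are unverified), reading the classification and tagging layouts could be sketched as follows. The helper names and sample rows are hypothetical, and the sketch assumes a single tab separates the label or tag column from the text.

```python
import csv
import io
from typing import List, Tuple


def read_classification_rows(tsv_text: str) -> List[Tuple[str, str]]:
    """Parse `label<TAB>text` rows from a TSV with no header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if row]


def read_tagging_rows(tsv_text: str) -> List[Tuple[List[str], str]]:
    """Parse `tag1 tag2 ... tagN<TAB>text` rows; tags are space-separated."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0].split(), row[1]) for row in reader if row]


classification_rows = read_classification_rows(
    "1\tPatient denies chest pain .\n-1\tPatient reports fever .\n"
)
tagging_rows = read_tagging_rows("O B-TIMEX I-TIMEX O\tsince last week .\n")
print(classification_rows)
print(tagging_rows)
```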

class cnlpt.cnlp_processors.LabeledSentenceProcessor

Bases: CnlpProcessor

Base class for labeled sentence dataset processors

_create_examples(lines, set_type)

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev, and test sets from a headerless TSV file with one of the following structures:

  • For sequence classification:

    label       text
    
  • For sequence tagging:

    tag1 tag2 ... tagN  text
    
  • For relation tagging:

    <source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text
    

TODO: check that these formats are correct

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.NegationProcessor

Bases: LabeledSentenceProcessor

Processor for the negation datasets

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.UncertaintyProcessor

Bases: LabeledSentenceProcessor

Processor for the uncertainty datasets

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.HistoryProcessor

Bases: LabeledSentenceProcessor

Processor for the history datasets

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.DtrProcessor

Bases: LabeledSentenceProcessor

Processor for DocTimeRel datasets

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.AlinkxProcessor

Bases: LabeledSentenceProcessor

Processor for a THYME ALINK dataset (links that describe a change in the temporal status of an event). In the classifier version of the task, the model is given an event known to have some aspectual status and must label that status.

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.AlinkProcessor

Bases: LabeledSentenceProcessor

Processor for a THYME ALINK dataset (links that describe a change in the temporal status of an event). In the classifier version of the task, the model is given an event known to have some aspectual status and must label that status.

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.ContainsProcessor

Bases: LabeledSentenceProcessor

Processor for the narrative container relation (THYME). Describes the contains relation status between the two highlighted temporal entities (event or timex). Labels: NONE (no relation), CONTAINS (arg 1 contains arg 2), CONTAINS-1 (arg 2 contains arg 1).

get_labels()

Gets the list of labels for this data set.

class cnlpt.cnlp_processors.TlinkProcessor

Bases: LabeledSentenceProcessor

Processor for the narrative container relation (THYME). Describes the contains relation status between the two highlighted temporal entities (event or timex). Labels: NONE (no relation), CONTAINS (arg 1 contains arg 2), CONTAINS-1 (arg 2 contains arg 1).

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.TimeCatProcessor

Bases: LabeledSentenceProcessor

Processor for a THYME time expression dataset. In the classifier version of the task, the model is given a time class and must label its time category (see labels below).

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.ContextualModalityProcessor

Bases: LabeledSentenceProcessor

Processor for a contextual modality dataset

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.UciDrugSentimentProcessor

Bases: LabeledSentenceProcessor

Processor for the UCI Drug Review sentiment classification dataset

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.Mimic_7_Processor

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.Mimic_3_Processor

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.CovidProcessor

Bases: LabeledSentenceProcessor

TODO: docstring

get_labels()

Gets the list of labels for this data set.

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

class cnlpt.cnlp_processors.RelationProcessor

Bases: CnlpProcessor

Base class for relation extraction dataset processors

_create_examples(lines, set_type)

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev, and test sets from a headerless TSV file with one of the following structures:

  • For sequence classification:

    label       text
    
  • For sequence tagging:

    tag1 tag2 ... tagN  text
    
  • For relation tagging:

    <source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text
    

TODO: check that these formats are correct

class cnlpt.cnlp_processors.TlinkRelationProcessor

Bases: RelationProcessor

TODO: docstring

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()

Gets the list of labels for this data set.

class cnlpt.cnlp_processors.SequenceProcessor

Bases: CnlpProcessor

Base class for sequence tagging dataset processors

_create_examples(lines, set_type)

This is an internal function, but it is included in the documentation to illustrate the input format for single-task datasets.

Creates examples for the training, dev, and test sets from a headerless TSV file with one of the following structures:

  • For sequence classification:

    label       text
    
  • For sequence tagging:

    tag1 tag2 ... tagN  text
    
  • For relation tagging:

    <source1,target1> , <source2,target2> , ... , <sourceN,targetN>     text
    

TODO: check that these formats are correct

class cnlpt.cnlp_processors.TimexProcessor

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()

Gets the list of labels for this data set.

class cnlpt.cnlp_processors.EventProcessor

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()

Gets the list of labels for this data set.

class cnlpt.cnlp_processors.DpheProcessor

Bases: SequenceProcessor

TODO: docstring

get_one_score(results)

Return a single value to use as the score for selecting the best model epoch after training.

Parameters

results (Dict[str, Any]) – the dictionary of evaluation metrics for the current epoch

Returns

a single value; it needs to be of a type that can be ordered (preferably, but not necessarily, a float).

get_labels()

Gets the list of labels for this data set.

class cnlpt.cnlp_processors.MTLClassifierProcessor

Bases: DataProcessor

Base class for multi-task learning classification dataset processors

get_classifiers()

Get the list of classification subtasks in this multi-task setting.

Return type

List[str]

Returns

a list of task names

get_num_tasks()

Get the number of subtasks in this multi-task setting.

Equivalent to len(self.get_classifiers()).

Return type

int

Returns

the number of subtasks

get_classifier_id()

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type

str

Returns

the value of the classifier ID

get_default_label()

Get the default label to assign to unlabeled instances in the dataset.

Return type

str

Returns

the value of the default label

get_example_from_tensor_dict(tensor_dict)

Not used.

get_train_examples(data_dir)

Gets a collection of [InputExample] for the train set.

get_dev_examples(data_dir)

Gets a collection of [InputExample] for the dev set.

get_test_examples(data_dir)

Gets a collection of [InputExample] for the test set.

_get_json_examples(fn, set_type)

This is an internal function, but it is included in the documentation to illustrate the input format for MTL datasets.

Creates examples for the training, dev and test sets from a JSON file with the following structure:

{
    "<guid_1>": {
        "text": "<text>",
        "labels": {
            "<task_1>": "<label>",
            ...
        }
    },
    ...
}

Parameters
  • fn (str) – the path to the dataset file to load

  • set_type (str) – the type of split the file contains (e.g. train, dev, test)

Return type

List[transformers.InputExample]

Returns

the examples loaded from the file
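
A file with the structure shown above could be read along the lines of the following sketch. It is not the library's implementation: Example is a stand-in for transformers.InputExample, load_json_examples is a hypothetical helper that takes the JSON as a string rather than a path, and both the GUID format and the use of a default label for tasks missing from an entry are assumptions.

```python
import json
from typing import List, NamedTuple


class Example(NamedTuple):
    """Stand-in for transformers.InputExample."""

    guid: str
    text: str
    labels: List[str]


def load_json_examples(
    raw_json: str, set_type: str, tasks: List[str], default_label: str
) -> List[Example]:
    """Build one example per GUID, with one label per task in `tasks` order."""
    data = json.loads(raw_json)
    examples = []
    for guid, entry in data.items():
        task_labels = entry.get("labels", {})
        # Tasks absent from this entry receive the default label (assumption).
        labels = [task_labels.get(task, default_label) for task in tasks]
        examples.append(Example(f"{set_type}-{guid}", entry["text"], labels))
    return examples


raw = json.dumps(
    {
        "doc1": {"text": "No acute findings.", "labels": {"task_a": "N"}},
        "doc2": {"text": "Mild edema.", "labels": {"task_a": "Y", "task_b": "Y"}},
    }
)
examples = load_json_examples(raw, "train", ["task_a", "task_b"], "Unlabeled")
for example in examples:
    print(example)
```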

class cnlpt.cnlp_processors.MimicRadiProcessor

Bases: MTLClassifierProcessor

TODO: docstring

get_classifiers()

Get the list of classification subtasks in this multi-task setting.

Return type

List[str]

Returns

a list of task names

get_labels()

Gets the list of labels for this data set.

get_classifier_id()

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type

str

Returns

the value of the classifier ID

get_default_label()

Get the default label to assign to unlabeled instances in the dataset.

Return type

str

Returns

the value of the default label

class cnlpt.cnlp_processors.i2b22008Processor

Bases: MTLClassifierProcessor

Processor for the i2b2-2008 disease classification dataset

get_classifiers()

Get the list of classification subtasks in this multi-task setting.

Return type

List[str]

Returns

a list of task names

get_labels()

Gets the list of labels for this data set.

get_default_label()

Get the default label to assign to unlabeled instances in the dataset.

Return type

str

Returns

the value of the default label

get_classifier_id()

Get the classifier ID name used in building the GUIDs for the transformers.InputExample instances.

Not necessarily equal to the task_name used as keys for cnlp_processors and cnlp_output_modes.

Return type

str

Returns

the value of the classifier ID

cnlpt.cnlp_processors.relex = 'relations'

Add output modes for new tasks here.