Information Extraction

Covid-19 MLIA Eval

Task Description

The goal of the Information Extraction task is to identify medical information in texts. We defined six major types of entities to be identified, mainly related to Covid-19. The main objective is to mine texts in order to access relevant information concerning Covid-19, and more specifically information that may help health professionals find outcomes.

During the first round of this task, participants will have access only to unannotated data (namely, the data collected from the two other tasks) in plain text format. The evaluation will consist of a ROVER-style combination (voting) of system outputs. We encourage participants to try experimental methods and to submit several system outputs in order to exchange different views during the discussion at the virtual meeting.

To participate in the Information Extraction task, groups need to register at the following link:

Register

Important Dates - Round 2

Round starts: June 14, 2021

Corpora released: June 14, 2021

Runs due from participants (BRAT format): October 15, 2021

Ground-truth released and runs scored: October 22, 2021

Rolling report submission deadline (camera ready): November 19, 2021

Slot for a virtual meeting to discuss the results: November 30-December 2, 2021

Round ends: December 2, 2021

Participation Guidelines

In this information extraction task, participants are expected to identify entities belonging to six categories (the tags to be used in the outputs are listed below, one per line):

drug-trt
sosy-dis
behavior
legal-reg
tests
findings

Unlike traditional named entities, which generally correspond to short spans of text, this task may require both short and long spans to be annotated.

Corpora

Description

For round 2, the training and testing datasets are composed of files available in four languages (English, French, German, and Spanish). Two types of content are provided: scientific abstracts and news files.

Manual annotations have been made on scientific abstracts (DE, EN, ES, FR) and a few news files (EN and FR only); see the statistics below. We plan to release annotations on a few news files for DE and ES in the coming weeks.

Statistics

On the training dataset from round #2:

Participant Repository:

Participants are provided with a single repository for all the tasks they take part in. The repository contains the runs, resources, code, and report of each participant.

The repository is organised as follows:

submission: the runs submitted for each task and round
score: the performance scores returned by the organizers
report: the rolling technical report
code: the code used to produce the runs
resources: any additional (language) resources used or created

Covid-19 MLIA Eval consists of three tasks run in three rounds. Therefore, the submission and score folders are organized into sub-folders for each task and round, i.e. task1, task2, and task3, each in turn split into round1, round2, and round3 (e.g., submission/task1/round1).

Participants who do not take part in a given task or round can simply delete the corresponding sub-folders.

The goal of Covid-19 MLIA Eval is to speed up the creation of multilingual information access systems and (language) resources for Covid-19, and to openly share these systems and resources as much as possible. Therefore, participants are more than encouraged to share their code and any additional (language) resources they have used or created.

All the contents of these repositories are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Task Repository:

Organizers share contents common to all participants through the Information Extraction task repository.

The repository is organised as follows:

topics: the corpora to be processed for each round
ground-truth: the ground truth used to score the runs

Covid-19 MLIA Eval runs in three rounds. Therefore, the topics and ground-truth folders are organized into sub-folders for each round, i.e. round1, round2, and round3.

All the contents of this repository are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Rolling Technical Report:

The rolling technical report should be formatted according to the Springer LNCS format, using either the LaTeX template or the Word template. LaTeX is the preferred format.

Submission Guidelines

We do not give strict rules to annotate the *.txt files except the following one: please try to produce as many spans as expressed ideas (one idea per span; e.g., one drug name, one symptom, one legal regulation, etc.).

All system outputs are expected to be in the BRAT annotation format, i.e., a tabular *.ann file for each *.txt file, composed of three tab-separated columns: (i) an annotation ID, (ii) the category together with the starting and ending offsets, and (iii) the corresponding text span. An example is shown below (please see the sample.{ann,txt} files in the archives to be downloaded):

T1	drug-trt 34 68	Irbesartan Hydrochlorothiazide BMS
T2	sosy-dis 116 125	dizziness
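
For illustration, here is a minimal Python sketch that reads such a file into tuples; the helper name read_ann is our own, and only the sample.ann file is part of the task package:

from pathlib import Path

def read_ann(path):
    """Parse a BRAT *.ann file into (id, category, start, end, text) tuples."""
    annotations = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.startswith("T"):  # keep text-bound annotations only
            continue
        ann_id, type_span, span_text = line.split("\t")
        category, start, end = type_span.split(" ")
        annotations.append((ann_id, category, int(start), int(end), span_text))
    return annotations

for ann in read_ann("sample.ann"):
    print(ann)  # e.g. ('T1', 'drug-trt', 34, 68, 'Irbesartan Hydrochlorothiazide BMS')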

The six categories to be used in those *.ann files are: drug-trt, sosy-dis, behavior, legal-reg, tests, and findings. A seventh category, named other, can be used if a participant considers that a class is missing for useful Covid-19-related information.

Participants are expected to check that their submissions fit all previously described elements: file names, format, entity tags, and correct character offsets.
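
A simple self-check along these lines could be scripted as follows; this is a sketch assuming the id.txt and id.ann files sit together in one folder (the folder name run1 is hypothetical):

from pathlib import Path

ALLOWED = {"drug-trt", "sosy-dis", "behavior", "legal-reg", "tests", "findings", "other"}

def check_run(folder):
    """Verify that each *.ann file matches its *.txt file and uses valid tags and offsets."""
    for ann_path in Path(folder).glob("*.ann"):
        text = ann_path.with_suffix(".txt").read_text(encoding="utf-8")
        for line in ann_path.read_text(encoding="utf-8").splitlines():
            if not line.startswith("T"):
                continue
            ann_id, type_span, span_text = line.split("\t")
            category, start, end = type_span.split(" ")
            assert category in ALLOWED, f"{ann_path}: unknown tag {category}"
            assert text[int(start):int(end)] == span_text, \
                f"{ann_path}: wrong offsets for {ann_id}"

check_run("run1")  # "run1" is a hypothetical folder of id.txt/id.ann pairs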

Participating teams should satisfy the following guidelines:

Submission Upload:

Runs should be uploaded in the repository provided by the organizers. Following the repository structure discussed above, for example, a run submitted for the first round of the Information Extraction task should be included in submission/task1/round1.

Runs are composed of a set of several id.ann files (where id matches the id of the corresponding id.txt file) and should be uploaded as an archive (one archive per run and per language) named with the following naming convention: <teamname>_task1_<round>_<language>_<freefield>.tar.gz where:

<teamname> is the name of the participating team;
<round> is the round, i.e. round1, round2, or round3;
<language> is the language of the run (e.g., en, fr, de, or es);
<freefield> is a free field to distinguish between different runs.
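
For example, a run could be packaged with Python's standard tarfile module as sketched below; all field values are hypothetical placeholders:

import tarfile
from pathlib import Path

# Hypothetical values; replace with your own team name, round, language, and run label.
team, task, rnd, lang, free = "myteam", "task1", "round2", "en", "run1"
archive = f"{team}_{task}_{rnd}_{lang}_{free}.tar.gz"

with tarfile.open(archive, "w:gz") as tar:
    for ann_path in Path("run1").glob("*.ann"):  # one id.ann per id.txt file
        tar.add(ann_path, arcname=ann_path.name)
print(f"wrote {archive}")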

Performance scores for the submitted runs will be returned by the organizers in the score folder, which follows the same structure as the submission folder.

The rolling technical report has to be uploaded and kept up to date in the report folder.

Here, you can find a sample participant repository to get a better idea of its layout.

Evaluation:

For this task, we will use the standard evaluation metrics: recall, precision, and F1-score, as well as macro and micro averages.
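
For intuition, the sketch below computes these metrics under an exact span-matching assumption; the official scoring procedure is defined by the organizers and may differ:

from collections import defaultdict

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def score(gold, pred):
    """gold and pred are sets of (doc_id, category, start, end) tuples."""
    counts = defaultdict(lambda: [0, 0, 0])  # per category: [tp, fp, fn]
    for e in pred:
        counts[e[1]][0 if e in gold else 1] += 1
    for e in gold - pred:
        counts[e[1]][2] += 1
    # Micro average: pool counts over all categories, then compute P/R/F1.
    micro = prf(*(sum(c[i] for c in counts.values()) for i in range(3)))
    # Macro average: compute P/R/F1 per category, then average.
    macro = tuple(sum(v) / len(counts) for v in zip(*(prf(*c) for c in counts.values())))
    return micro, macro  # each is a (precision, recall, F1) triple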

Organizers

Thierry Declerck, DFKI, Germany
declerck@dfki.de

Cyril Grouin, LISN, France
cyril.grouin@lisn.upsaclay.fr

Pierre Zweigenbaum, LISN, France
pz@lisn.upsaclay.fr