In this information extraction task, participants are expected to identify entities belonging to six categories (the tag to be used in the outputs is given at the beginning of each item):
- drug-trt (drug names, treatments, general interventions): this category covers both commercial and generic drug names, as well as general interventions in the health domain; elements of this category usually stem from the advice of a professional (medical doctor, pharmacist) or from self-medication, e.g., Posaconazole AHCL, Allegra, Fexofenadine HCL, Xarelto, quarantine
- sosy-dis (signs, symptoms, diseases): this category deals with medical problems and merges all signs, symptoms, and diseases, e.g., shortness of breath, extreme fatigue, fever, skin infection, weight loss
- findings (findings, efficacy of treatments): this category is more complex, since it covers everything related to the positive or negative effects of treatments, including unexpected effects
- tests (tests): this category covers all tests performed to diagnose medical problems, such as blood sample, physical exam, serological test
- behavior (behaviors, everyday life actions): this category covers actions that anyone may perform in everyday life, such as to wash one's hands, to cough into one's elbow, to self-confine, use of face masks, physical distancing
- legal-reg (legal dispositions, regulations): this category covers all actions decided by local or national authorities (Government, Ministry, etc.), such as to download the employer certificate, list of authorized moves, prolonged border closure, closure of educational institutions
Unlike traditional named entities, which generally cover short spans of text, the entities to be annotated in this task may cover both short and long spans of text.
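As an illustration of how such annotations might be handled in code, the following minimal Python sketch represents an entity as a character span plus one of the six tags. The `Entity` class, the `find_entity` helper, and the example sentence are assumptions made for illustration only; the official submission format is not described in this section.

```python
from dataclasses import dataclass

# The six tags listed in the task description.
TAGS = frozenset(
    {"drug-trt", "sosy-dis", "findings", "tests", "behavior", "legal-reg"}
)


@dataclass(frozen=True)
class Entity:
    """Hypothetical span annotation (not the official output format)."""

    start: int  # inclusive character offset
    end: int    # exclusive character offset
    tag: str

    def __post_init__(self) -> None:
        if self.tag not in TAGS:
            raise ValueError(f"unknown tag: {self.tag!r}")
        if not 0 <= self.start < self.end:
            raise ValueError("span must satisfy 0 <= start < end")


def find_entity(text: str, phrase: str, tag: str) -> Entity:
    """Locate the first occurrence of `phrase` and wrap it as an Entity."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return Entity(start, start + len(phrase), tag)


# Made-up sentence mixing a long span and a short one.
text = "Closure of educational institutions was announced as fever cases rose."
entities = [
    find_entity(text, "Closure of educational institutions", "legal-reg"),
    find_entity(text, "fever", "sosy-dis"),
]

for e in entities:
    print(f"{e.tag}\t{e.start}-{e.end}\t{text[e.start:e.end]}")
```

Storing offsets rather than only the surface string keeps long spans unambiguous when the same words occur more than once in a file.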
Corpora
Description
For round 2, training and testing datasets are composed of files available in four languages (English, French, German, and Spanish). Two types of content are provided:
- news from six websites: deutschland.de, Deutsche Welle, EuroNews, EuroParl, Global Voices, and Global Voices (Covid-19)
- scientific abstracts published on PubMed (queries: Covid-19 bacteria, Covid-19 pfizer, long Covid-19, long Covid-19 asthma). Since abstracts are only available in English, we used a deep neural translation service on those abstracts in order to produce abstracts in French, German, and Spanish.
Manual annotations were made on the scientific abstracts (DE, EN, ES, FR) and on a few news files (EN and FR only); see the statistics below. We plan to release annotations for a few DE and ES news files in the coming weeks.
Statistics
On the training dataset from round #2:
- Number of files:
  - English: 14 annotated PMID files + 39 annotated web files + 625 non-annotated web files
  - French: 14 annotated PMID files + 28 annotated web files + 227 non-annotated web files
  - German: 14 annotated PMID files + 453 non-annotated web files
  - Spanish: 14 annotated PMID files + 403 non-annotated web files
- Number of annotations:
  - English: 1419 entities: 36 behavior, 158 drug-trt, 30 findings, 199 legal-reg, 937 sosy-dis, 59 tests
  - French: 833 entities: 13 behavior, 126 drug-trt, 13 findings, 154 legal-reg, 477 sosy-dis, 50 tests
  - German: 216 entities (PMID only): 38 drug-trt, 10 findings, 135 sosy-dis, 33 tests
  - Spanish: 214 entities (PMID only): 36 drug-trt, 8 findings, 137 sosy-dis, 33 tests
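The annotation counts above can be sanity-checked with a few lines of Python. This is only a sketch: the numbers are copied from the statistics, while the `STATS` dictionary and the two helper functions are names introduced here for illustration.

```python
# Entity counts copied from the round 2 training statistics.
# Each entry maps a language to (stated total, per-tag counts).
STATS = {
    "English": (1419, {"behavior": 36, "drug-trt": 158, "findings": 30,
                       "legal-reg": 199, "sosy-dis": 937, "tests": 59}),
    "French": (833, {"behavior": 13, "drug-trt": 126, "findings": 13,
                     "legal-reg": 154, "sosy-dis": 477, "tests": 50}),
    "German": (216, {"drug-trt": 38, "findings": 10,
                     "sosy-dis": 135, "tests": 33}),
    "Spanish": (214, {"drug-trt": 36, "findings": 8,
                      "sosy-dis": 137, "tests": 33}),
}


def check_totals(stats):
    """Return the languages whose per-tag counts do not sum to the stated total."""
    return [
        lang
        for lang, (total, per_tag) in stats.items()
        if sum(per_tag.values()) != total
    ]


def tag_share(stats, lang, tag):
    """Fraction of a language's annotations that carry the given tag."""
    total, per_tag = stats[lang]
    return per_tag.get(tag, 0) / total


print("languages with mismatching totals:", check_totals(STATS))
for lang in STATS:
    print(f"{lang}: sosy-dis share = {tag_share(STATS, lang, 'sosy-dis'):.1%}")
```

The per-tag counts sum to the stated totals for all four languages, and sosy-dis is the most frequent tag in each of them, so systems trained on this data should expect a skewed tag distribution.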
Participant Repository:
Participants are provided with a single repository for all the tasks they take part in.
The repository contains the runs, resources, code, and report of each participant.
The repository is organised as follows:
- submission: this folder contains the runs submitted for the different tasks in the different evaluation rounds.
- score: this folder contains the performance scores of the submitted runs.
- code: this folder contains the source code of the developed system.
- resource: this folder contains (language) resources created during the participation.
- report: this folder contains the rolling technical report describing the techniques applied and insights gained during participation, round after round.
Covid-19 MLIA Eval consists of three tasks run in three rounds.
Therefore, the submission and score folders are organized into sub-folders for each task and round as follows:
- submission/task1/round2: for the runs submitted to the second round of the first task. The structure is similar for the other tasks and rounds. Participants are encouraged to submit several runs with distinct results.
- score/task1/round2: for the performance scores of the runs submitted to the second round of the first task. The structure is similar for the other tasks and rounds.
Participants who do not take part in a given task or round can simply delete the corresponding sub-folders.
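Should a participant need to recreate this layout locally, a minimal Python sketch could look as follows. The `scaffold` function is hypothetical, and the folder names `task2`, `task3`, `round1`, and `round3` are extrapolated from the `task1/round2` pattern above rather than taken from an official specification.

```python
from pathlib import Path
import tempfile

# Top-level folders of the participant repository.
TOP_LEVEL = ("submission", "score", "code", "resource", "report")
# Only these two are split into task/round sub-folders.
PER_TASK_AND_ROUND = ("submission", "score")


def scaffold(root, tasks=(1, 2, 3), rounds=(1, 2, 3)):
    """Create the folder layout under `root`; return the task/round leaf folders."""
    root = Path(root)
    for name in TOP_LEVEL:
        (root / name).mkdir(parents=True, exist_ok=True)
    created = []
    for name in PER_TASK_AND_ROUND:
        for task in tasks:
            for rnd in rounds:
                leaf = root / name / f"task{task}" / f"round{rnd}"
                leaf.mkdir(parents=True, exist_ok=True)
                created.append(leaf)
    return created


# Example: a participant taking part only in task 1, rounds 1 and 2.
with tempfile.TemporaryDirectory() as tmp:
    for leaf in scaffold(tmp, tasks=(1,), rounds=(1, 2)):
        print(leaf.relative_to(tmp).as_posix())
```

Restricting `tasks` and `rounds` has the same effect as deleting the sub-folders for tasks or rounds in which one does not take part.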
The goal of Covid-19 MLIA Eval is to speed up the creation of multilingual information access systems and (language) resources for Covid-19, and to share these systems and resources as openly as possible. Therefore, participants are strongly encouraged to share their code and any additional (language) resources they have used or created.
All the contents of these repositories are released under the Creative Commons Attribution-ShareAlike 4.0 International License.
Task Repository:
Organizers share contents common to all participants through the Information Extraction task repository.
The repository is organised as follows:
- topics: this folder contains the topics to be used for the task.
- ground-truth: this folder contains the ground-truth, i.e. the qrels, for the task.
- report: this folder contains the rolling technical report describing the overall outcomes of the task, round after round.
Covid-19 MLIA Eval runs in three rounds.
Therefore, the topics and ground-truth folders are organized into sub-folders for each round, i.e. round1, round2, and round3.
All the contents of this repository are released under the Creative Commons Attribution-ShareAlike 4.0 International License.
Rolling Technical Report:
The rolling technical report should be formatted according to the Springer LNCS format, using either the LaTeX template or the Word template. LaTeX is the preferred format.