Multilingual Semantic Search - Covid-19 MLIA @ Eval

Task Description

The goal of the Multilingual Semantic Search task is to collect relevant information for the community, the general public as well as other stakeholders, when searching for health content in different languages and with different levels of knowledge about the specific topic.
There will be two sub-tasks: subtask 1 is a classic ad-hoc multilingual search task focused more on high precision; subtask 2 is more oriented towards high-recall systems, like Technology Assisted Review (TAR) systems.

To participate in the Multilingual Semantic Search task, the groups need to register at the following link:

Register

Important Dates - Round 2

Round starts: June 14, 2021

Corpora and topics released: June 28, 2021

Runs due from participants: ~~September 15, 2021~~ September 25, 2021

Ground-truth released and runs scored: October 29, 2021

Rolling report submission deadline (camera ready): November 19, 2021

Slot for a virtual meeting to discuss the results: November 30 - December 2, 2021

Round ends: December 2, 2021

Participation Guidelines

The overall organization will follow a CLEF-style evaluation process with a shared dataset composed of a collection of documents, a set of topics, and a set of relevance assessments.

The languages of the collections for the task are (ISO 639-1 codes within parentheses):

Arabic (ar);
English (en);
French (fr);
German (de);
Greek (el);
Italian (it);
Spanish (es);
Swedish (sv);
Ukrainian (uk).

Topics will be available in the above languages and, in addition:

Chinese (zh);
Japanese (ja).

In this second round, we plan to investigate the effectiveness of the systems in a classic multilingual lexical search fashion. The information about the relevance of the documents provided in the first round can be used to train and optimize systems, as well as to simulate interactive systems where the user's answers serve as feedback to the system.

For each of the two subtasks described below, we welcome two types of submissions:

monolingual runs where the language of the collection and the language of the topics are the same;
bilingual runs where the language of the collection and the language of the topics are different.

Subtask 1 - High Precision:

In this subtask, participants are required to build systems that will help the general public to retrieve the most relevant documents on the Web concerning COVID-19 efficiently. The main focus of this subtask is on the top ranked documents; evaluation measures like Precision at 5 and 10 documents as well as Normalized Discounted Cumulative Gain will be used to compare systems.

Subtask 2 - High recall:

In this subtask, the focus is on finding as many relevant documents as possible with the least effort. Given limited resources, such as time and expert availability in a time of crisis, there will be a limit on the maximum number of documents that can be retrieved in order to build a set of relevant documents to be delivered to the general public. Evaluation measures like Recall@k and Area Under ROC will be used to compare the systems. In this second round, the systems can use the relevance assessments provided in the first round to optimize their effectiveness. For subtask 2 only, we encourage participants to re-run the experiments of round 1 using the provided relevance assessments. In this way, a system can simulate an interaction with the user in which relevance feedback is used to optimize the ranking of the subsequent documents.

Corpora:

Please, find hereby the links to download the corpus for each round.

Topics:

The topics have been created by selecting 1) a subset of the queries created for the TREC-COVID Task (courtesy of TREC-COVID Task organizers) and 2) a selection of queries made available in the Bing search dataset for Coronavirus Intent which includes queries from all over the world that had an explicit/implicit intent related to the Coronavirus or Covid-19.
Topics are structured in the following way:


<topic number="topic identifier" xml:lang="ISO 639-1 code">
	<keyword>keyword based query</keyword>
	<conversational>the query as a question posed by the user</conversational>
	<explanation>a more detailed explanation of what the set of retrieved documents should look like</explanation>
</topic>

The keyword field represents the “traditional” way a user performs a search on a Web search engine. It is basically a set of keywords, e.g. "surgical mask protection".
The conversational field asks the same thing as a natural-language question, e.g. "does a surgical mask protect from covid-19?"
The explanation field is used to provide information to the assessors when performing relevance assessments, e.g. "The documents retrieved should contain information about …".
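As an illustration of the topic structure above, the following minimal sketch reads the three fields with Python's standard library. The `<topics>` wrapper element, the sample values, and the helper name `parse_topics` are assumptions made for the example; the actual topic files may be organized differently.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample: the <topics> wrapper and all values are assumptions.
SAMPLE = """<topics>
<topic number="1" xml:lang="en">
	<keyword>surgical mask protection</keyword>
	<conversational>does a surgical mask protect from covid-19?</conversational>
	<explanation>The documents retrieved should contain information about ...</explanation>
</topic>
</topics>"""


def parse_topics(xml_text):
    """Return one dict per <topic> element with its attributes and three fields."""
    topics = []
    for node in ET.fromstring(xml_text).iter("topic"):
        topics.append({
            "number": node.get("number"),
            "lang": node.get(XML_LANG),
            "keyword": (node.findtext("keyword") or "").strip(),
            "conversational": (node.findtext("conversational") or "").strip(),
            "explanation": (node.findtext("explanation") or "").strip(),
        })
    return topics


topics = parse_topics(SAMPLE)
print(topics[0]["lang"], topics[0]["keyword"])
```

A system would typically use either the keyword or the conversational field as the query text; the explanation field is intended for the assessors.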

Please, find hereby the links to download the topics for each round.

Relevance Judgements:

After participants submit their runs, a subset of documents for each run will be pooled for each topic in order to get a sample of documents to judge.
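The exact pooling strategy is not specified here. One common approach in CLEF- and TREC-style campaigns is depth-k pooling, where the top k documents of every run are merged per topic. The sketch below illustrates that idea only; the function name `build_pool` and the pool depth are hypothetical and not necessarily what the organizers use.

```python
from collections import defaultdict


def build_pool(runs, depth):
    """Union, per topic, of the top-`depth` documents of every run.

    `runs` is a list of runs; each run maps a topic id to its ranked list of
    document ids (best first). `depth` is a hypothetical pool depth.
    """
    pool = defaultdict(set)
    for run in runs:
        for topic, ranked_docs in run.items():
            pool[topic].update(ranked_docs[:depth])
    return dict(pool)


# Toy runs with made-up document ids.
run_a = {"30": ["d1", "d2", "d3"]}
run_b = {"30": ["d2", "d4", "d5"]}
pool = build_pool([run_a, run_b], depth=2)
print(sorted(pool["30"]))  # ['d1', 'd2', 'd4']
```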

Please, find hereby the links to download the relevance judgements for each round.

relevance judgements for Round 1.
relevance judgements for Round 2 (to be updated).

Participant Repository:

Participants are provided with a single repository for all the tasks they take part in. The repository contains the runs, resources, code, and report of each participant.

The repository is organised as follows:

submission: this folder contains the runs submitted for the different tasks in the different evaluation rounds.
score: this folder contains the performance scores of the submitted runs.
code: this folder contains the source code of the developed system.
resource: this folder contains (language) resources created during the participation.
report: this folder contains the rolling technical report describing the techniques applied and insights gained during participation, round after round.

Covid-19 MLIA Eval consists of three tasks run in three rounds. Therefore, the submission and score folders are organized into sub-folders for each task and round as follows:

submission/task1/round1: for the runs submitted to the first round of the first task. Similar structure for the other tasks and rounds.
score/task1/round1: for the performance scores of the runs submitted to the first round of the first task. Similar structure for the other tasks and rounds.

Participants who do not take part in a given task or round can simply delete the corresponding sub-folders.

The goal of Covid-19 MLIA Eval is to speed up the creation of multilingual information access systems and (language) resources for Covid-19 as well as to openly share these systems and resources as much as possible. Therefore, participants are strongly encouraged to share their code and any additional (language) resources they have used or created.

All the contents of these repositories are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Task Repository:

Organizers share contents common to all participants through the Multilingual Semantic Search task repository.

The repository is organised as follows:

topics: this folder contains the topics to be used for the task.
ground-truth: this folder contains the ground-truth, i.e. the qrels, for the task.
report: this folder contains the rolling technical report describing the overall outcomes of the task, round after round.

Covid-19 MLIA Eval runs in three rounds. Therefore, the topics and ground-truth folders are organized into sub-folders for each round, i.e. round1, round2, and round3.

All the contents of this repository are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

Rolling Technical Report:

The rolling technical report should be formatted according to the Springer LNCS format, using either the LaTeX template or the Word template. LaTeX is the preferred format.

Submission Guidelines

Participating teams should satisfy the following guidelines:

The runs should be submitted in TREC format (described below);
Each group can submit a maximum of 5 monolingual runs for each language for each subtask and 5 bilingual runs for each pair of languages for each subtask;
The code used to produce the runs should be uploaded to the Bitbucket repository provided by the organizers upon registration to Covid-19 MLIA Eval.

Submission for subtask 1 - High precision

A run may retrieve at most 1,000 documents per topic. Therefore, the file of the run submitted for this task must contain no more than 30,000 lines. Any additional retrieved document will not be considered in the evaluation.

Submission for subtask 2 - High recall

For this subtask, there is no limit for the number of documents retrieved per topic. However, the maximum number of retrieved documents allowed in a run is 6,000. Therefore, each run can have a variable number of documents retrieved per topic (on average 200 documents per topic), but the file of the run submitted for this task must contain no more than 6,000 lines. Any additional retrieved document will not be considered in the evaluation.
For this subtask only, each group can re-run the experiments of round 1 using the relevance assessments as explicit relevance feedback. The number of runs and the requirements for the number of documents retrieved are identical to the ones described in this section. Please use the freefield in the run name with the value rerun_round1 to indicate this type of run.
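Before uploading, the size limits of the two subtasks can be checked with a few lines of code. The following is a minimal sketch, not an official validator: the function name `check_run_size` and the wording of the messages are assumptions, while the two limits (1,000 documents per topic for subtask 1 and 6,000 lines per run for subtask 2) are the ones stated above.

```python
from collections import Counter

MAX_DOCS_PER_TOPIC_SUBTASK1 = 1000  # per-topic limit stated for subtask 1
MAX_LINES_SUBTASK2 = 6000           # whole-run limit stated for subtask 2


def check_run_size(lines, subtask):
    """Return a list of problems; an empty list means the run fits the limits."""
    # Ignore blank lines; the first column of each row is the topic number.
    rows = [line.split() for line in lines if line.strip()]
    problems = []
    if subtask == 1:
        per_topic = Counter(row[0] for row in rows)
        for topic, count in sorted(per_topic.items()):
            if count > MAX_DOCS_PER_TOPIC_SUBTASK1:
                problems.append(f"topic {topic}: {count} documents")
    elif subtask == 2:
        if len(rows) > MAX_LINES_SUBTASK2:
            problems.append(f"run has {len(rows)} lines")
    else:
        raise ValueError("subtask must be 1 or 2")
    return problems
```

For example, `check_run_size(open(path).read().splitlines(), 2)` returns an empty list when the run file at the hypothetical location `path` has at most 6,000 non-empty lines.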

TREC Format:

Runs should be submitted with the following format:


30 Q0 ZF08-175-870  0 4238 prise1
30 Q0 ZF08-306-044  1 4223 prise1
30 Q0 ZF09-477-757  2 4207 prise1
30 Q0 ZF08-312-422  3 4194 prise1
30 Q0 ZF08-013-262  4 4189 prise1
...

where:

Columns are separated by a white space;
The first column is the topic number;
The second column is the query number within that topic, which is currently unused and should always be Q0;
The third column is the official document number of the retrieved document, which is the name of the file without the extension (see below for some examples);
The fourth column is the rank at which the document is retrieved;
The fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order. It is important to include the score so that we can handle tied scores (for a given run) in a uniform fashion (trec_eval sorts documents by these scores, not your ranks);
The sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run.

It is important to include all the columns and have a white space delimiter between the columns.
Please, find hereby a list of examples of valid document identifiers:

eupresscorner-de-ip_20_232
eurlex-sv-celex_52020XG0609_04_1
gv-de-20200225-42421
medisys-de-2020_04_1.xml_5
wikipedia-uk-951614
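The six-column format described above can be produced with a small helper. This is a minimal sketch under stated assumptions: the function name `format_run`, the scores, and the run tag `myteam_bm25` are hypothetical, and ranks start at 0 only because the sample lines above do.

```python
def format_run(topic, scored_docs, run_tag):
    """Return TREC-format lines for one topic.

    `scored_docs` is a list of (document id, score) pairs in any order.
    """
    # Sort by descending score so that ranks and scores agree.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [
        f"{topic} Q0 {doc_id} {rank} {score} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked)
    ]


lines = format_run(
    "30",
    [("eupresscorner-de-ip_20_232", 11.2), ("gv-de-20200225-42421", 12.7)],
    "myteam_bm25",  # hypothetical run tag
)
print("\n".join(lines))
```

Sorting by score before assigning ranks keeps the two columns consistent, which matters because trec_eval orders documents by score rather than by the rank column.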

Submission Upload:

Runs should be uploaded to the repository provided by the organizers. Following the repository structure discussed above, for example, a run submitted for the second round of the Multilingual Semantic Search task should be placed in submission/task2/round2. In particular:

runs submitted to subtask 1 should be put in submission/task2/round2/subtask1;
runs submitted to subtask 2 should be put in submission/task2/round2/subtask2.

Runs should be uploaded with the following name convention: <teamname>_task2<N>_<round>_<language>_<freefield> where:

teamname is the name of the participating team;
task2<N> is the identifier of the Multilingual Semantic Search task, i.e. task21 for subtask 1 and task22 for subtask 2;
round is the round of Covid-19 MLIA @ Eval the run is submitted to. It could be round1, round2, or round3;
language specifies whether it is a monolingual (mono) or a bilingual run (bili). By using ISO 639-1 language codes, it also specifies the source language as well as the target language, in case of bilingual runs. For example:
- mono-en indicates a monolingual English run;
- bili-zh2uk indicates a bilingual Chinese to Ukrainian run;
freefield is a free field that participants can use as they prefer. Use this field with the value rerun_round1 (together with any additional information that you need) if this run is a rerun of round 1.

For example, a complete run identifier may look like unipd_task21_round2_bili-it2sv_bm25 where

unipd is the University of Padua team;
task21 is submitted for subtask 1;
round2 indicates that the run has been submitted to the second round;
bili-it2sv indicates a bilingual Italian to Swedish run;
bm25 suggests that participants have used BM25 as retrieval model.

Another example is unipd_task21_round2_bili-it2sv_rerun_round1_bm25 where

unipd is the University of Padua team;
task21 is submitted for subtask 1;
round2 indicates that the run has been submitted to the second round;
bili-it2sv indicates a bilingual Italian to Swedish run;
rerun_round1_bm25 indicates that this run is a rerun of round 1 and suggests that participants have used BM25 as retrieval model.
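The naming convention above can be checked mechanically before upload. The regular expression below is a sketch under two assumptions that the guidelines do not state: the team name contains no underscore, and the freefield is always present. The names `RUN_NAME` and `parse_run_name` are hypothetical.

```python
import re

# Pattern for <teamname>_task2<N>_<round>_<language>_<freefield>.
# Assumptions: no underscore in the team name; freefield is non-empty.
RUN_NAME = re.compile(
    r"^(?P<team>[^_]+)_task2(?P<subtask>[12])_(?P<round>round[123])_"
    r"(?P<language>mono-[a-z]{2}|bili-[a-z]{2}2[a-z]{2})_(?P<freefield>.+)$"
)


def parse_run_name(name):
    """Split a run name into its fields, or return None if it does not match."""
    match = RUN_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # A rerun of round 1 is flagged by rerun_round1 inside the freefield.
    fields["is_rerun"] = "rerun_round1" in fields["freefield"]
    return fields


print(parse_run_name("unipd_task21_round2_bili-it2sv_bm25"))
```

For the second example above, the parsed freefield is rerun_round1_bm25 and `is_rerun` is true.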

Performance scores for the submitted runs will be returned by the organizers in the score folder, which follows the same structure as the submission folder.

The rolling technical report has to be uploaded and kept up to date by participants in the report folder.

Here, you can find a sample participant repository to get a better idea of its layout.

Evaluation:

The effectiveness of the submitted runs will be evaluated with the following measures:

Precision at 5 (P@5)
Average Precision (AP)
normalized Discounted Cumulated Gain (nDCG)
R-Precision (RPrec)
Recall
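For reference, the measures listed above can be computed for a single topic as in the following sketch. It is not the official scorer: the function names and the toy data are made up, and the nDCG variant (linear gains with a log2(rank + 1) discount) is an assumption. Official scores should be taken from the organizers' evaluation.

```python
import math


def precision_at(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(doc in relevant for doc in ranking[:k]) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    return precision_at(ranking, relevant, len(relevant)) if relevant else 0.0


def recall(ranking, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(set(ranking) & relevant) / len(relevant) if relevant else 0.0


def ndcg(ranking, gains):
    """nDCG with linear gains and a log2(rank + 1) discount (assumed variant)."""
    dcg = sum(gains.get(doc, 0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranking, start=1))
    ideal_gains = sorted(gains.values(), reverse=True)
    ideal = sum(g / math.log2(rank + 1)
                for rank, g in enumerate(ideal_gains, start=1))
    return dcg / ideal if ideal else 0.0


ranking = ["d1", "d2", "d3", "d4", "d5"]  # toy system output, best first
gains = {"d1": 1, "d3": 1, "d9": 1}       # toy qrels: document id -> grade
relevant = {doc for doc, grade in gains.items() if grade > 0}
print(precision_at(ranking, relevant, 5), average_precision(ranking, relevant))
```

Per-run scores are usually obtained by averaging these per-topic values over all topics.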

Organizers

Giorgio Maria Di Nunzio, University of Padua, Italy
dinunziodei.unipd.it

Maria Eskevich, CLARIN ERIC
mariaclarin.eu