Evaluation Metrics

How system outputs are scored for each subtask

Each subtask uses metrics well-suited to the nature of the prediction: ranking for Subtask 1 and text similarity for Subtask 2.

MRR@10

Subtask 1: Find It

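Mean Reciprocal Rank at 10 measures how highly the correct original title is ranked among the top 10 candidates submitted by a system. Each query scores 1/rank of the correct title (1 for first place, 0.5 for second, and so on), or 0 if the title is not in the top 10; the final score is the mean over all queries.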
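As an illustration only, MRR@10 can be sketched in a few lines of Python. The function names `reciprocal_rank_at_k` and `mrr_at_k` are hypothetical, and the sketch assumes a candidate counts as correct only when it matches the gold title exactly as a string; the official scorer may normalize titles differently.

```python
from typing import Sequence


def reciprocal_rank_at_k(ranked: Sequence[str], gold: str, k: int = 10) -> float:
    """Return 1/rank of the gold title within the top k candidates, else 0."""
    # Assumption: exact string equality decides whether a candidate is correct.
    for rank, candidate in enumerate(ranked[:k], start=1):
        if candidate == gold:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rankings: Sequence[Sequence[str]], golds: Sequence[str], k: int = 10) -> float:
    """Average the per-query reciprocal ranks over all queries."""
    if len(rankings) != len(golds):
        raise ValueError("rankings and golds must have the same length")
    if not golds:
        return 0.0
    total = sum(reciprocal_rank_at_k(r, g, k) for r, g in zip(rankings, golds))
    return total / len(golds)


# Three toy queries: gold at rank 2, gold at rank 1, gold missing.
example_rankings = [["a", "b"], ["x", "y", "z"], ["p"]]
example_golds = ["b", "x", "q"]
example_score = mrr_at_k(example_rankings, example_golds)
```

In this toy case the per-query values are 0.5, 1, and 0, so the mean is 0.5. A correct title ranked 11th or lower contributes nothing, exactly like a missing one.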

F1 + BERTScore

Subtask 2: Fix It

Mean token-level F1 measures lexical overlap between the reconstructed and original titles, while BERTScore captures their semantic similarity.
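A minimal sketch of the token-level F1 component follows, in the style commonly used for span-overlap scoring. The function names `token_f1` and `mean_token_f1` are hypothetical, and lowercasing plus whitespace tokenization are assumptions made here for illustration; the official scorer may tokenize or normalize differently. BERTScore is left out because it depends on a pretrained model.

```python
from collections import Counter
from typing import Sequence


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two titles."""
    # Assumption: lowercase and split on whitespace; no punctuation handling.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a perfect match, otherwise zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both titles.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_token_f1(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Average token F1 over paired predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    scores = [token_f1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)


# Prediction drops one of four reference tokens: precision 1, recall 0.75.
example_f1 = token_f1("the quick fox", "the quick brown fox")
```

Here the prediction shares three tokens with the four-token reference, so precision is 1, recall is 0.75, and F1 is 6/7. Because the measure is purely lexical, a paraphrase with no shared tokens scores 0, which is the gap BERTScore is meant to cover.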

Metric Details