# Types of extra human evaluations

## Multidimensional Quality Metric (MQM)

To adapt the generic MQM framework for our context, we followed the official guidelines for scientific research. Our annotators were instructed to identify all errors within each segment in a document, paying particular attention to document context. Each error was highlighted in the text and labeled with an error category and a severity. To temper the effect of long segments, we imposed a maximum of five errors per segment, instructing raters to choose the five most severe errors for segments containing more.

Our error hierarchy includes the standard top-level categories Accuracy, Fluency, Terminology, Style, and Locale, each with a specific set of sub-categories. Error severities are assigned independently of category and consist of Major, Minor, and Neutral levels, corresponding respectively to actual translation or grammatical errors, smaller imperfections, and purely subjective opinions about the translation. After an initial pilot run, we introduced a special Non-translation error that can be used to tag an entire segment that is too badly garbled to permit reliable identification of individual errors.

Since we are ultimately interested in scoring segments, we require a weighting on error types. We fixed the weight on Minor errors at 1 and explored a range of Major weights from 1 to 10 (10 being the Major weight recommended in the MQM standard). For each weight combination, we examined the stability of system ranking using a resampling technique.

Mqm-ted_zhen.tsv: MQM labels acquired for 15 submissions of TED talks for Chinese-to-English.
Mqm-ted_ende.tsv: MQM labels acquired for 15 submissions of TED talks for English-to-German.
Mqm-newstest2021_zhen.tsv: MQM labels acquired for 15 submissions of newstest2021 for Chinese-to-English.
Mqm-newstest2021_ende.tsv: MQM labels acquired for 15 submissions of newstest2021 for English-to-German.
Psqm_newstest2020_zhen.tsv: pSQM labels acquired for 10 submissions of newstest2020 for Chinese-to-English.
Mqm_newstest2020_zhen.tsv: MQM labels acquired for 10 submissions of newstest2020 for Chinese-to-English.
Psqm_newstest2020_ende.tsv: pSQM labels acquired for 10 submissions of newstest2020 for English-to-German.
Mqm_newstest2020_ende.tsv: MQM labels acquired for 10 submissions of newstest2020 for English-to-German.

You can use the MQM Viewer web app to open these TSV data files, both to compute scores and to slice and dice the data interactively (details and screenshots are presented further down in this documentation).
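As a rough illustration of the weighting scheme described above, the following sketch turns rows of one of these TSV files into per-segment and per-system penalty scores. The column names (`system`, `doc`, `seg_id`, `rater`, `severity`) and the severity strings are assumptions about the file layout, so check them against the actual header. The Minor weight is fixed at 1, the Major weight is a parameter, Neutral errors are assumed to count as 0, and the special Non-translation error is left out for brevity.

```python
import csv
import io
from collections import defaultdict


def severity_weight(severity, major_weight):
    """Map a severity label to a weight (Minor fixed at 1, Neutral assumed 0)."""
    label = severity.strip().lower()
    if label.startswith("major"):
        return float(major_weight)
    if label.startswith("minor"):
        return 1.0
    return 0.0  # Neutral or any unrecognised label (assumption)


def read_rows(tsv_text):
    """Parse TSV text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def segment_scores(rows, major_weight, max_errors=5):
    """Sum error weights per (system, doc, seg_id, rater).

    Only the max_errors heaviest errors are counted, mirroring the cap of
    five errors per segment described in the text.
    """
    weights = defaultdict(list)
    for row in rows:
        key = (row["system"], row["doc"], row["seg_id"], row["rater"])
        weights[key].append(severity_weight(row["severity"], major_weight))
    return {
        key: sum(sorted(ws, reverse=True)[:max_errors])
        for key, ws in weights.items()
    }


def system_scores(seg_scores):
    """Average the segment penalties per system (lower means fewer errors)."""
    per_system = defaultdict(list)
    for (system, _doc, _seg, _rater), score in seg_scores.items():
        per_system[system].append(score)
    return {system: sum(v) / len(v) for system, v in per_system.items()}
```

To try it on a real file, read the file contents as UTF-8 text, pass them to `read_rows`, and call `system_scores(segment_scores(rows, major_weight=...))` with a Major weight from the range you want to explore.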
We refer to our paper for more details of the experimental setup. The resulting human ratings are more reliable than crowd-worker human evaluations.
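The methodology above mentions checking the stability of system rankings with a resampling technique but does not spell out the procedure. The sketch below shows one plausible variant, a bootstrap over segments; this is a swapped-in assumption, not necessarily the method used for these ratings. It expects a hypothetical `scores_by_system` mapping from system name to a list of per-segment penalty scores, aligned so that index `i` refers to the same segment for every system.

```python
import random


def rank_systems(scores_by_system, indices):
    """Rank systems by mean penalty over the chosen segments (best first)."""
    means = {
        system: sum(scores[i] for i in indices) / len(indices)
        for system, scores in scores_by_system.items()
    }
    # Ties are broken by system name so the ranking is deterministic.
    return tuple(sorted(means, key=lambda s: (means[s], s)))


def ranking_stability(scores_by_system, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples whose ranking equals the full-data ranking."""
    n_segments = len(next(iter(scores_by_system.values())))
    full_ranking = rank_systems(scores_by_system, range(n_segments))
    rng = random.Random(seed)
    matches = 0
    for _ in range(n_resamples):
        # Draw segments with replacement, keeping the sample size unchanged.
        sample = [rng.randrange(n_segments) for _ in range(n_segments)]
        if rank_systems(scores_by_system, sample) == full_ranking:
            matches += 1
    return matches / n_resamples
```

Running `ranking_stability` once per candidate Major weight gives a simple way to compare how robust the resulting system ranking is under each weighting.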
# MQM web professional
We re-annotated the WMT English-to-German and Chinese-to-English test sets newstest2020, newstest2021, and the TED talks WMT21 test suite with raters who are professional translators and native speakers of the target language.

The contents of this repository are not an official Google product.
Expert-based Human Evaluations for the Submissions of WMT 2020 and WMT 2021 for English to German and Chinese to English.