Extracting structured and grounded fact triples from raw text is a
fundamental task in Information Extraction (IE). Existing IE datasets are
typically collected from Wikipedia articles, using hyperlinks to link entities
to the Wikidata knowledge base. However, models trained only on Wikipedia have
limitations when applied to web domains, which often contain noisy text or text
that does not have any factual information. We present WebIE, the first
large-scale, entity-linked closed IE dataset consisting of 1.6M sentences
automatically collected from the English Common Crawl corpus. WebIE also
includes negative examples, i.e. sentences without fact triples, to better
reflect the data on the web. We annotate ~21K triples from WebIE through
crowdsourcing and introduce mWebIE, a translation of the annotated set into four
other languages: French, Spanish, Portuguese, and Hindi. We evaluate the
in-domain, out-of-domain, and zero-shot cross-lingual performance of generative
IE models and find that models trained on WebIE show better generalisability. We
also propose three training strategies that use entity linking as an auxiliary
task. Our experiments show that adding entity-linking objectives improves the
faithfulness of our generative IE models.