Large Language Models (LLMs) have demonstrated considerable advances, and
several claims have been made about their exceeding human performance. However,
in real-world tasks, domain knowledge is often required. Low-resource learning
methods like Active Learning (AL) have been proposed to tackle the cost of
domain expert annotation, raising this question: Can LLMs surpass compact
models trained with expert annotations in domain-specific tasks? In this work,
we conduct an empirical study on four datasets from three different
domains, comparing SOTA LLMs with small models trained via AL on expert
annotations. We find that small models can outperform GPT-3.5 with a few
hundred labeled examples, and that they achieve performance comparable to or
better than GPT-4 despite being hundreds of times smaller. Based on these findings, we posit
that LLM predictions can be used as a warmup method in real-world applications
and human experts remain indispensable in tasks involving data annotation
driven by domain-specific knowledge.