In this work we systematically review the recent advancements in code
processing with language models, covering 50+ models, 30+ evaluation tasks,
170+ datasets, and 700 related works. We break code processing models down
into general language models, represented by the GPT family, and specialized
models pretrained specifically on code, often with tailored objectives. We
discuss the relationships and differences between these models, and highlight
the historical transition of code modeling from statistical models and RNNs
to pretrained Transformers and LLMs, mirroring the course NLP itself has
taken. We also discuss code-specific features such as ASTs, CFGs, and unit
tests, along with their application in training code language models, and
identify key challenges and potential future directions in this domain. We keep
the survey open and updated on GitHub at
https://github.com/codefuse-ai/Awesome-Code-LLM.
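As a concrete illustration of the code-specific features mentioned above, the sketch below (our own illustration, not drawn from any surveyed model) parses a small snippet into an AST using Python's standard `ast` module; structure-aware code models typically serialize or traverse such trees as an auxiliary signal during pretraining.

```python
# Minimal sketch: extracting an AST, one of the code-specific
# features discussed in the survey. Uses only the standard library.
import ast

source = "def add(a, b):\n    return a + b\n"

tree = ast.parse(source)           # parse source text into an AST
print(ast.dump(tree, indent=2))    # inspect the tree structure

# Collect node types by walking the tree, e.g. as auxiliary tokens
# for a hypothetical structure-aware pretraining objective.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)  # ['Module', 'FunctionDef', 'arguments', ...]
```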