
Survey of Large Language Models (LLMs) - 31st March 2023

The meteoric rise of systems like ChatGPT has ignited intense interest in evaluating large language models (LLMs). These powerful AI tools display an uncanny mastery of natural language, fueling speculation about their path to artificial general intelligence.

But how can we comprehensively and rigorously assess these emergent capabilities? A sweeping new survey paper from leading Chinese AI researchers provides pivotal insights. By dissecting these black-box systems, their analysis lays the groundwork for creating AI that benefits humanity.



Figure: the rise in the number of LLM papers over time.


Why Language Model Evaluation Matters

As LLMs grow more pervasive, proper evaluation becomes critical for two key reasons:

  1. Illuminates Strengths and Weaknesses: Testing sheds light on areas where LLMs excel versus falter, guiding research.

  2. Ensures Reliability and Safety: Vetting for robustness, security, and ethics is crucial before deployment, especially in sensitive domains like healthcare.

By cataloging the techniques and tasks for probing LLMs, this survey paper constitutes an invaluable reference for practitioners while highlighting open challenges.

20 Key Revelations in Language Model Evaluation:

  1. Varied evaluation dimensions: the survey organizes assessment around what to evaluate, where to evaluate, and how to evaluate.

  2. Natural language tasks assess language modeling capabilities.

  3. Reasoning evaluations probe inference abilities.

  4. Robustness assessments gauge model stability and security.

  5. Ethics evaluations uncover biases and safety issues.

  6. Scientific evaluations test scientific knowledge and reasoning.

  7. Medical evaluations measure clinical language and knowledge.

  8. Social science assessments gauge reasoning about human society.

  9. Engineering evaluations test coding skills and common sense.

  10. General benchmarks offer broad capabilities testing.

  11. Specific benchmarks target precise domains like medicine.

  12. Automatic metrics enable efficient, scalable LLM scoring.

  13. Human evaluation provides real-world assessment.

  14. Model scale impacts performance at different tasks.

  15. LLM success varies greatly across tasks.

  16. Reasoning and robustness are key weaknesses.

  17. Careful prompting is key to optimal performance (a prompt-comparison sketch follows this list).

  18. Alignment with human values remains challenging.

  19. Ongoing research must address emerging risks.

  20. Evaluation is integral to developing beneficial AI.
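
Point 17 is worth making concrete. Below is a minimal sketch of how two prompt formulations might be compared on the same question. The templates and the query_model function are hypothetical placeholders for whatever model is under evaluation, not anything prescribed by the survey.

```python
# Minimal sketch of comparing prompt formulations on the same question.
# The templates and query_model are illustrative placeholders only.

ZERO_SHOT = "Question: {question}\nAnswer:"
STEP_BY_STEP = "Question: {question}\nLet's think step by step.\nAnswer:"

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM being evaluated."""
    raise NotImplementedError

def compare_prompts(question: str, reference: str) -> dict:
    """Run one question under each template and check the answer against a reference."""
    results = {}
    for name, template in [("zero_shot", ZERO_SHOT), ("step_by_step", STEP_BY_STEP)]:
        answer = query_model(template.format(question=question))
        results[name] = {
            "answer": answer,
            "matches_reference": reference.lower() in answer.lower(),
        }
    return results
```

Even a simple harness like this makes prompt sensitivity measurable rather than anecdotal, which is the spirit of point 17.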

The paper provides a comprehensive taxonomy of language model evaluation across three key dimensions:

  1. What to evaluate - This encompasses the range of capabilities that need to be tested in language models, including natural language processing, reasoning, robustness, ethics, scientific knowledge, medical applications, social sciences, engineering skills, and more.

  2. Where to evaluate - The paper summarizes the diverse datasets and benchmarks used for language model evaluation, including general testing suites like GLUE, SuperGLUE, and BigBench as well as specialized benchmarks for particular tasks and domains.

  3. How to evaluate - This covers the techniques used for evaluation, ranging from automatic metrics like accuracy and BLEU to human evaluation, in which expert annotators judge model outputs directly.
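
To make the "where" and "how" dimensions concrete, the sketch below scores a handful of toy predictions with two of the automatic metrics the paper mentions, exact-match accuracy and BLEU. It assumes the nltk package is installed; benchmark data such as GLUE tasks would typically come from an existing distribution (for example the Hugging Face datasets library), but the examples here are invented purely for illustration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy predictions and references -- invented purely for illustration.
predictions = ["the cat sat on the mat", "paris is the capital of france"]
references = ["the cat sat on the mat", "the capital of france is paris"]

# Exact-match accuracy: fraction of predictions identical to their reference.
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(predictions)

# Sentence-level BLEU, averaged; smoothing keeps short sentences from scoring 0.
smooth = SmoothingFunction().method1
bleu = sum(
    sentence_bleu([r.split()], p.split(), smoothing_function=smooth)
    for p, r in zip(predictions, references)
) / len(predictions)

print(f"accuracy = {accuracy:.2f}, mean sentence BLEU = {bleu:.2f}")
```

The trade-off the survey highlights shows up even here: these metrics are cheap and reproducible, but they reward surface overlap, which is why human evaluation remains the complement for open-ended outputs.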

A key finding is that while language models demonstrate impressive performance on many natural language tasks, they still exhibit significant weaknesses when it comes to capabilities like reasoning, robustness, and alignment with human values. For example, models struggle with complex logical reasoning and multi-step inference. They are also vulnerable to adversarial attacks and can demonstrate social biases.
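
To illustrate the robustness weakness described above, one common style of probe perturbs the input slightly and checks whether the model's answer changes. The sketch below is a minimal, assumption-laden version: query_model is a hypothetical stand-in for the model under test, and adjacent-character swaps are just one simple perturbation strategy among many.

```python
import random

def perturb(text: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Inject small character-level typos by swapping adjacent characters."""
    if len(text) < 2:
        return text
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_swaps):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for the LLM under test."""
    raise NotImplementedError

def robustness_probe(prompt: str, n_variants: int = 5) -> float:
    """Fraction of perturbed prompts whose answer matches the clean answer."""
    clean_answer = query_model(prompt)
    matches = sum(
        query_model(perturb(prompt, seed=s)) == clean_answer
        for s in range(n_variants)
    )
    return matches / n_variants
```

A score well below 1.0 on probes like this is the kind of instability the survey flags, and more adversarial perturbations (paraphrases, distractor sentences, prompt injections) typically expose even larger gaps.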
