Presentation
Skills, Safety, and Trust Evaluation of Large Language Models for Science
Description
The dramatic progress of Large Language Models (LLMs)
in the past 3-4 years opens the potential to use them for
scientific applications. To be applicable to scientific research,
the skills, trustworthiness, and safety of LLMs must be tested.
While several frameworks/benchmarks have emerged as de
facto standards for evaluating general-purpose LLMs (e.g.,
Eleuther AI Harness [2] and HELM [3] for skills, DecodingTrust
[5] for trustworthiness), few of them are specifically
related to science. In this extended abstract, we report the
discussions of the "Skills, Safety, and Trust Evaluation of
Large Language Models" break-out session of the Trillion
Parameter Consortium workshop in Barcelona (June 2024),
which exposed gaps in evaluation methods that must be
addressed before LLMs can be used broadly in scientific contexts.