DecodingTrust: Revealing the Credibility Vulnerabilities of Large Language Models

2025-07-14 03:51:55

Abstract generation in progress

Assessing the Trustworthiness of Large Language Models: DecodingTrust Research Findings

A team composed of multiple universities and research institutions recently released a comprehensive evaluation platform for the credibility of large language models (LLMs). This research aims to comprehensively assess the reliability of the generative pre-trained transformer model (GPT).

Research has uncovered some previously undisclosed vulnerabilities related to credibility. For example, the GPT model is prone to generating harmful and biased outputs, and may leak private information from training data and conversation history. Although GPT-4 is generally more reliable than GPT-3.5 in standard benchmark tests, it is more susceptible to attacks when faced with maliciously designed prompts. This may be because GPT-4 more strictly follows misleading instructions.

The research team conducted a comprehensive evaluation of the GPT model from eight different perspectives, including its performance in adversarial environments. For example, they assessed the model's robustness against text adversarial attacks, using standard benchmarks and self-designed challenging datasets.

Research has also found that GPT models can sometimes be misled into producing biased content, especially when faced with carefully designed misleading system prompts. The degree of bias in the model often depends on the demographic groups and stereotype themes mentioned in the user prompts.

In terms of privacy, research has found that GPT models may leak sensitive information from training data, such as email addresses. GPT-4 is generally more robust than GPT-3.5 in protecting personal identity information, but both models perform well on certain types of information. However, when examples of privacy breaches occur in the conversation history, both models may leak all types of personal information.

This research provides important insights for assessing and improving the credibility of large language models. The research team hopes that this work will drive more research and ultimately help develop more powerful and reliable AI models.

GPT-8.92%

View Original

This page may contain third-party content, which is provided for information purposes only (not representations/warranties) and should not be considered as an endorsement of its views by Gate, nor as financial or professional advice. See Disclaimer for details.

10 Likes