DeepSeek or Qwen for Enterprise AI? Trust Neither.

AppSOC Research Labs tested DeepSeek-R1 vs. Qwen-2.5 and found many flaws in both

DeepSeek or Qwen for Enterprise AI? Trust Neither.

The rapid advancement of AI models continues to generate excitement, but it also raises serious security concerns. AppSOC Research Labs recently conducted a comparative security analysis of DeepSeek-R1 and Qwen-2.5, two large language models (LLMs) that have gained industry attention. Our latest testing, performed using the AppSOC AI Security Platform, confirms a troubling reality: both models exhibit severe security vulnerabilities, rendering them unsuitable for enterprise deployment.

While Qwen-2.5-Math-1.5B outperformed DeepSeek-R1 in some areas, it also demonstrated even worse security failures in others. This follow-up analysis highlights the key similarities and differences in their risk profiles and reinforces a critical takeaway—neither model should be trusted for enterprise applications.

Please note that Hugging Face includes this security notice on Qwen-2.5-Math-1.5B: Qwen2.5-Math mainly supports solving English and Chinese math problems through CoT and TIR. We do not recommend using this series of models for other tasks.

Comparative Security Failures: DeepSeek-R1 vs. Qwen-2.5

Our testing methodology incorporated automated static analysis, dynamic testing, and red teaming to assess each model's susceptibility to real-world threats. The results expose significant weaknesses in both models, though their risk profiles differ in key areas.

1. Jailbreaking: Qwen-2.5 Performs Worse Than DeepSeek-R1

Jailbreaking attacks attempt to bypass built-in safety mechanisms, allowing models to generate restricted or harmful content. Qwen-2.5 failed spectacularly, with an 82% failure rate, more than double DeepSeek-R1’s 37.6% failure rate. This alarming result suggests Qwen-2.5 is even more vulnerable to adversarial prompts that could lead to misuse.

2. Prompt Injection: DeepSeek-R1 Is the Bigger Risk

Prompt injection attacks manipulate a model into revealing protected data, executing unauthorized actions, or circumventing safeguards. Here, Qwen-2.5 significantly outperformed DeepSeek-R1, with a failure rate of only 1.2% compared to DeepSeek-R1’s staggering 57.1%. While Qwen-2.5 demonstrates stronger protections against this class of attack, the model still suffers from other severe vulnerabilities.

3. Malware Generation: Both Models Are Dangerous

An AI model capable of generating malware poses an unacceptable security risk. DeepSeek-R1 had an alarming 96.7% failure rate, proving almost entirely incapable of blocking attempts to generate harmful scripts. Qwen-2.5 fared somewhat better but still failed 75.4% of the time—far too high for enterprise use.

4. Supply Chain Risks: Both Models Show Low Failure Rates

Supply chain security is crucial for AI adoption, as models must not provide incorrect or unsafe recommendations regarding external software and dependencies. DeepSeek-R1 and Qwen-2.5 both performed relatively well in this area, failing only 5.8% and 6.3% of tests, respectively.

5. Toxicity and Ethical Risks: Qwen-2.5 Is Worse

A responsible AI model must minimize the generation of harmful or offensive content. Here, Qwen-2.5 was the clear loser, failing 39.4% of toxicity tests compared to DeepSeek-R1’s 14.8%. This means Qwen-2.5 is more than twice as likely to produce inappropriate or harmful responses, raising major ethical concerns.

6. Glitches and Instability: Qwen-2.5 Shows Greater Instability

Models should behave predictably and reliably across different prompts. When tested for glitches using adversarial tokens, Qwen-2.5 exhibited 85.6% failure rates, while DeepSeek-R1 had a minimal failure rate of 1.0%. This suggests that Qwen-2.5 is more prone to erratic behavior, which could lead to unpredictable outputs in production environments.

7. Training Data Leakage: DeepSeek-R1 Shows Significant Risk

Training data leaks pose a serious risk, as they can expose proprietary or sensitive information. DeepSeek-R1 failed 32.7% of tests in this category, whereas Qwen-2.5 had a remarkably low failure rate of 0.7%. Organizations handling confidential data should be particularly wary of DeepSeek-R1’s vulnerabilities in this area.

8. Hallucination Rates: Both Models Struggle with Accuracy

Both models demonstrated high hallucination rates, meaning they frequently generated incorrect or fabricated information. DeepSeek-R1 failed 50.4% of the time, while Qwen-2.5 had a similar failure rate of 57.6%. This reinforces a key concern—neither model can be trusted to deliver consistently reliable information.

Enterprise Risk Scores: Both Models Are High-Risk

AppSOC quantifies security risks using a proprietary scoring system that assesses vulnerabilities, compliance concerns, and operational risks. Our final risk assessments for these models were:

  • Qwen-2.5: 9.0 / 10 (High Risk)
  • DeepSeek-R1: 8.4 / 10 (High Risk)

While Qwen-2.5 outperforms DeepSeek-R1 in select categories (e.g., prompt injection resistance, supply chain security), its overall risk score remains higher due to its extreme vulnerability to jailbreaking, toxicity issues, and system instability.

Why Enterprises Should Avoid Both Models

Despite their growing adoption, neither DeepSeek-R1 nor Qwen-2.5 is fit for enterprise use. The security failures documented in this report highlight risks that could lead to:

  • Data breaches: Prompt injections and jailbreaking could expose confidential or proprietary data.
  • Regulatory violations: Toxicity and compliance failures could create legal and ethical liabilities.
  • Cyber threats: The ability to generate malware makes both models a potential tool for attackers.

Given these risks, enterprises should be extremely cautious when considering open-source LLMs and prioritize rigorous security testing before any deployment.

AppSOC’s Recommendations for AI Security

To mitigate AI security risks, organizations must implement the following best practices:

  1. Model Risk Assessments: Continuously test AI models for security vulnerabilities before and after deployment.
  2. Automated AI Red Teaming: Use tools like the AppSOC AI Security Platform to simulate real-world attacks and ensure models can withstand adversarial manipulation.
  3. Access Controls & Guardrails: Limit model exposure and ensure robust policies are in place to prevent unauthorized use.
  4. Regulatory Compliance Monitoring: Stay ahead of evolving AI regulations by enforcing compliance-driven security checks.

Final Thoughts: AI Adoption Must Prioritize Security

The results of this comparative test should serve as a stark warning to enterprises. While some open-source AI models may appear attractive due to cost or performance claims, their security risks cannot be ignored. As our analysis of DeepSeek-R1 and Qwen-2.5 has shown, even widely used models can pose severe threats to organizations.

For enterprises looking to integrate AI securely, robust testing and continuous monitoring are non-negotiable. At AppSOC, we remain committed to ensuring AI adoption does not come at the expense of security.

Secure Your Path to AI Adoption.

For more information on AI security assessments, visit www.appsoc.com or contact us at info@appsoc.com.