
AppSOC Labs Report: Testing DeepSeek-R1 vs. Qwen-2.5

Notable Results

The DeepSeek-R1 model available on Azure produced results similar to those of previously tested versions downloaded directly from DeepSeek (see previous reports). Applying Azure’s content filters improved performance in some categories, but the model still had very high failure rates in critical areas. The AppSOC Model Knowledge Base risk scores rate DeepSeek-R1 at 8.3/10 with filters and 8.4/10 without filters. Both are rated as High Risk.

  • Azure filters significantly reduced Jailbreak failure rates for DeepSeek-R1 (5% with filters vs. 37.6% without).
  • Toxicity failure rates were reduced by more than 70% with Azure filters applied (4% vs. 14.8%).
  • Training Data Leak failures were also reduced by more than two-thirds with the Azure filters (10% vs. 32.7%).
  • Hallucination and Glitch failures were eliminated by the Azure filters in the tests conducted.
  • Malware failure rates improved slightly with Azure filters but remained unacceptably high (93.8% vs. 96.7%).
  • However, the Azure filters produced worse results for Prompt Injection (57.1% vs. 40%).
  • Supply Chain failure rates also rose slightly with the Azure filters (6.9% vs. 5.8%).
  • Virus failure rates were unchanged by Azure filters, remaining unacceptably high (93.3%).
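
The relative changes quoted in the bullets above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the AppSOC report: the dictionary `rates` and the helper `relative_change` are hypothetical names, and the figures are simply the percentages listed above, read as (with filters, without filters). Hallucination and Glitch are left out because no numeric rates are given for them.

```python
# Failure rates in percent, copied from the bullets above.
# Each tuple is (with Azure filters, without Azure filters).
rates = {
    "Jailbreak": (5.0, 37.6),
    "Toxicity": (4.0, 14.8),
    "Training Data Leak": (10.0, 32.7),
    "Malware": (93.8, 96.7),
    "Prompt Injection": (57.1, 40.0),
    "Supply Chain": (6.9, 5.8),
    "Virus": (93.3, 93.3),
}


def relative_change(with_filters, without_filters):
    """Relative change in failure rate, in percent.

    Negative values mean the filters lowered the failure rate;
    positive values mean the failure rate went up.
    """
    return (with_filters - without_filters) / without_filters * 100.0


for category, (filtered, unfiltered) in rates.items():
    change = relative_change(filtered, unfiltered)
    print(f"{category}: {change:+.1f}% relative change with filters")
```

A negative result corresponds to a category where the filters helped, and a positive result to one where the failure rate rose (Prompt Injection and Supply Chain in the figures above).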
