Qwen2: Alibaba’s Open-Source LLM Evolves with Enhanced Capabilities and Multilingual Prowess

Alibaba makes another impactful contribution to the open-source LLM landscape with the release of Qwen2, a substantial upgrade to its predecessor, Qwen1.5. Qwen2 arrives with an array of model sizes, expanded language support, and impressive performance enhancements, positioning it as a versatile tool for diverse AI applications.

For more details, see the following sections:

  1. Scaling Up: A Model for Every Need
    1. Key Architectural Enhancements:
  2. Breaking Down Language Barriers: A Truly Multilingual LLM
  3. Performance that Speaks for Itself: Benchmarking Qwen2
    1. Qwen2-72B vs. Llama3-70B: A Battle of Giants
    2. Phi-3-Mini vs the Rest
  4. Highlights: Focusing on What Matters
    1. Coding & Mathematics: Sharpening Qwen2’s Analytical Edge
    2. Long Context Understanding: Unlocking New Possibilities
    3. Safety and Responsibility: Prioritizing Ethical AI
  5. Licensing: Navigating Openness and Restrictions
  6. Conclusion

Scaling Up: A Model for Every Need

Recognizing that one size doesn’t fit all in the world of AI, Qwen2 offers five distinct model sizes to accommodate various computational resources and application needs:

| Model | Parameters | Non-Emb Params | GQA | Tie Embedding | Context Length | Minimum GPU VRAM (BF16) |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2-0.5B | 0.49B | 0.35B | Yes | Yes | 32K | 1GB |
| Qwen2-1.5B | 1.54B | 1.31B | Yes | Yes | 32K | 4GB |
| Qwen2-7B | 7.07B | 5.98B | Yes | No | 128K | 16GB |
| Qwen2-57B-A14B | 57.41B | 56.32B | Yes | No | 64K | 128GB |
| Qwen2-72B | 72.71B | 70.21B | Yes | No | 128K | 128GB |

This variety empowers developers to select the model size that best balances computational efficiency with the required capabilities for their specific use case. (However, remember that the Minimum GPU VRAM requirements are estimations for inference using BF16 precision. Actual requirements may vary depending on factors like batch size, sequence length, and specific hardware configurations.)
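For a quick sanity check on those VRAM figures, note that BF16 stores roughly 2 bytes per parameter. Here is a minimal Python sketch assuming only that rule of thumb (real inference adds KV cache, activations, and framework overhead on top):

```python
# Back-of-envelope BF16 weight memory: ~2 bytes per parameter.
# Real inference adds KV cache, activations, and framework overhead,
# so treat these numbers as lower bounds.
def bf16_weight_gb(params_billion: float) -> float:
    return params_billion * 1e9 * 2 / (1024 ** 3)

for name, params in [("Qwen2-0.5B", 0.49), ("Qwen2-1.5B", 1.54),
                     ("Qwen2-7B", 7.07), ("Qwen2-72B", 72.71)]:
    print(f"{name}: ~{bf16_weight_gb(params):.1f} GB for weights alone")
```

This prints roughly 0.9, 2.9, 13.2, and 135.4 GB respectively, in the same ballpark as the table above once runtime overhead is factored in.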

Key Architectural Enhancements:

  • Group Query Attention (GQA) for All: Leveraging its success in Qwen1.5, GQA is now implemented across all Qwen2 models. This architectural choice accelerates inference and reduces memory requirements, enhancing Qwen2’s accessibility for wider deployment (a minimal sketch follows this list).
  • Tying Embedding for Smaller Models: Qwen2-0.5B and Qwen2-1.5B utilize tying embedding to optimize parameter usage, especially important given the significant proportion of parameters allocated to large embeddings in smaller LLMs.
  • Extended Context Length: Qwen2 pushes the boundaries of context length, with Qwen2-7B-Instruct and Qwen2-72B-Instruct demonstrating the capability to handle contexts up to 128K tokens. This extended window enables the processing and comprehension of larger text chunks for more complex language tasks.
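To make the GQA point concrete, here is a minimal sketch, assuming PyTorch; the head counts are illustrative rather than Qwen2’s actual configuration. A small number of key/value heads is shared across a larger group of query heads, so the KV cache (the `k` and `v` tensors below) shrinks by the group factor:

```python
# Minimal grouped-query attention (GQA) sketch, assuming PyTorch.
# Head counts are illustrative, not Qwen2's actual configuration.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 8, 64
num_q_heads, num_kv_heads = 16, 2          # GQA: many query heads, few KV heads
group_size = num_q_heads // num_kv_heads   # 8 query heads share each KV head

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)  # KV cache is 8x smaller
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)  # than with full MHA

# Broadcast each KV head across its group of query heads, then attend as usual.
k = k.repeat_interleave(group_size, dim=1)
v = v.repeat_interleave(group_size, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 8, 64])
```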

Breaking Down Language Barriers: A Truly Multilingual LLM

Moving beyond the common English and Chinese focus, Qwen2 embraces a global approach by incorporating data from 27 additional languages representing a variety of linguistic families:

  • Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
  • Eastern & Central Europe: Russian, Czech, Polish
  • Middle East: Arabic, Persian, Hebrew, Turkish
  • Eastern Asia: Japanese, Korean
  • South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
  • Southern Asia: Hindi, Bengali, Urdu

This broad language coverage, combined with focused efforts to address code-switching, makes Qwen2 a potent tool for multilingual natural language processing tasks.

Performance that Speaks for Itself: Benchmarking Qwen2

Qwen2 backs up its impressive features with strong performance on a wide array of benchmarks. Let’s examine how the models compare to two of their strongest counterparts: Llama3-70B for raw performance and Phi-3-Mini for efficiency.

Qwen2-72B vs. Llama3-70B: A Battle of Giants

| Dataset | Qwen2-72B | Llama-3-70B |
| --- | --- | --- |
| English | | |
| MMLU | 84.2 | 79.5 |
| MMLU-Pro | 55.6 | 52.8 |
| GPQA | 37.9 | 36.3 |
| Theorem QA | 43.1 | 32.3 |
| Coding | | |
| HumanEval | 64.6 | 48.2 |
| MBPP | 76.9 | 70.4 |
| EvalPlus | 65.4 | 54.8 |
| MultiPL-E | 59.6 | 46.3 |
| Mathematics | | |
| GSM8K | 89.5 | 83.0 |
| MATH | 51.1 | 42.5 |
| Multilingual | | |
| Multi-Exam | 76.6 | 70.0 |

Qwen2-72B demonstrates a consistent performance advantage over Llama-3-70B across all evaluated tasks, highlighting its strong grasp of English language understanding, coding capabilities, and mathematical reasoning.

Phi-3-Mini vs the Rest

| Dataset | Qwen2-0.5B | Phi-3-Mini | Qwen2-1.5B | Qwen2-7B | Qwen2-57B-A14B |
| --- | --- | --- | --- | --- | --- |
| English | | | | | |
| MMLU | 45.4 | 68.1 | 56.5 | 70.3 | 76.5 |
| HellaSwag | 49.3 | 74.5 | 66.6 | 80.7 | 85.2 |
| TruthfulQA | 39.7 | 63.2 | 45.9 | 54.2 | 57.7 |
| Coding | | | | | |
| HumanEval | 22.0 | 57.9 | 31.1 | 51.2 | 53.0 |
| MBPP | 22.0 | 62.5 | 37.4 | 65.9 | 71.9 |
| Mathematics | | | | | |
| GSM8K | 36.5 | 83.6 | 58.5 | 79.9 | 80.7 |

While Phi-3-Mini consistently outperforms Qwen2-0.5B and Qwen2-1.5B, likely due to its larger size (3.8B parameters compared to 0.5B and 1.5B), these small models still demonstrate reasonable capability for their sizes.

Highlights: Focusing on What Matters

Coding & Mathematics: Sharpening Qwen2’s Analytical Edge

Qwen2-72B, in particular, showcases significant improvements in coding and mathematical capabilities. These enhancements are evident in its performance on benchmarks like HumanEval, MBPP, GSM8K, and MATH. This highlights Qwen2’s potential for complex problem-solving tasks.

Long Context Understanding: Unlocking New Possibilities

Qwen2’s extended context length, especially in the 7B and 72B models, opens up possibilities for handling long-form text processing. In fact, with the Needle in a Haystack test, where a random fact or statement (the ‘needle’) is placed in the middle of a long context window (the ‘haystack’) and the LLM must retrieve it, Qwen2 demonstrates good capability in extracting information from large volumes of text.
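As a rough illustration of how such a test is constructed, here is a minimal sketch; `ask_model` is a hypothetical stand-in for whatever Qwen2 inference backend you use:

```python
# Minimal needle-in-a-haystack sketch. `ask_model` is a hypothetical
# stand-in for any Qwen2 chat-completion backend you have available.
def build_haystack(needle: str, filler: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at relative `depth` (0.0 = start, 1.0 = end) of filler text."""
    haystack = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(total_chars * depth)
    return haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]

needle = "The secret passphrase is 'blue-penguin-42'."
filler = "The sky was clear and the market was quiet that day. "
context = build_haystack(needle, filler, total_chars=200_000, depth=0.5)
prompt = context + "\n\nWhat is the secret passphrase?"
# answer = ask_model(prompt)  # a long-context model should return 'blue-penguin-42'
```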

Safety and Responsibility: Prioritizing Ethical AI

Qwen2 incorporates a strong focus on safety and responsibility. Qwen2-72B-Instruct, in particular, exhibits a low proportion of harmful responses, demonstrating its alignment with ethical AI principles.

Licensing: Navigating Openness and Restrictions

Qwen2 introduces a nuanced approach to licensing, with different models falling under different license agreements.

  • Apache 2.0 License: The majority of Qwen2 models, including Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, and Qwen2-57B-A14B, are released under the permissive Apache 2.0 license. This open-source license grants users broad freedoms to use, modify, distribute, and even commercialize the models, promoting accessibility and fostering a collaborative development ecosystem.
  • Qianwen License: The largest model, Qwen2-72B, and its instruction-tuned counterpart remain under the original Qianwen License. This license, while granting usage rights, imposes restrictions on commercial use for products or services exceeding 100 million monthly active users. This restriction aims to balance open access for research and development with Alibaba’s commercial interests in controlling the large-scale deployment of its most advanced model.

This dual-licensing approach presents both opportunities and challenges. The Apache 2.0 license encourages wider adoption and innovation for the smaller Qwen2 models, enabling developers to freely integrate them into various applications. However, the restrictions imposed by the Qianwen License on the largest Qwen2-72B model could potentially hinder its widespread commercial adoption, particularly for companies targeting large user bases.

Conclusion

In short, another good model is out and worth testing. Let’s go check out its Hugging Face demo!
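And if you’d rather experiment locally than in the demo, a minimal sketch with Hugging Face transformers might look like this (assuming the transformers library is installed and you have enough VRAM for the 7B model):

```python
# Minimal local-inference sketch with Hugging Face transformers.
# Assumes `transformers` (and `accelerate` for device_map) are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain what makes Qwen2 different from Qwen1.5."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
# Strip the prompt tokens and print only the generated answer.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```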
