Zhipu AI has released the technical report for GLM-5V-Turbo, its first multimodal programming foundation model. The model supports a 200K context length, is compatible with Claude Code and OpenClaw, and is not open source. Its three core designs are the CogViT visual encoder, MMTP with a shared <|image|> token, and joint reinforcement learning over more than 30 tasks. RL brings gains across multiple domains; reported scores include Design2Code 94.8, MMSearch-Plus 30.0, and ImageMining 30.7.

BlockBeatNews

2026-05-08 02:53:48


According to Beating Monitoring, Zhipu AI has released the GLM-5V-Turbo technical report. The model launched on the Z.ai API and OpenRouter in early April; this release supplements that launch by disclosing the methodology, and the model has not been open-sourced. GLM-5V-Turbo is Zhipu's first multimodal programming foundation model. It supports a context length of around 200K and can connect to agent frameworks such as Claude Code and OpenClaw. Unlike most approaches, which treat vision as an add-on to a language model, this model integrates visual perception into the full chain of reasoning, planning, tool invocation, and execution, starting from the pre-training stage.

The model has three key design elements. The first is a new visual encoder, CogViT, which is pre-trained through dual-teacher distillation from SigLIP2 and DINOv3 and then aligned with contrastive learning on 8 billion bilingual (Chinese-English) image-text pairs. The second is multimodal multi-token prediction (MMTP), which replaces direct transmission of visual embeddings with a shared learnable <|image|> special token, reducing communication complexity across pipeline stages and making training more stable. The third is joint reinforcement learning over more than 30 tasks, covering three levels: perception, reasoning, and agent execution.
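The article's description of MMTP is brief, so the following is only a minimal Python sketch of one plausible reading: wherever the sequence contains an image position, the multi-token-prediction head receives the same learnable <|image|> vector rather than a per-patch visual embedding forwarded from the encoder. The function `build_mtp_inputs`, the toy vocabulary, and the 4-dimensional vectors are all hypothetical and are not taken from the report.

```python
import random

IMAGE_TOKEN = "<|image|>"  # special token named in the report


def build_mtp_inputs(tokens, text_table, image_embedding):
    """Return one input vector per position for a hypothetical MTP head.

    Text positions look up their own embedding. Every image position reuses
    the same shared vector, so no per-patch visual embeddings need to be
    forwarded to the pipeline stage that hosts the head (an assumption about
    how the report's design works, not a confirmed detail).
    """
    return [
        image_embedding if tok == IMAGE_TOKEN else text_table[tok]
        for tok in tokens
    ]


dim = 4  # toy embedding width, chosen only for illustration
rng = random.Random(0)

# Hypothetical text embedding table and the single shared image parameter.
text_table = {
    word: [rng.uniform(-1, 1) for _ in range(dim)]
    for word in ["write", "html", "for"]
}
image_embedding = [rng.uniform(-1, 1) for _ in range(dim)]

tokens = ["write", "html", "for", IMAGE_TOKEN, IMAGE_TOKEN, IMAGE_TOKEN]
inputs = build_mtp_inputs(tokens, text_table, image_embedding)

# All image positions point at the very same vector object.
shared = all(vec is image_embedding for vec in inputs[3:])
```

Under this reading, every image position resolves to one local parameter, so nothing image-specific has to cross pipeline stages for the prediction head, which would be consistent with the reduced communication the article describes.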

The gains from the RL phase are spread across many domains: 2D image localization +4.8%, video understanding +5.6%, 3D localization +7.7%, OCR +4.2%, chart understanding +7.7%, GUI agent (OSWorld) +4.9%, and multimodal search tool invocation +3.5%. The team notes in the paper that, unlike SFT, where cross-domain interference is common, multi-task RL lets the capabilities improve steadily together, and reasoning patterns learned in one domain even transfer to others.
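To get a rough sense of how evenly these gains are spread, the quoted figures can be tabulated in a few lines of Python. This is only an illustrative summary: the article does not say whether the percentages are absolute points or relative changes, so the numbers are treated as reported, and an average across such different benchmarks is indicative at best. The helper name `summarize` is hypothetical.

```python
# RL-phase gains as quoted in the article (units as reported there).
rl_gains = {
    "2D image localization": 4.8,
    "video understanding": 5.6,
    "3D localization": 7.7,
    "OCR": 4.2,
    "chart understanding": 7.7,
    "GUI agent (OSWorld)": 4.9,
    "multimodal search tool invocation": 3.5,
}


def summarize(gains):
    """Return the mean gain plus the tasks with the smallest and largest gains."""
    values = list(gains.values())
    top = max(values)
    return {
        "mean": round(sum(values) / len(values), 2),
        "min_task": min(gains, key=gains.get),
        "max_tasks": sorted(task for task, g in gains.items() if g == top),
    }


summary = summarize(rl_gains)
print(summary)
```

On these figures, every listed task improves, with the largest gains shared by 3D localization and chart understanding and the smallest on multimodal search tool invocation.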

Specific benchmark scores: Design2Code 94.8, surpassing Claude Opus 4.6; OSWorld 62.3 and AndroidWorld 75.7; on multimodal search, MMSearch 72.9 and BrowseComp-VL 51.9. On text-only programming, its CC-Bench-V2 scores for backend (22.8), frontend (68.4), and code repository exploration (72.2) all exceed those of its text-only base model, GLM-5-Turbo. MMSearch-Plus came in at 30.0, nearly 8 times the score of the previous-generation GLM-4.6V, and the self-developed visual deep search benchmark ImageMining came in at 30.7.

Zhipu GLM-5V-Turbo Technical Report: Design2Code Score Surpasses Claude Opus 4.6, Generating Code Directly from Screenshots
