Zhipu GLM-5V-Turbo Technical Report: Design2Code Score Surpasses Claude Opus 4.6, Generating Code Directly from Screenshots


According to Beating Monitoring, Zhipu AI has released the GLM-5V-Turbo technical report. The model launched on the Z.ai API and OpenRouter in early April; this release is a supplementary disclosure of methodology, and the model has not been open-sourced. GLM-5V-Turbo is Zhipu's first multimodal foundation model for programming. It supports a context length of around 200K and can be connected to agent frameworks such as Claude Code and OpenClaw. Unlike most approaches, which treat vision as an add-on to a language model, this model integrates visual perception into the full chain of reasoning, planning, tool invocation, and execution, starting from the pre-training stage.

The model architecture has three key design elements. The first is a new visual encoder, CogViT, which is pre-trained by dual-teacher distillation from SigLIP2 and DINOv3 and then aligned through contrastive learning on 8 billion bilingual (Chinese-English) image-text pairs. The second is multimodal multi-token prediction (MMTP), which replaces the direct transmission of visual embeddings with a shared learnable <|image|> special token, reducing communication complexity across pipeline stages and making training more stable. The third is joint reinforcement learning over more than 30 tasks, covering three levels: perception, reasoning, and agent execution.
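The article does not give the distillation objective used for CogViT, so the following is only a minimal sketch of what a dual-teacher distillation loss can look like in general. It assumes the student's features have already been projected into each teacher's feature space, uses cosine distance as the matching criterion, and introduces a hypothetical mixing weight `weight_a`; none of these choices are taken from the report.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dual_teacher_loss(student, teacher_a, teacher_b, weight_a=0.5):
    """Weighted sum of the student's distance to each teacher's features.

    Assumption: `student` is already projected to the teachers' dimension.
    `weight_a` is a hypothetical mixing coefficient, not from the report.
    """
    loss_a = cosine_distance(student, teacher_a)
    loss_b = cosine_distance(student, teacher_b)
    return weight_a * loss_a + (1.0 - weight_a) * loss_b


if __name__ == "__main__":
    # Toy 2-D features: the student matches teacher A and is orthogonal to B.
    student = [1.0, 0.0]
    teacher_a = [1.0, 0.0]  # stand-in for one teacher's features
    teacher_b = [0.0, 1.0]  # stand-in for the other teacher's features
    print(dual_teacher_loss(student, teacher_a, teacher_b))
```

In this toy case the student agrees fully with one teacher and not at all with the other, so the loss sits halfway between the two extremes. A real training setup would compute such a loss per image patch or per image over batches and minimize it by gradient descent.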

The gains from the RL phase are spread across many capabilities: 2D image localization +4.8%, video understanding +5.6%, 3D localization +7.7%, OCR +4.2%, chart understanding +7.7%, GUI agent (OSWorld) +4.9%, and multimodal search tool invocation +3.5%. The team notes in the paper that, unlike SFT, where cross-domain interference is common, multi-task RL allowed all capabilities to improve steadily together, and reasoning patterns learned in one domain even transferred to others.

Specific benchmark scores: Design2Code 94.8, surpassing Claude Opus 4.6; OSWorld 62.3 and AndroidWorld 75.7; multimodal search MMSearch 72.9 and BrowseComp-VL 51.9. In pure-text programming on CC-Bench-V2, its backend (22.8), frontend (68.4), and code repository exploration (72.2) scores all exceed those of its pure-text foundation, GLM-5-Turbo. It scored 30.0 on MMSearch-Plus, nearly 8 times the score of the previous-generation GLM-4.6V, and 30.7 on ImageMining, Zhipu's self-developed visual deep search benchmark.
