GLM-5V-Turbo: A Leap Towards Native Multimodal AI Agents

In the fierce competition among domestic large models, Zhipu’s GLM series has consistently held a highly valuable card: exceptional coding capabilities.

As AI transitions from large language models to intelligent agents, competition in the industry has entered its second phase, and developers and the developer ecosystem are the customers most willing to pay.

However, industry giants clearly expect more from AI than just an “outsourced programmer”; only by becoming a versatile intelligent agent capable of truly managing system workflows can AI integrate into the lives of ordinary people.

Thus, a powerful AI that merely types is far from sufficient; it must develop the ability to perceive, analyze webpage layouts, understand charts, and interpret complex non-textual information on GUIs.

Recently, DeepSeek began a limited gray-release test of its “Image Recognition Mode.”

Now, Zhipu is closely following suit, officially embarking on a new exploration in the multimodal field. In the technical report for the latest model, GLM-5V-Turbo, we can clearly see that this is Zhipu’s new offensive towards a native multimodal intelligent agent, filled with technical prowess, engineering compromises, and commercial considerations.

01 Brute Force and Fine Control in the Visual Foundation

Integrating visual capabilities into large language models has been a frequently attempted approach over the past few years.

However, the resulting visual language models (VLMs) often end up being mere patchwork products, where the language model serves as the brain and the visual module acts merely as an external camera.

In other words, the model cannot comprehend the logic embedded in images and other non-textual information. Forcing two-dimensional visual signals into a one-dimensional token sequence leaves the model unable to understand images, prone to overlooking key details, and liable to produce severe hallucinations, which makes it unsuitable for use as an intelligent agent.

Therefore, GLM-5V-Turbo sets the tone right from the start:

Multimodal perception must not merely be an auxiliary interface; it must become a core component of model reasoning, planning, tool invocation, and task execution.

To achieve true “native” functionality, Zhipu has undertaken three major architectural overhauls:

1. Reconstructing the Visual Foundation: CogViT Designed for Agents

Intelligent agents need to control users’ computers, so in graphical user interfaces, the model must not only know what is in an image but also pay attention to various easily overlooked details, even if a button is only a few pixels in size.

To this end, Zhipu has developed a parameter-efficient visual encoder called CogViT, which is pre-trained in two stages:

The first stage focuses on feature reconstruction, where two teacher models, SigLIP2 and DINOv3, help the model recognize semantics and textures, respectively, enhancing the model’s visual feature expression through masked image modeling.
The second stage involves image-text alignment, using the NaFlex scheme to handle dynamic resolutions, directly increasing the global batch size to 64K.
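
To make the first stage concrete, here is a minimal sketch of a masked feature-reconstruction loss with two teachers. It is not taken from the report: the mean-squared-error objective, the equal weights `w_sem` and `w_tex`, the separate student outputs per teacher, and all function names are assumptions chosen for illustration, and random vectors stand in for real SigLIP2 and DINOv3 features.

```python
import random

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def sample_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the student."""
    num_masked = max(1, int(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), num_masked))

def distill_loss(student_sem, student_tex, teacher_sem, teacher_tex,
                 masked, w_sem=1.0, w_tex=1.0):
    """Average the two teacher-matching errors over masked patches only.

    Assumption: the student emits one feature per teacher for every patch.
    """
    total = 0.0
    for i in masked:
        total += w_sem * mse(student_sem[i], teacher_sem[i])
        total += w_tex * mse(student_tex[i], teacher_tex[i])
    return total / len(masked)

rng = random.Random(0)
num_patches, dim = 8, 4
# Random stand-ins for the semantic and texture teacher features.
teacher_sem = [[rng.random() for _ in range(dim)] for _ in range(num_patches)]
teacher_tex = [[rng.random() for _ in range(dim)] for _ in range(num_patches)]
masked = sample_mask(num_patches, mask_ratio=0.5, rng=rng)

# A student that reproduces both teachers exactly has zero loss.
perfect = distill_loss(teacher_sem, teacher_tex, teacher_sem, teacher_tex, masked)
# A student that outputs all zeros is penalised.
zeros = [[0.0] * dim for _ in range(num_patches)]
poor = distill_loss(zeros, zeros, teacher_sem, teacher_tex, masked)
```

The point of the sketch is only that the loss is computed on the hidden patches, so the student has to infer both semantic and texture features from the visible context.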

This design approach significantly enhances the spatial perception and geometric understanding capabilities of Zhipu’s new model, laying the groundwork for subsequent control of web pages and mobile UIs.

2. Balancing Engineering and Algorithms: Multimodal Multi-Token Prediction (MMTP)

The introduction of multimodal capabilities inevitably leads to a sharp increase in memory and compute consumption.

Developers in the AI field are well aware that Zhipu’s compute has been tight over the past six months. Its earlier price adjustments sparked intense discussion and indirectly confirmed that, at large inference scale, computational costs are a black hole.

Introducing multi-token prediction (MTP) to enhance inference efficiency is a common practice in the industry. However, Zhipu made a textbook-level engineering decision when implementing MTP:

Instead of directly passing a large amount of visual information to the MTP prediction head, a shared special token is used as a placeholder for visual input.

This seemingly simple change is a clear case of “engineering pragmatism.” It significantly reduces the communication complexity in pipeline parallelism and sidesteps the memory blow-up that passing visual features to the prediction head would cause.

Moreover, while ensuring model convergence stability, this “clever idea” can greatly reduce the computational costs of training and inference.
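
The placeholder idea can be sketched in a few lines. This is an illustration under stated assumptions, not the report's implementation: the token name `<|image|>`, the one-placeholder-per-visual-position mapping, and the helper names are all hypothetical.

```python
IMAGE_PLACEHOLDER = "<|image|>"  # hypothetical name for the shared special token

def build_mtp_inputs(sequence):
    """Map a mixed text/visual sequence to the inputs seen by the MTP head.

    Each item is ("text", token_string) or ("visual", feature_vector).
    Visual items are replaced by one shared placeholder token, so the
    MTP head never receives the (large) visual feature vectors.
    """
    mtp_inputs = []
    for kind, payload in sequence:
        if kind == "visual":
            mtp_inputs.append(IMAGE_PLACEHOLDER)
        else:
            mtp_inputs.append(payload)
    return mtp_inputs

def payload_size(sequence):
    """Count the floats held in visual features in the original sequence."""
    return sum(len(p) for kind, p in sequence if kind == "visual")

sequence = [
    ("text", "Click"),
    ("visual", [0.1] * 1024),
    ("visual", [0.2] * 1024),
    ("text", "the"),
    ("text", "button"),
]
mtp_inputs = build_mtp_inputs(sequence)
```

In this toy case the main sequence carries 2,048 floats of visual features, while the MTP input is just five tokens, two of which are the same placeholder. That is the kind of saving in data movement the paragraph above describes.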

3. Breaking the Long Tail Curse: A Large-Scale Multimodal Reinforcement Learning System

Currently, the training approach for intelligent agents is fundamentally similar to that of large language models, still relying on reinforcement learning.

However, during the training process of intelligent agents, single-task reinforcement learning can easily lead to model oscillation.

Zhipu’s research team discovered that multi-task collaborative reinforcement learning exposes the model to a richer distribution of policies and even facilitates cross-task cognitive transfer.

As a result, Zhipu conducted joint reinforcement learning across more than 30 task categories and achieved full pipeline decoupling and asynchronous execution in its infrastructure. They not only moved the visual segmentation step from the forward propagation phase to the data loading phase but also implemented extreme memory management for GPU communication.
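
One way to picture moving work out of forward propagation is the sketch below. It assumes that “visual segmentation” here means splitting an image into patches, which the article does not spell out, and it uses a toy dataset class and a stand-in `forward` function rather than any real training framework.

```python
def patchify(image, patch):
    """Split a 2D grid (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles

class PatchedDataset:
    """Toy dataset that does the splitting at load time."""

    def __init__(self, images, patch):
        self.images = images
        self.patch = patch

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # The split happens here, in the data-loading path.
        return patchify(self.images[index], self.patch)

def forward(patches):
    """Stand-in for the model forward pass: it only consumes ready patches."""
    return [sum(tile) / len(tile) for tile in patches]

image = [[r * 4 + c for c in range(4)] for r in range(4)]
dataset = PatchedDataset([image], patch=2)
patches = dataset[0]
features = forward(patches)
```

Because the split runs in the loader, it can overlap with computation on the accelerator instead of sitting on the critical path of every forward pass.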

02 Transitioning from API Distribution to Workflow Management

The underlying technological reconstruction ultimately points to a leap in commercial monetization logic.

The multimodal deep research capabilities exhibited by GLM-5V-Turbo signal two significant shifts in Zhipu’s AI applications:

First, multimodal deep research breaks through the limits of traditional text-based SaaS.

Most AI assistants have only been able to read pure text content. Even when users are allowed to upload images, videos, or PDFs, the AI’s recognition ability can plummet if too much non-text information is included.

However, GLM-5V-Turbo can autonomously execute the workflow of “planning → multimodal reading → state updating,” directly parsing high-value visual information from various charts, documents, and presentations, delivering Markdown business reports and highly structured slides.
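
The “planning → multimodal reading → state updating” loop can be sketched as follows. Everything here is hypothetical scaffolding: the `ResearchState` fields, the stubbed `read_multimodal` function (a real system would call the model), and the sample sources are illustrative assumptions, not details from the report.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    question: str
    pending: list = field(default_factory=list)   # sources still to read
    findings: list = field(default_factory=list)  # (source name, note) pairs

def plan(state, sources):
    """Planning: queue every source that has not been read yet."""
    seen = {name for name, _ in state.findings}
    state.pending = [s for s in sources if s["name"] not in seen]

def read_multimodal(source):
    """Multimodal reading stub: a real system would call the model here."""
    if source["kind"] == "chart":
        return f"chart with {len(source['points'])} data points"
    return f"text of {len(source['text'].split())} words"

def update_state(state, source, note):
    """State updating: record the note for this source."""
    state.findings.append((source["name"], note))

def run(question, sources, max_steps=10):
    """Repeat plan -> read -> update until nothing is pending, then report."""
    state = ResearchState(question)
    for _ in range(max_steps):
        plan(state, sources)
        if not state.pending:
            break
        source = state.pending[0]
        update_state(state, source, read_multimodal(source))
    lines = [f"# {state.question}"]
    lines += [f"- {name}: {note}" for name, note in state.findings]
    return "\n".join(lines)

sources = [
    {"name": "q3_sales.png", "kind": "chart", "points": [3, 5, 8]},
    {"name": "memo.txt", "kind": "text", "text": "Sales rose in Q3"},
]
report = run("How did Q3 go?", sources)
```

The design choice worth noting is that the plan is recomputed from the state on every step, so what the agent reads next always depends on what it has already learned.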

In this regard, Zhipu’s approach closely resembles Anthropic’s recent launch of Claude for Microsoft 365, which moved directly into the Microsoft ecosystem.

Thus, traditional information retrieval tools will inevitably be outclassed by a competitor operating on a different level. When AI can deliver finished, end-to-end reports that include data visualizations, token-based billing will gradually give way to a “project-based billing” model.

Second, the ultimate form of the agent will be a symbiosis between the model and its harness.

Zhipu’s technical report presents an insightful perspective:

The boundaries of a system’s capabilities are no longer solely determined by the model but are shaped by the model and its surrounding framework (Harness).

As one of the leading domestic models, Zhipu continues to provide a richer toolchain (Official Skills) and has achieved seamless integration with widely used frameworks like Claude Code and AutoClaw.

In fact, Zhipu has long recognized that a single AI startup cannot create an ecosystem as powerful as Google. Rather than going all-in, it is better to let globally applicable tools like Claude Code and AutoClaw, which excel at handling terminal and document logic, become agile hands for operating computers.

The long-anticipated myth of the “universal model” is now nearing its end; even strong players like OpenAI cannot achieve AGI solely through large language models. Competitive advantage will shift to the deep coupling of model capabilities and external tools.

After all, enterprise customers, who do most of the paying, have never needed a chatbot that can converse about anything; they need a cognition-driven engine that can integrate seamlessly into existing systems.

03 The Hard Lessons: Three Laws of Intelligent Agent Development

Zhipu’s release of this technical report stands out because the research team candidly shares the design perspectives they summarized during the development process.

This “pitfall guide,” paid for with vast amounts of compute and many sleepless nights, is far more valuable than open-source models and technologies, and it holds significant importance for the entire AI industry.

First, never aim too high; foundational perception is the cornerstone that determines the model’s ceiling.

In the past year, the AI industry has gradually fostered a trend where all product releases come with labels like “deep thinking,” “self-reflection,” and “long-term logical planning,” as if only those labeled are advanced AIs.

However, user feedback reveals that these lofty labels have not been realized in specific application scenarios.

Zhipu found in practice that many seemingly advanced plans ultimately fail not because minor errors accumulate along the way but because the model is “blindly groping” from the very first step, whether by failing to notice subtle UI elements or by misjudging the spatial position of buttons.

The operational logic of intelligent agents is entirely different from that of large language models; visual perception is not a low-level module that can be discarded after initial processing; it continuously constrains the upper limit of the model’s advanced reasoning capabilities.

Second, when training intelligent agents, one should abandon the blind faith in “end-to-end” approaches and actively embrace hierarchical optimization.

This is not to reject the claim that “intelligent agents should be trained with agent-level (rather than large-language-model) reinforcement learning,” but AI companies must also confront high training costs, a scarcity of high-quality trajectory data, and a lack of industry-standard evaluations.

Having the model start with extremely complex long-horizon tasks often ends with it either “getting the form without the essence” or collapsing during training.

Zhipu’s approach is to dissect tasks meticulously, from recognizing icons at the lowest level to predicting single-step actions and planning entire behavior trajectories, conducting hierarchical optimization. This has proven to be not only a necessary compromise when computational resources are limited but also one of the best ways to achieve stable convergence of the model.
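
A staged curriculum of this kind might look like the sketch below. The three stage names follow the paragraph above, but the advance-on-threshold rule, the threshold value, and the toy training and evaluation functions are assumptions made for illustration; the article does not say how Zhipu decides when to move between levels.

```python
STAGES = ["icon_recognition", "single_step_action", "trajectory_planning"]

def run_curriculum(train_step, evaluate, threshold=0.9, max_steps_per_stage=100):
    """Train stage by stage, advancing only once a stage reaches the threshold."""
    history = []
    for stage in STAGES:
        for step in range(1, max_steps_per_stage + 1):
            train_step(stage)
            score = evaluate(stage)
            if score >= threshold:
                history.append((stage, step, score))
                break
        else:
            # The inner loop ran out of steps without reaching the threshold.
            raise RuntimeError(f"stage {stage!r} did not reach {threshold}")
    return history

# Toy stand-ins: each training step raises the stage's score by a fixed amount,
# and harder stages improve more slowly.
skill = {stage: 0.0 for stage in STAGES}
gain = {
    "icon_recognition": 0.5,
    "single_step_action": 0.25,
    "trajectory_planning": 0.125,
}

def toy_train_step(stage):
    skill[stage] = min(1.0, skill[stage] + gain[stage])

def toy_evaluate(stage):
    return skill[stage]

history = run_curriculum(toy_train_step, toy_evaluate, threshold=0.75)
```

With these toy numbers the lowest level clears the bar in two steps and the trajectory level needs six, which mirrors the intuition that long-horizon planning should only be attempted once the cheaper skills beneath it are stable.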

Finally, tasks that cannot be precisely evaluated offer no useful signal.

For current multimodal-capable intelligent agents, the most challenging aspect is not getting them to work but knowing how to objectively “score” their performance.

Compared to dialogue boxes on web pages, real computer environments are filled with openness and uncertainty. Zhipu realized that only by designing a validation process with strict step control that can isolate different dimensions of signals can such end-to-end evaluations become meaningful and guide the model’s iterative process.
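
As a rough illustration of step-controlled validation that separates signal dimensions, the sketch below scores perception, action, and outcome independently and stops at the first deviating step. The three dimensions, the stop-at-first-error rule, and the episode format are assumptions for this example, not the evaluation protocol from the report.

```python
def score_episode(steps, expected, final_state, goal_state):
    """Score one episode on three separately reported dimensions.

    steps / expected: lists of dicts with 'target' (UI element) and 'action'.
    Checking stops at the first step that deviates, so later steps are
    never credited on top of an earlier mistake.
    """
    perception_hits = action_hits = checked = 0
    for got, want in zip(steps, expected):
        checked += 1
        target_ok = got["target"] == want["target"]
        action_ok = got["action"] == want["action"]
        perception_hits += target_ok
        action_hits += target_ok and action_ok
        if not (target_ok and action_ok):
            break
    total = len(expected)
    return {
        "perception": perception_hits / total,
        "action": action_hits / total,
        "outcome": 1.0 if final_state == goal_state else 0.0,
        "steps_checked": checked,
    }

expected = [
    {"target": "search_box", "action": "type"},
    {"target": "submit_button", "action": "click"},
    {"target": "first_result", "action": "click"},
    {"target": "download_link", "action": "click"},
]
steps = [
    {"target": "search_box", "action": "type"},
    {"target": "submit_button", "action": "hover"},  # right element, wrong action
    {"target": "first_result", "action": "click"},
    {"target": "download_link", "action": "click"},
]
result = score_episode(steps, expected,
                       final_state="results_page", goal_state="file_saved")
```

Here the agent found the right element twice but acted correctly only once, so perception scores 0.5 and action 0.25 while the outcome is 0. A single pass/fail number would hide that the failure was in acting, not in seeing.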

04 Conclusion

Having read Zhipu’s technical report, one could say it is less a demonstration and explanation of model capabilities than a long-distance dialogue between the research team and users.

This report does not portray its model as flawless; instead, it raises several hard open questions for the industry:

How can context be compressed and remembered over long-horizon tasks, given that videos and images consume significant memory?

When will models be able to free themselves from relying on human-provided standard answers and develop smarter interaction strategies?

These questions remain unanswered for now.

What we can observe is a rapidly evolving domestic model and the reality that the entire AI industry is entering deep, difficult waters.

Enhancing multimodal capabilities is a necessary step on Zhipu’s path toward a full-stack intelligent agent, but the compute bill follows it every step of the way. Faced with limited computational resources, Zhipu has managed to work around its constraints through ingenious architectural design, extreme memory optimization, and hierarchical training strategies.

GLM-5V-Turbo has already proven its capability to take over users’ computer screens, while the next test is whether the entire market is ready to pay for the productivity of “native multimodal” solutions.