Tag: multimodal

All the articles with the tag "multimodal".

From AlexNet to World Models: The Evolution of Multi-Modal Neural Networks

2 Jun, 2026

A ground-up tour of how neural networks learned to see, then to see-and-read, and finally to imagine. From AlexNet and CNNs, through CLIP and the vision-language models behind GPT-4V, to world models like Dreamer, V-JEPA 2, and LeWorldModel, with architectures, math, and benchmark numbers along the way.

Kimi K2.5: Joint Text–Vision Training and the Agent Swarm

19 May, 2026

A walkthrough of two ideas behind Kimi K2.5: how joint text–vision pre-training and RL make each modality help the other, and how Agent Swarm replaces sequential tool use with a learned parallel orchestrator.
