MAny: The Paper That Exposes a Hidden MLLM Crisis

The MAny paper identifies a critical blind spot in multimodal instruction tuning: forgetting isn't just about language reasoning, but also about visual perception and parameter stability. Their merging approach offers a practical fix, but the real question is who will commercialize it first.

A new arXiv paper argues that Multimodal Large Language Models (MLLMs) suffer from a 'dual-forgetting' problem under sequential fine-tuning, one that current methods largely ignore. The proposed solution, MAny, goes after both halves of the problem at once through model merging instead of protecting only the language side.
  • Researchers expose a 'dual-forgetting' phenomenon in Multimodal LLMs: perception drift in cross-modal projection and reasoning collapse in low-rank parameter space.
  • Existing continual learning methods focus only on the language backbone, missing half the problem—MAny addresses both simultaneously.
  • The paper's merging approach is elegant, but its real impact will depend on whether big labs like OpenAI or Google adopt it or it remains an academic curiosity.

Why Has the Industry Ignored Dual-Forgetting for So Long?

For years, the multimodal continual learning community has been obsessed with the language backbone. Every benchmark, every paper, every startup demo focused on preventing the LLM from forgetting how to reason. But the MAny paper, published on arXiv on April 15, 2026, drops a bombshell: the real problem is two-fold. The cross-modal projection space—the bridge that aligns visual features with language tokens—drifts when new tasks are introduced. Simultaneously, the low-rank parameter space collapses under sequential updates. The authors show that ignoring either leads to catastrophic failure on even simple tasks. I've been saying for months that the industry's fixation on reasoning was a red herring. This paper proves it.
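To make "projection drift" concrete, here is a minimal sketch of how one might measure it. It is my own illustration, not the paper's metric: it assumes the projector is a single linear map, and the names `project`, `projection_drift`, `probe`, `w_old` and `w_new` are all hypothetical. The idea is to push the same fixed probe features through the projector before and after a new task, then report the mean cosine distance between the two outputs.

```python
import math


def project(features, weight):
    """Apply a linear projector: each feature (length d_in) times weight (d_in x d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        for f in features
    ]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def projection_drift(features, weight_before, weight_after):
    """Mean cosine distance between projections of the same probe features."""
    before = project(features, weight_before)
    after = project(features, weight_after)
    distances = [1.0 - cosine(b, a) for b, a in zip(before, after)]
    return sum(distances) / len(distances)


# Toy probe set and projector weights (assumed values, for illustration only).
probe = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_old = [[1.0, 0.0], [0.0, 1.0]]   # identity projector
w_new = [[0.0, 1.0], [-1.0, 0.0]]  # same projector rotated by 90 degrees

print(projection_drift(probe, w_old, w_old))  # close to 0: no drift
print(projection_drift(probe, w_old, w_new))  # close to 1: outputs became orthogonal
```

A score near zero means the bridge between vision and language still points the same way; a score that climbs after each new task is the kind of drift the article describes. Real projectors are often MLPs, so a production version would compare activations from the actual module instead of a hand-rolled matrix product.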

Does MAny Actually Fix the Problem or Just Paper Over It?

The paper proposes a merging strategy that jointly stabilizes both spaces. The results are impressive on standard benchmarks, but here's the catch: merging is a post-hoc operation. It doesn't change the underlying training dynamics. That means MAny is a bandage, not a cure. However, it's a very good bandage. The authors report significant reductions in forgetting without sacrificing task performance. The real test will be in production environments where tasks arrive in unpredictable sequences. I expect to see follow-up work that integrates this merging logic directly into the training loop within the next 12 months.
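What "post-hoc" means in practice can be sketched in a few lines. The example below is not MAny's merging rule, which this article does not spell out; it swaps in the simplest stand-in I know, a weighted average of per-task parameter deltas added back onto a shared starting point. The function `merge_task_deltas` and the parameter names `projector` and `lora` are hypothetical. The point is only that merging runs after every task has finished training and never touches the training loop.

```python
def merge_task_deltas(base, task_models, weights=None):
    """Merge task-specific parameters into one model after training.

    base:        dict mapping parameter name -> list of floats (shared starting point)
    task_models: list of dicts with the same keys, one per fine-tuned task
    weights:     optional per-task coefficients; defaults to a uniform average
    """
    if not task_models:
        return {name: list(values) for name, values in base.items()}
    if weights is None:
        weights = [1.0 / len(task_models)] * len(task_models)
    if len(weights) != len(task_models):
        raise ValueError("need exactly one weight per task model")

    merged = {}
    for name, base_values in base.items():
        merged_values = list(base_values)
        for weight, model in zip(weights, task_models):
            # Add this task's weighted change relative to the shared base.
            for i, (tuned, start) in enumerate(zip(model[name], base_values)):
                merged_values[i] += weight * (tuned - start)
        merged[name] = merged_values
    return merged


# Toy parameters for both spaces (assumed values, for illustration only).
base = {"projector": [0.0, 0.0], "lora": [1.0, 1.0]}
task_a = {"projector": [2.0, 0.0], "lora": [1.0, 3.0]}
task_b = {"projector": [0.0, 4.0], "lora": [1.0, 1.0]}

merged = merge_task_deltas(base, [task_a, task_b])
print(merged)  # {'projector': [1.0, 2.0], 'lora': [1.0, 2.0]}
```

Notice that both the projector and the low-rank parameters go through the same merge call here. Whatever weighting scheme the paper actually uses, that is the structural difference from language-only approaches: both spaces are merged, and neither is merged during training.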

Who Benefits Most from This Discovery?

Short-term, the clear winners are academic labs and open-source projects that can adopt MAny's techniques without licensing fees. Long-term, the big winners will be companies like Hugging Face and Replicate, who can package this as a service for fine-tuning multimodal models. The losers are proprietary fine-tuning platforms that have built their business on the flawed single-space assumption—they'll need to scramble to update their offerings. OpenAI and Google, with their massive resources, could integrate something similar internally, but they have less incentive to publish their methods.

Is This the End of Catastrophic Forgetting in MLLMs?

No. But it's a significant step. Catastrophic forgetting is a fundamental challenge in neural networks, and no single paper will solve it. What MAny does is provide a clear, actionable framework for addressing the multimodal dimension. The paper's dual-space analysis is its real contribution—the merging technique is just the proof of concept. Future work will likely build on this diagnostic lens. I predict that within two years, every serious multimodal fine-tuning pipeline will include some form of dual-space monitoring.

Dimension                 | Existing Approaches      | MAny
Focus                     | Language backbone only   | Dual-space: perception + reasoning
Method                    | Regularization or replay | Model merging
Forgetting type addressed | Reasoning collapse       | Both drift and collapse
Computational overhead    | Low to moderate          | Moderate (merging step)
Open source availability  | Varies                   | Expected (arXiv paper)
Verdict                   | Incomplete               | More comprehensive, but post-hoc

Thesis: MAny's dual-forgetting analysis is a critical wake-up call, but the merging solution is a tactical fix, not a strategic revolution. The paper's real value is diagnostic: it forces the field to acknowledge that forgetting is not a single problem. In the short term, expect a flurry of papers applying dual-space analysis to other modalities like audio and video. In the long term, the winners will be the companies that can productize this diagnostic approach—think monitoring tools and automated fine-tuning pipelines. I expect Hugging Face to release a dual-space evaluation toolkit by Q4 2026, because their community-driven model aligns perfectly with this kind of analytical contribution. The losers will be any startup that has built a multimodal fine-tuning service on the old single-space assumptions—they'll need to pivot or be left behind.

  1. Hugging Face will release a dual-space forgetting monitoring toolkit for their Transformers library by Q4 2026.
  2. At least one major cloud provider (AWS, GCP, or Azure) will integrate MAny-like merging into their managed MLLM fine-tuning service by mid-2027.
  3. Startups offering proprietary multimodal fine-tuning without dual-space awareness will lose market share to open-source alternatives within 18 months.

April 2026: MAny paper published on arXiv. Authors expose dual-forgetting in MLLMs and propose a merging-based solution.

  • Insight 1: The dual-forgetting problem is not just a technical nuance—it's a fundamental limitation that has been silently degrading the performance of every multimodal model deployed in sequential learning settings.
  • Insight 2: MAny's merging approach, while effective, is a post-hoc fix; the next breakthrough will come from integrating dual-space stability into the training objective itself.
  • Insight 3: The paper's biggest impact may be in shifting the research community's focus from language-centric forgetting to a more holistic, multi-modal view.

Source and attribution

arXiv
MAny: Merge Anything for Multimodal Continual Instruction Tuning
