ArXiv preprint

Multisensory Continual Learning: Adapting Pretrained Visuomotor Policies to Force

Stanford University

Adding new sensor modalities is key for contact-rich tasks, but we lack the data to scale multisensory foundation models. What if not all of our data needs to contain multisensory observations? With MuSe, multitask visuomotor policies learn new modalities during finetuning with only a fraction of the pretraining data. We find that MuSe:

  • improves visual generalization from broad-scale pretraining (forward transfer)
  • leverages the new modality to improve on pretraining tasks that are absent from the minimal finetuning stage (backward transfer)

MuSe provides a scalable pathway to training large-scale multisensory robot policies and world models with a practical data composition: large amounts of visual pretraining data, robot vision-action data, and a small amount of diverse multisensory data such as force, tactile, or audio.

How does MuSe work?

Incorporating new modalities requires the model to reason over new input dimensions, without either over-relying on or ignoring them. This requires an algorithm that avoids catastrophic forgetting of knowledge acquired during pretraining while properly incorporating new input modalities through an effective cross-modal representation. We design MuSe to achieve this using three key mechanisms:

  • 1. Multisensory prediction targets: The model predicts future actions, future visual observations, and future F/T signals, encouraging a shared representation grounded in both visual dynamics and contact.
  • 2. Multi-stage (early + late) fusion: Language, images, and actions are combined with token concatenation (early fusion). F/T conditioning is further amplified via cross-attention adapters (late fusion).
  • 3. Experience replay: Finetuning mixes new multisensory data with the original pretraining data. The original pretraining data does not contain F/T inputs, so we use optional masking in replay when F/T is not available. This enables preserving old skills while adding the new modality.
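The replay mechanism with optional F/T masking can be sketched as follows. This is a minimal illustration, not the MuSe implementation: the `Episode` and `Sample` types, the `replay_ratio` parameter, the 6-channel F/T layout, and the zero-filled placeholder are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

FT_DIM = 6  # assumed layout: 3 force + 3 torque channels


@dataclass
class Episode:
    task: str
    ft: Optional[List[float]] = None  # None for vision-action-only pretraining data


@dataclass
class Sample:
    task: str
    ft: List[float]     # zero placeholder when F/T is unavailable
    ft_available: bool  # mask flag: False means the model should ignore `ft`


def to_sample(ep: Episode) -> Sample:
    """Attach an F/T mask so both data sources share one input format."""
    if ep.ft is None:
        return Sample(ep.task, [0.0] * FT_DIM, False)
    return Sample(ep.task, list(ep.ft), True)


def sample_batch(pretrain: List[Episode], finetune: List[Episode],
                 batch_size: int, replay_ratio: float,
                 rng: random.Random) -> List[Sample]:
    """Mix replayed pretraining episodes with new multisensory episodes."""
    n_replay = round(batch_size * replay_ratio)
    batch = [to_sample(rng.choice(pretrain)) for _ in range(n_replay)]
    batch += [to_sample(rng.choice(finetune)) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)
    return batch
```

A call such as `sample_batch(pretrain, finetune, 8, 0.5, random.Random(0))` yields a batch in which half of the samples carry `ft_available=False`, so the policy sees both masked and unmasked F/T inputs during finetuning. The mixing ratio used here is a placeholder, not a value reported for MuSe.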
MuSe method detail figure.
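The multisensory prediction targets can be read as a weighted sum of per-target losses in which the F/T term is skipped whenever a sample has no force label. The sketch below assumes mean squared error as a stand-in for whatever objective MuSe actually uses; the function names and the weights `w_action`, `w_image`, and `w_ft` are hypothetical.

```python
from typing import Optional, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multisensory_loss(pred_action: Sequence[float], action: Sequence[float],
                      pred_image: Sequence[float], image: Sequence[float],
                      pred_ft: Sequence[float],
                      ft: Optional[Sequence[float]] = None,
                      w_action: float = 1.0, w_image: float = 1.0,
                      w_ft: float = 1.0) -> float:
    """Sum the action and image losses; add the F/T loss only if labeled."""
    loss = w_action * mse(pred_action, action) + w_image * mse(pred_image, image)
    if ft is not None:  # replayed pretraining samples carry no F/T label
        loss += w_ft * mse(pred_ft, ft)
    return loss
```

Under this formulation, replayed pretraining samples still train the action and visual heads, while the F/T head receives gradient only from the force-labeled finetuning episodes.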

How is MuSe trained and evaluated?

Pretraining uses 1,271 vision-action episodes across 21 tasks. Finetuning adds 434 force-labeled episodes across 5 contact-rich tasks.

We evaluate MuSe in the context of multisensory continual learning: the challenge of extending a pretrained policy to incorporate new sensory modalities for new tasks while retaining performance on the original tasks. To do so, we evaluate representation learning, forward transfer (performance on finetuning tasks), and backward transfer (performance on pretraining tasks with the new sensor modality).
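One way to make the two transfer measures concrete is as a difference in mean per-task success rate between the adapted policy and a reference policy. The sketch below is an assumption about how such a comparison could be computed, not the paper's evaluation code; `Trials`, `mean_success_rate`, and `transfer_gain` are hypothetical names.

```python
from typing import Dict, Tuple

Trials = Dict[str, Tuple[int, int]]  # task name -> (successes, attempts)


def mean_success_rate(trials: Trials) -> float:
    """Average the per-task success rates so every task counts equally."""
    rates = [successes / attempts for successes, attempts in trials.values()]
    return sum(rates) / len(rates)


def transfer_gain(adapted: Trials, reference: Trials) -> float:
    """Success-rate gain of the adapted policy over a reference policy.

    Forward transfer: both evaluated on finetuning tasks, with a policy
    that has no F/T input as the reference.
    Backward transfer: both evaluated on pretraining tasks, with the
    pretrained policy as the reference.
    """
    return mean_success_rate(adapted) - mean_success_rate(reference)
```

The same function covers both measures; only the task set and the choice of reference policy change.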

Results overview.

Does MuSe learn effective cross-modal representations?

MuSe predicts force where force was never supervised.

After learning from limited multisensory finetuning data, MuSe can predict F/T traces on pretraining tasks that only used vision-action supervision.

  • Shared multisensory representation: F/T prediction on unlabeled pretraining tasks suggests the model learns a shared multisensory representation that transfers beyond the finetuning distribution.
  • Adaptive compliance at runtime: predicted future F/T provides a contact profile that can set compliance targets during deployment.
Orange curves show MuSe force predictions aligned with measured F/T signals on pretraining tasks where force labels were withheld during training.

MuSe leverages F/T sensing to improve on finetuning tasks

+35% over baselines without F/T on contact-rich finetuning tasks

MuSe improves performance on pretraining tasks

+20% over pretrained model on pretraining tasks

Q&A

Why F/T?

MuSe uses F/T in two complementary ways. As an input, force histories tell the policy when contact has started, whether the robot is pushing too hard, and when a peg has jammed. As an output, predicted future F/T gives the controller an expected contact profile that can be converted into adaptive compliance targets, so the robot can stay stiff in free space and become compliant when contact is anticipated.
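As an illustration of converting a predicted contact profile into a compliance target, the sketch below linearly interpolates a scalar stiffness between a stiff free-space value and a compliant contact value, based on the peak predicted force magnitude. The function name, the stiffness values, and the force thresholds are all hypothetical and are not taken from MuSe.

```python
import math
from typing import Sequence, Tuple


def stiffness_from_predicted_force(
        pred_forces: Sequence[Tuple[float, float, float]],
        k_free: float = 1000.0,    # assumed stiffness in free space
        k_contact: float = 200.0,  # assumed stiffness under full contact
        f_on: float = 1.0,         # assumed force (N) where softening starts
        f_full: float = 5.0        # assumed force (N) where softening saturates
) -> float:
    """Map predicted forces over a horizon to one stiffness target."""
    # Peak force magnitude anticipated anywhere in the prediction horizon.
    peak = max(math.sqrt(fx * fx + fy * fy + fz * fz)
               for fx, fy, fz in pred_forces)
    # 0 when no contact is expected, 1 when strong contact is expected.
    alpha = min(1.0, max(0.0, (peak - f_on) / (f_full - f_on)))
    return k_free + alpha * (k_contact - k_free)
```

With these placeholder values, a horizon with no predicted force keeps the robot at `k_free`, while a predicted 3 N contact lands halfway between the two stiffness levels. Because the mapping depends only on magnitude, it tolerates moderate errors in the predicted trace.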

We find that despite imperfect F/T predictions, MuSe improves success rate on contact-rich tasks: approximate magnitude and direction are all that is needed to improve forceful interactions.

Why care about backward transfer?

A new sensor modality should extend a robot's capabilities, enabling the robot to continually adapt to new skills without forgetting old ones. Backward transfer confirms that MuSe representations generalize beyond the finetuning distribution: the model learns to use force not just on the tasks it was finetuned on, but also on the diverse set of pretraining tasks where no force labels existed.

Why not train on everything at once?

New sensors are constantly being developed, and existing large-scale pretraining datasets do not contain their data. Re-collecting all pretraining data every time a new sensor becomes available is impractical at scale. MuSe is designed as a realistic framework to tackle this challenge: learn from large vision-action pretraining data, then incorporate new modalities through a small multisensory finetuning stage without starting over.

BibTeX

@inproceedings{clark2026multisensory,
  title     = {Multisensory Continual Learning: Adapting Pretrained Visuomotor Policies to Force},
  author    = {Jaden Clark and Changhao Wang and Yihuai Gao and Seongheon Hong and Hojung Choi and Mark Cutkosky and Yifan Hou and Shuran Song},
  booktitle = {arXiv preprint},
  year      = {2026}
}