Multisensory Continual Learning

How does MuSe work?

Incorporating new modalities requires the model to reason over new input dimensions, without either over-relying on or ignoring them. This requires an algorithm that avoids catastrophic forgetting of knowledge acquired during pretraining while properly incorporating new input modalities through an effective cross-modal representation. We design MuSe to achieve this using three key mechanisms:

1. Multisensory prediction targets: The model predicts future actions, future visual observations, and future F/T signals, encouraging a shared representation grounded in both visual dynamics and contact.
2. Multi-stage (early + late) fusion: Language, images, and actions are combined with token concatenation (early fusion). F/T conditioning is further amplified via cross-attention adapters (late fusion).
3. Experience replay: Finetuning mixes new multisensory data with the original pretraining data. The original pretraining data does not contain F/T inputs, so we use optional masking in replay when F/T is not available. This enables preserving old skills while adding the new modality.
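The replay mechanism can be illustrated with a minimal sketch. This is not the MuSe implementation: the function names, the 6-channel F/T layout, the zero placeholder, and the `replay_ratio` parameter are all assumptions made for illustration. The page only states that masking is applied when F/T is unavailable in replay, so applying the same mask to the F/T prediction loss is a further assumption.

```python
import random

FT_DIM = 6  # assumed layout: 3 force + 3 torque channels


def make_replay_batch(pretrain_eps, finetune_eps, batch_size, replay_ratio, rng):
    """Mix replayed pretraining samples with new multisensory samples.

    Pretraining samples carry no F/T reading, so they receive a zero
    placeholder and ft_mask=0. Finetuning samples keep their reading
    and receive ft_mask=1.
    """
    n_replay = round(batch_size * replay_ratio)
    batch = []
    for _ in range(n_replay):
        ep = rng.choice(pretrain_eps)
        batch.append({"obs": ep["obs"], "ft": [0.0] * FT_DIM, "ft_mask": 0})
    for _ in range(batch_size - n_replay):
        ep = rng.choice(finetune_eps)
        batch.append({"obs": ep["obs"], "ft": ep["ft"], "ft_mask": 1})
    rng.shuffle(batch)
    return batch


def masked_ft_loss(pred_ft, target_ft, ft_mask):
    """Mean squared F/T error over samples that actually have F/T labels."""
    total, count = 0.0, 0
    for pred, target, mask in zip(pred_ft, target_ft, ft_mask):
        if mask:
            total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
            count += 1
    return total / count if count else 0.0
```

In this sketch the mask travels with each sample, so a model can ignore the placeholder F/T input on replayed data while still receiving action and visual supervision from it.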

How is MuSe trained and evaluated?

Dataset overview: 1,271 pretraining episodes across 21 tasks and 434 force-labeled finetuning episodes across 5 contact-rich tasks.

Pretraining uses 1,271 vision-action episodes across 21 tasks. Finetuning adds 434 force-labeled episodes across 5 contact-rich tasks.

We evaluate MuSe in the context of multisensory continual learning: the challenge of extending a pretrained policy to incorporate new sensory modalities for new tasks, while retaining performance on the original tasks. To do so, we evaluate representation learning, forward transfer (performance on finetuning tasks), and backward transfer (performance on pretraining tasks with the new sensor modality).
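As a worked example of the two transfer measures, the sketch below averages the percentage-point gap between MuSe and a baseline over the tasks where both report a success rate. The success rates are the ones listed in the per-task results on this page. The aggregation itself is an assumption: the page does not say how its headline gains are computed, and `mean_gap` is a hypothetical helper.

```python
def mean_gap(method, baseline):
    """Mean percentage-point gap over tasks that both dicts report."""
    shared = [task for task in method if task in baseline]
    return sum(method[task] - baseline[task] for task in shared) / len(shared)


# Success rates (%) as listed in the per-task results on this page.
muse_finetune = {"vase_wiping": 77, "peg_insertion": 87, "pick_and_place": 75}
no_ft = {"vase_wiping": 33, "peg_insertion": 60}

muse_pretrain = {"whiteboard_wiping": 83, "peg_insertion": 67, "pick_and_place": 65}
pretrain_only = {"whiteboard_wiping": 57, "peg_insertion": 53}

# Forward transfer: MuSe vs. the No F/T baseline on finetuning tasks.
forward_gap = mean_gap(muse_finetune, no_ft)  # (44 + 27) / 2 = 35.5

# Backward transfer: MuSe vs. the Pretrain Only baseline on pretraining tasks.
backward_gap = mean_gap(muse_pretrain, pretrain_only)  # (26 + 14) / 2 = 20.0
```

Under this reading, the two gaps come out close to the +35% and +20% headline figures. Pick and Place is excluded from both averages because the page reports no No F/T or Pretrain Only number for it.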

Does MuSe learn effective cross-modal representations?

MuSe predicts force where force was never supervised.

After learning from limited multisensory finetuning data, MuSe can predict F/T traces on pretraining tasks that only used vision-action supervision.

Shared multisensory representation: F/T prediction on unlabeled pretraining tasks suggests the model learns a shared multisensory representation that transfers beyond the finetuning distribution.
Adaptive compliance at runtime: predicted future F/T provides a contact profile that can set compliance targets during deployment.
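To make the adaptive compliance idea concrete, here is a minimal sketch of one way a predicted force could be mapped to a stiffness target. The linear blend, the function name, and every numeric value (stiffness levels and the contact force threshold) are hypothetical. The page does not specify how MuSe converts predicted F/T into compliance targets.

```python
import math


def stiffness_from_predicted_ft(pred_force, k_free=800.0, k_contact=150.0, f_contact=5.0):
    """Blend from stiff (free space) to compliant (anticipated contact).

    pred_force: predicted future force vector [fx, fy, fz] in newtons.
    k_free, k_contact: assumed stiffness levels in N/m.
    f_contact: assumed force magnitude at which the robot is fully compliant.
    """
    magnitude = math.sqrt(sum(f * f for f in pred_force))
    blend = min(magnitude / f_contact, 1.0)  # 0 in free space, 1 at firm contact
    return k_free + blend * (k_contact - k_free)
```

With these placeholder values, a predicted force of zero keeps the arm at the free-space stiffness, and any predicted force at or above `f_contact` drops it to the contact stiffness before the contact occurs.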

Orange curves show MuSe force predictions aligned with measured F/T signals on pretraining tasks where force labels were withheld during training.

MuSe leverages F/T sensing to improve on finetuning tasks

+35% over baselines without F/T on contact-rich finetuning tasks

Vase Wiping (3x playback)

To wipe the vase, the policy leverages F/T sensing to sense when contact with the vase is made and adjusts its compliance to apply sufficient force to maintain contact and follow the curved surface, but not so much force that it jams into the vase and breaks the F/T limit set on the robot arm.

MuSe: 77% success

✅ Generalizes to erasing an orange drawing that does not appear in the finetuning data.
✅ Tracks the middle surface and regulates pressure through the wipe.
✅ Generalizes to a larger eraser without applying too much force.
✅ Successfully wipes the vase with a larger eraser while maintaining safe contact.

Use force feedback to maintain contact with the curved vase.
Adjust compliance to maintain forceful but safe contact across drawing configurations and colors.

No Pretrain: 53% success
No F/T: 33% success

Less robust to different drawing appearances and vase variations.

Applies too little force, leaving the drawing partially unerased.

Fails to generalize to an out-of-distribution orange drawing.

Gets stuck against the vase after reaching the F/T safety limit.

Fail to generalize to unseen drawings and vase variations without pretraining.
Apply insufficient force or exceed the F/T safety limit without force feedback.

Peg Insertion (3x playback)

The policy must fully insert the peg into a hole. The hole angle varies within ±20 degrees, and the hole has high friction and internal serrations. The policy must therefore leverage F/T feedback to assess the direction in which to push and adjust its compliance to stay stiff in the direction of the hole but remain compliant in other directions. The policy can also leverage F/T feedback to find the hole if the initial insertion is imprecise.
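Staying stiff in the direction of the hole while remaining compliant in other directions can be written as a direction-dependent stiffness matrix, K = k_lateral · I + (k_axial − k_lateral) · d dᵀ, where d is the unit vector along the hole axis. The sketch below builds that matrix for a hole tilted by a given angle. It is an illustration only: the stiffness values, the choice of tilt plane, and the function names are assumptions, and the tilt angle is passed in directly, whereas the policy has to infer it from F/T feedback.

```python
import math


def directional_stiffness(theta_deg, k_axial=1000.0, k_lateral=100.0):
    """3x3 stiffness matrix: stiff along the hole axis, compliant across it.

    theta_deg: assumed hole tilt in the x-z plane (0 means straight down).
    """
    theta = math.radians(theta_deg)
    d = [math.sin(theta), 0.0, -math.cos(theta)]  # unit vector along the hole axis
    return [
        [k_lateral * (i == j) + (k_axial - k_lateral) * d[i] * d[j] for j in range(3)]
        for i in range(3)
    ]


def matvec(K, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(K[i][j] * v[j] for j in range(3)) for i in range(3)]
```

A displacement along d is resisted with `k_axial`, while a displacement perpendicular to d is resisted only with `k_lateral`, which lets the peg slide sideways instead of jamming.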

MuSe: 87% success

✅ Adapts to a backward-tilted hole and completes insertion.
✅ Predicts high force before contact, decreasing stiffness and preventing the robot from jamming and breaking the F/T limit.
✅ Identifies the correct hole orientation and fully inserts the peg.
✅ Inserts into holes at varying angles with internal serrations.

Policy adapts to insert peg in holes at varying angles (±20 degrees) with serrations that easily cause jams.

Use F/T feedback to precisely insert the peg, identify hole angle, and escape serrations.
Generalize to different peg sizes.

No Pretrain: 40% success
No F/T: 60% success

Misses the insertion target.

Breaks the F/T limit during insertion.

Gets stuck after contact due to high forces and struggles to generalize without pretraining.

Cannot adapt its compliance profile and breaks the F/T limit on impact.

Miss the insertion target or struggle to generalize without broad visual pretraining.
Jam or get stuck without force feedback for contact-aware adaptation.

Pick and Place (3x playback)

The policy must place the can in the bowl with various distractors and bowl color changes. The policy without pretraining struggles to generalize.

MuSe: 75% success

✅ Robustness to distractors
✅ Robustness to object colors
✅ Robustness to object configurations

Remain robust to visual changes and distractors.

No Pretrain: 50% success

Struggles with distractors and task variation without broad visual pretraining.

Decreased performance under visual changes.

Fail with distractors and visual changes without pretraining.

MuSe improves performance on pretraining tasks

+20% over pretrained model on pretraining tasks

Whiteboard Wiping (3x playback)

To erase the drawing, the policy leverages F/T sensing to sense when contact with the board is made and adjusts its compliance to apply sufficient force to erase the drawing, but not so much force that it jams into the table, moves the whiteboard, or breaks the F/T limit set on the robot arm.

MuSe: 83% success

✅ Adjusts its compliance profile to maintain forceful but safe contact.
✅ Adapts to varying initial conditions.
✅ Erases drawings in varying locations.

Use F/T feedback to identify contact with the board.
Accurate F/T predictions enable compliance adaptation for forceful but safe contact.

Pretrain Only: 57% success
No ER: 3% success

Without F/T feedback, the model cannot adapt its compliance and breaks the F/T limit on contact.

Forgets the original wiping behavior after multisensory finetuning.

Applies insufficient force: without F/T feedback, it cannot react to forces or adjust stiffness.

Break the F/T limit, jam into the board, or fail to detect board contact.
Forget the task and fail after finetuning.

Peg Insertion (6x playback)

The policy must leverage F/T sensing to adjust its compliance, find the direction of the hole, and release the peg from internal high-friction serrations. Backward transfer tests pegs with different sizes, shapes (rectangular and small cylindrical pegs that demand different strategies than the finetuning pegs), and colors to assess whether the model preserves strategies from pretraining that may interfere with finetuning strategies.

MuSe: 67% success

✅ Inserts a red rectangular peg in a green hole (configuration seen in pretraining but not finetuning).
✅ Regulates contact to insert a small circular purple peg in a white hole (configuration seen in pretraining but not finetuning).

Predict high-F/T interactions and adjust compliance to prevent jams.
Leverage F/T feedback to precisely identify the hole.

Pretrain Only: 53% success
No ER: 13% success

Fails to leverage contact for proper insertion.

Overfits to a finetuning-specific twisting strategy and misses the rectangular peg insertion.

The model without a contact signal fails to insert the purple peg.

Loses robust pretraining behavior and drifts away from the hole.

Jam into serrations or fail to use contact for insertion.
Overfit to finetuning strategies that fail on pretraining peg configurations.

Pick and Place (3x playback)

The policy must pick and place object configurations that exist in pretraining data but not in finetuning data. This assesses whether the model preserves generalization capabilities from pretraining.

MuSe: 65% success

✅ Correctly places the pepper in the pot.
✅ Correctly places the pepper on the plate.

Adapt to various object configurations.

No ER: 55% success

Model trained without ER lacks robustness on pretraining tasks.

Forgets task structure from pretraining.

Forget task structure or lose robustness on pretraining tasks.

Q&A

Why F/T?

MuSe uses F/T in two complementary ways. As an input, force histories tell the policy when contact has started, whether the robot is pushing too hard, and when a peg has jammed. As an output, predicted future F/T gives the controller an expected contact profile that can be converted into adaptive compliance targets, so the robot can stay stiff in free space and become compliant when contact is anticipated.

We find that even with imperfect F/T predictions, MuSe improves success rate on contact-rich tasks: approximate magnitude and direction are all that is needed to improve forceful interactions.
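One simple way to quantify "approximate magnitude and direction" is to compare a predicted force vector with the measured one using a direction cosine and a magnitude ratio. The sketch below is a hypothetical diagnostic, not a metric reported on this page, and `force_agreement` is a name introduced here for illustration.

```python
import math


def force_agreement(pred, meas, eps=1e-9):
    """Return (direction cosine, magnitude ratio) for predicted vs. measured force.

    A cosine near 1 means the predicted force points the right way.
    A ratio near 1 means its magnitude is about right.
    """
    dot = sum(p * m for p, m in zip(pred, meas))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_meas = math.sqrt(sum(m * m for m in meas))
    cosine = dot / max(norm_pred * norm_meas, eps)
    ratio = norm_pred / max(norm_meas, eps)
    return cosine, ratio
```

For example, a prediction of 8 N against a measured 10 N along the same axis gives a cosine of 1.0 and a ratio of 0.8: the magnitude is off by 20%, but the direction, and therefore the axis along which to soften, is still correct.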

Why care about backward transfer?

A new sensor modality should extend a robot's capabilities, enabling the robot to continually adapt to new skills without forgetting old ones. Backward transfer confirms that MuSe representations generalize beyond the finetuning distribution: the model learns to use force not just on the tasks it was trained with, but also on the diverse set of pretraining tasks where no force labels existed.

Why not train at once?

Sensors are constantly developing and existing large-scale pretraining datasets do not contain them. Re-collecting all pretraining data every time a new sensor becomes available is impractical at scale. MuSe is designed as a realistic framework to tackle this challenge: learn from large vision-action pretraining data, then incorporate new modalities through a small multisensory finetuning stage without starting over.

BibTeX

@inproceedings{clark2026multisensory,
  title     = {Multisensory Continual Learning: Adapting Pretrained Visuomotor Policies to Force},
  author    = {Jaden Clark and Changhao Wang and Yihuai Gao and Seongheon Hong and Hojung Choi and Mark Cutkosky and Yifan Hou and Shuran Song},
  booktitle = {arXiv preprint},
  year      = {2026}
}

Acknowledgements

Jaden Clark is supported by the Knight-Hennessy Fellowship and the NSF Graduate Research Fellowship Program (GRFP). This work was supported in part by NSF Awards #2143601, #2037101, and #2132519, Toyota Research Institute, Amazon, Stanford System-X, and Stanford HAI Center.

We would like to thank Mandi Zhao and Maximilian Du for presentation feedback as well as all members of the REAL lab at Stanford for their detailed feedback on paper drafts and experiment directions. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.