Motubrain benchmark coverage currently centers on two sources: WorldArena for embodied world model evaluation and RoboTwin 2.0 for bimanual robotic manipulation. ShengShu Technology's April 29, 2026 launch announcement reports an EWMScore of 63.77 on WorldArena and an average RoboTwin 2.0 score of 96.0 across 50 predetermined tasks, with randomized-environment performance above 95.0.
Those figures are reported here as source-attributed claims. Motubrain.org has not independently rerun the benchmarks.
| Benchmark | What it evaluates | Source-attributed Motubrain claim | How to read it |
|---|---|---|---|
| WorldArena | Embodied world models across video quality and functional utility | EWMScore of 63.77 reported in the launch announcement | A higher EWMScore indicates stronger aggregate performance under WorldArena's normalized metric. |
| RoboTwin 2.0 | Dual-arm robotic manipulation with synthetic data, domain randomization, and multiple embodiments | 96.0 average across 50 predetermined tasks, plus randomized-environment performance above 95.0 reported in the launch announcement | Scores should be read with the task set, simulator, robot embodiments, and randomization settings in mind. |
WorldArena describes itself as a unified benchmark for evaluating embodied world models across perceptual and functional dimensions. It covers video perception quality, embodied task functionality, and human evaluation. The project page says EWMScore is the arithmetic mean of 16 normalized base metrics, scaled to a 0 to 100 range, where higher scores indicate stronger overall performance.
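The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration and not WorldArena's evaluation code: it assumes each base metric has already been normalized to the range 0 to 1, which the project page description does not spell out, and the function name `ewm_score` and the input values are hypothetical.

```python
from statistics import fmean


def ewm_score(normalized_metrics, expected_count=16):
    """Average already-normalized base metrics and scale the mean to 0-100.

    Assumption: every metric is normalized to [0, 1] before aggregation.
    How WorldArena normalizes each metric is not covered here.
    """
    values = list(normalized_metrics)
    if len(values) != expected_count:
        raise ValueError(f"expected {expected_count} metrics, got {len(values)}")
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("every metric must be normalized to [0, 1]")
    # Arithmetic mean of the base metrics, rescaled to a 0-100 score.
    return 100.0 * fmean(values)


# Placeholder inputs for illustration only; these are not real benchmark results.
example_metrics = [0.5] * 16
print(ewm_score(example_metrics))
```

Because the score is a plain average, a weak result on one base metric can be offset by strong results on others, so two models with similar scores may differ considerably on individual metrics.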
For Motubrain, the score is best read conservatively: it is useful as a public claim and comparison point, but the benchmark methodology matters more than the rank alone. Review the WorldArena project page and leaderboard submission process before relying on the score, and do not treat it as a deployment guarantee.
RoboTwin 2.0 is a scalable data generator and benchmark for robust bimanual robotic manipulation. The project describes a 50-task benchmark built on the RoboTwin Object Dataset, with 731 objects across 147 categories and support for five robot embodiments. The accompanying arXiv abstract also describes structured domain randomization across clutter, lighting, background, tabletop height, and language.
That context matters because a manipulation benchmark is not just a single number. The score depends on task definitions, simulator setup, robot configuration, visual variation, language variation, and whether the model was evaluated in clean or randomized conditions.
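To make the clean-versus-randomized distinction concrete, the sketch below averages per-task success rates separately for each evaluation condition. It is a hedged illustration, not RoboTwin 2.0's scoring code: the `TaskResult` record, the `average_by_condition` helper, the task names, and every number are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class TaskResult:
    task: str            # hypothetical task name
    condition: str       # e.g. "clean" or "randomized"
    success_rate: float  # percentage from 0 to 100


def average_by_condition(results):
    """Return the mean success rate for each evaluation condition."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.condition].append(result.success_rate)
    return {condition: fmean(rates) for condition, rates in grouped.items()}


# Made-up values for illustration only; these are not reported results.
results = [
    TaskResult("task_a", "clean", 100.0),
    TaskResult("task_b", "clean", 90.0),
    TaskResult("task_a", "randomized", 80.0),
    TaskResult("task_b", "randomized", 70.0),
]
print(average_by_condition(results))
```

Keeping the condition attached to every result makes it harder to quote a single blended average that hides a gap between clean and randomized performance.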
The reported results do not by themselves prove that Motubrain is available as a public product, safe for unsupervised real-world robotics, or reproducible by outside teams. They also do not replace hardware-specific validation. A robotics team would still need to inspect access terms, integration requirements, test conditions, safety layers, and failure behavior.