arXiv Preprints (9)
AX2026-05-18T17:59:02Z
Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary as a function of action. Rather than passively processing what is seen, they actively uncover what is unseen - occluded structure, dynamics, containment, and functionality that cannot be resolved from passive sensing alone. We move beyond prior formulations of spatial intelligence that assume oracle observations by recasting the observer as an actor. We introduce ESI-BENCH, a comprehensive benchmark for embodied spatial intelligence spanning 10 task categories and 29 subcategories built on OmniGibson, grounded in Spelke's core knowledge systems. Agents must decide what abilities to deploy - perception, locomotion, and manipulation - and how to sequence them to actively accumulate task-relevant evidence. We conduct extensive experiments on state-of-the-art MLLMs and find that active exploration substantially outperforms passive counterparts, with agents spontaneously discovering emergent spatial strategies without explicit instructions, while random multi-view often adds noise rather than signal despite consuming far more images. Most failures stem not from weak perception but from action blindness: poor action choices lead to poor observations, which in turn drive cascading errors. While explicit 3D grounding stabilizes reasoning on depth-sensitive tasks, imperfect 3D representation proves more harmful than 2D baselines by distorting spatial relations. Human studies further reveal that unlike humans who seek falsifying viewpoints and revise beliefs under contradiction, models commit prematurely with high confidence regardless of evidence quality, exposing a metacognitive gap that neither better perception nor more embodied interaction alone can close.
AX2026-05-18T17:52:14Z
Robots struggle to navigate new places because they can't learn useful lessons from past experiences — each trip starts from scratch. We built **Robo-Cortex**, a system that lets robots teach themselves how to navigate better over time.
Instead of just reacting, Robo-Cortex writes down what worked and what didn't in plain language, building up a library of navigation tips it can reuse later.
**How it works:**
- **Autonomous Knowledge Induction (AKI):** Turns past trips into a structured library of navigation rules.
- **Dual-Grain Cognitive Memory:** Two layers of memory — a short-term one that tracks current progress, and a long-term one that stores reusable do's and don'ts.
- **Imagine-then-Verify loop:** Before acting, the robot simulates what might happen and a vision-language model checks the plan.
**Results:** On three benchmarks (IGNav, AR, AEQA), Robo-Cortex beats the best existing methods — up to +4.16% SPL on familiar tasks, and up to +15.30% SPL when applying its lessons to new environments. Early real-robot tests back this up.
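The summary reports gains in SPL without expanding it. A minimal sketch, assuming the standard embodied-navigation definition (Success weighted by Path Length): each successful episode contributes the ratio of the shortest path length to the path actually taken, and failures contribute zero. The episode values below are made up for illustration and are not from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: list of (success, shortest_path_length, taken_path_length).
    """
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # Taking a longer route than the shortest path lowers the credit.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Illustrative episodes: optimal success, success with a 2x detour, failure.
episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
score = spl(episodes)  # (1.0 + 0.5 + 0.0) / 3
```

Because the metric penalizes detours as well as failures, a gain in SPL can come from shorter routes even when the success rate is unchanged.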
AX2026-05-18T17:51:34Z
Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $π_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $π_{0.5}$ and $π_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.
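The gap between strict problem-level accuracy and field-wise accuracy can be illustrated with a small sketch. It assumes that "strict" means every field of a problem's game state must match, and that field-wise accuracy is the fraction of individual fields that match; the abstract does not spell out either definition. The field names (`pot`, `to_act`, `board`) and values are hypothetical.

```python
def strict_and_fieldwise(predictions, targets):
    """Return (strict problem-level accuracy, field-wise accuracy)."""
    strict_hits = 0
    field_hits = 0
    field_total = 0
    for pred, gold in zip(predictions, targets):
        matches = [pred.get(key) == value for key, value in gold.items()]
        strict_hits += all(matches)   # counts only if every field is right
        field_hits += sum(matches)    # counts each correct field separately
        field_total += len(matches)
    return strict_hits / len(targets), field_hits / field_total


# Hypothetical game-state fields, for illustration only.
gold = [
    {"pot": 40, "to_act": "hero", "board": 3},
    {"pot": 60, "to_act": "hero", "board": 4},
]
pred = [
    {"pot": 40, "to_act": "hero", "board": 3},
    {"pot": 60, "to_act": "villain", "board": 4},  # one wrong field
]
strict_acc, field_acc = strict_and_fieldwise(pred, gold)
```

Here one wrong field out of six halves the strict score while the field-wise score stays above 80%, which is the kind of divergence the abstract reports.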
AX2026-05-18T17:50:32Z
Most VLA (Vision-Language-Action) models today only handle simple two-finger grippers or one dexterous hand. Two-finger control is easy enough without VLA, but dexterous hands really need full end-to-end learning to work well.
We built Dexora, the first open-source VLA system made for two arms and two dexterous hands at the same time.
To collect training data, we built a hybrid teleoperation setup:
- An exoskeleton backpack tracks arm motion
- Apple Vision Pro tracks finger motion (no markers needed)
- The same setup drives both a real robot and a matching MuJoCo simulation
With this, we gathered:
- 100K simulated trajectories (6.5M frames)
- 10K real teleoperated episodes (2.92M frames)
Teleop data is often noisy, so we trained a discriminator to score each clip. The diffusion-transformer policy then learns less from low-quality clips.
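One simple way to make a policy "learn less" from low-quality clips is to weight each clip's loss by its discriminator score. The sketch below assumes scores in [0, 1] and a normalized weighted average; the summary does not say how Dexora actually applies the scores, so the function name and weighting rule are illustrative only.

```python
def quality_weighted_loss(per_clip_losses, quality_scores):
    """Average per-clip losses, weighting each clip by its quality score."""
    total_weight = sum(quality_scores)
    if total_weight == 0:
        raise ValueError("all quality scores are zero")
    weighted = sum(loss * q for loss, q in zip(per_clip_losses, quality_scores))
    return weighted / total_weight


# Illustrative values: a clean clip with low loss, a noisy clip with high loss.
losses = [1.0, 3.0]
scores = [0.9, 0.1]
weighted = quality_weighted_loss(losses, scores)  # dominated by the clean clip
unweighted = sum(losses) / len(losses)
```

With these numbers the noisy clip contributes a tenth of the weight, so the training signal stays close to the clean clip's loss instead of the plain average.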
Results:
- 66.7% average success on dexterous tasks (vs. 51.7% for baselines)
- 90% success on basic tasks
- Strong generalization to new objects and robot bodies
Ablations show real data and the discriminator both matter for dexterity.
AX2026-05-18T17:50:22Z
This paper compares data-driven methods (N4SID, ARX, SINDYc) for modeling a tendon-driven continuum robot built at CERN. These robots are hard to model because they are nonlinear, high-dimensional, and dominated by friction. Experiments show that a simple two-degree-of-freedom model captures the dynamics well, since the joints move in strongly linked ways. The models match experimental data and work inside a model predictive controller for real-time control.
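As a rough illustration of what an ARX fit involves, the sketch below identifies a first-order, single-input model y[t] = a·y[t-1] + b·u[t-1] by ordinary least squares. The model order, the synthetic data, and the function name are assumptions made for the example; the paper's actual model structure and data are not given in this summary.

```python
def fit_arx1(y, u):
    """Least-squares fit of y[t] = a*y[t-1] + b*u[t-1] via 2x2 normal equations."""
    syy = syu = suu = sy_next = su_next = 0.0
    for t in range(1, len(y)):
        y_prev, u_prev, y_now = y[t - 1], u[t - 1], y[t]
        syy += y_prev * y_prev
        syu += y_prev * u_prev
        suu += u_prev * u_prev
        sy_next += y_prev * y_now
        su_next += u_prev * y_now
    det = syy * suu - syu * syu
    if abs(det) < 1e-12:
        raise ValueError("regressors are collinear; input is not exciting enough")
    # Cramer's rule on [[syy, syu], [syu, suu]] @ [a, b] = [sy_next, su_next].
    a = (sy_next * suu - syu * su_next) / det
    b = (syy * su_next - syu * sy_next) / det
    return a, b


# Synthetic, noise-free data generated with a = 0.8 and b = 0.5.
u = [((7 * t) % 5) - 2.0 for t in range(50)]
y = [0.0]
for t in range(1, 50):
    y.append(0.8 * y[t - 1] + 0.5 * u[t - 1])

a_hat, b_hat = fit_arx1(y, u)
```

On noise-free data the fit recovers the generating coefficients; with measured data the same regression yields the best linear one-step predictor in the least-squares sense.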
AX2026-05-18T17:59:52Z
Current attention methods like NSA and InfLLMv2 pick the top-k most relevant chunks of text, then run detailed attention on them. But this has two problems: it always picks the same number of chunks regardless of the query, and it blocks gradients from flowing between the coarse and fine stages.
We introduce **DashAttention** (Differentiable and Adaptive Sparse Hierarchical Attention). Instead of top-k, it uses α-entmax to pick a flexible number of chunks based on the query. This keeps the whole process differentiable end-to-end.
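To show how an entmax-style mapping can select a query-dependent number of chunks, here is a sketch of sparsemax, the α = 2 special case of α-entmax. It is not DashAttention's implementation: the value of α used in the paper is not stated here, and the chunk scores are invented.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; many outputs are exactly 0."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if 1 + k * value > cumulative:
            tau = (cumulative - 1) / k  # threshold for a support of size k
        else:
            break  # the support condition fails for all larger k
    return [max(s - tau, 0.0) for s in scores]


# Two hypothetical queries scoring the same four chunks.
ambiguous = sparsemax([2.0, 1.9, 0.1, 0.0])  # two close top scores
peaked = sparsemax([3.0, 0.5, 0.2, 0.1])     # one dominant score

selected_ambiguous = sum(p > 0 for p in ambiguous)
selected_peaked = sum(p > 0 for p in peaked)
```

The first query keeps two chunks and the second keeps one, with no fixed k. Because the output is a piecewise-linear function of the scores, gradients can pass through the selection step, unlike a hard top-k.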
Unlike other hierarchical methods, DashAttention is non-dispersive, which helps it handle long contexts better. In LLM experiments, it matches full attention's accuracy at 75% sparsity and beats NSA and InfLLMv2, especially when sparsity is high. We also built a GPU-optimized Triton version that runs faster than FlashAttention-3 at inference.
In short, DashAttention is a cheaper way to handle long contexts.
AX2026-05-18T17:59:18Z
Pipeline parallelism is a key technique for scaling large-model training, but modern workloads exhibit runtime variability in computation and communication. Existing pipeline systems typically consume static, profiled, or adaptively generated schedules as pre-committed execution orders. When realized task readiness diverges from the pre-committed order, stages may wait for not-yet-ready work even though other executable work is available, creating stage misalignment, idle bubbles, and reduced utilization. We present Runtime-Readiness-First Pipeline (RRFP), a readiness-driven runtime for pipeline-parallel training. RRFP changes how schedules are consumed at runtime: instead of treating a schedule as a sequence that stages must wait to follow, it treats the schedule as a non-binding hint order for ranking currently ready work. To support this model, RRFP combines message-driven asynchronous communication, lightweight tensor-parallel coordination for collective consistency, and ready-set arbitration for low-overhead dispatch. We implement RRFP in a Megatron-based training framework and evaluate it on language-only and multimodal workloads at up to 128 GPUs. RRFP improves over fixed-order pipeline baselines across all settings. Using the BFW hint, RRFP achieves up to 1.77$\times$ speedup on language-only workloads and up to 2.77$\times$ on multimodal workloads. In cross-framework comparisons, RRFP with the default BF hint outperforms the faster available external system by up to 1.84$\times$ while preserving training correctness.
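The difference between following a schedule and using it as a hint can be sketched in a few lines. This is an illustration of the idea only, not RRFP's dispatcher; the task labels and both function names are hypothetical.

```python
def fixed_order_next(ready, schedule, cursor):
    """Fixed-order baseline: only the task at the cursor may run, else stall."""
    task = schedule[cursor]
    return task if task in ready else None


def readiness_first_next(ready, hint_order):
    """Pick the ready task ranked earliest by the hint; None if nothing is ready."""
    if not ready:
        return None
    rank = {task: position for position, task in enumerate(hint_order)}
    # Tasks missing from the hint are ranked after every hinted task.
    return min(ready, key=lambda task: rank.get(task, len(rank)))


# Hypothetical stage state: the schedule expects forward pass F1 next,
# but only backward pass B0 and forward pass F2 have their inputs.
hint = ["F0", "F1", "B0", "F2", "B1"]
ready = {"B0", "F2"}

baseline_choice = fixed_order_next(ready, hint, cursor=1)
rrfp_style_choice = readiness_first_next(ready, hint)
```

The baseline returns `None` and the stage idles, while the readiness-first rule runs `B0`, the highest-ranked task that is actually executable.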
AX2026-05-18T17:59:00Z
Diffusion models often use inference-time guidance, such as drift terms or expert reweighting, to improve sample quality for specific tasks. Most such methods rely on repeated score or gradient evaluations, which makes them biased, slow, or both.
We introduce **URGE** (Unbiased Resampling via Girsanov Estimation), a derivative-free algorithm that reweights trajectories using a Girsanov change of measure. Instead of computing gradient-based particle weights, URGE attaches a simple multiplicative weight to each simulated path and resamples periodically. No scores, Hessians, or PDEs needed.
We prove that path-wise and particle-wise SMC are equivalent: the Girsanov path weight, when averaged backward, recovers the standard particle weights, so both methods produce the same unbiased result.
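The path-weighting idea can be illustrated on a toy problem. The sketch below simulates driftless Brownian paths, accumulates the discrete-time Girsanov log-weight for a target process with drift, and resamples periodically. It is a generic textbook-style construction under assumed settings (one dimension, constant diffusion, multinomial resampling), not the URGE algorithm itself; all names and parameters are illustrative.

```python
import math
import random


def guided_mean(drift, n_paths=4000, n_steps=20, horizon=1.0, sigma=1.0,
                resample_every=5, seed=0):
    """Estimate E[X_T] under dX = drift(X) dt + sigma dW using driftless paths."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    xs = [0.0] * n_paths
    log_w = [0.0] * n_paths
    for step in range(1, n_steps + 1):
        for i in range(n_paths):
            dx = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # reference step
            u = drift(xs[i])
            # Log-ratio of the drifted to the driftless Euler transition density.
            log_w[i] += u * dx / sigma**2 - 0.5 * u * u * dt / sigma**2
            xs[i] += dx
        if step % resample_every == 0 and step < n_steps:
            top = max(log_w)
            weights = [math.exp(lw - top) for lw in log_w]
            xs = rng.choices(xs, weights=weights, k=n_paths)  # multinomial
            log_w = [0.0] * n_paths  # resampled paths carry equal weight
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    return sum(w * x for w, x in zip(weights, xs)) / sum(weights)


# Constant drift 0.5 over unit time: the target mean of X_T is 0.5.
estimate = guided_mean(lambda x: 0.5)
```

Only evaluations of the drift are needed to form the weights; no score, gradient, or Hessian of a density appears anywhere in the loop.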
In experiments, URGE beats existing inference-time guidance baselines on synthetic tests and diffusion benchmarks. It generates better samples, is simpler to implement, and is fully gradient-free.
AX2026-05-18T17:57:04Z
Large multimodal AI models (MLLMs) often miss small but important details in images. We noticed something interesting: when you crop the image to show just the relevant part, the same model answers correctly. When you give it the full image, it fails. So the problem isn't that the model can't recognize things — it's that it can't focus on what matters.
To fix this, we built **Vision-OPD (Vision On-Policy Distillation)**. The idea: let the model teach itself. We run the same model in two modes:
- **Teacher**: sees the cropped, relevant part of the image
- **Student**: sees the full image
The student tries to answer, and we train it to match the teacher's predictions token by token. Over time, the student learns to "zoom in" mentally without anyone cropping for it.
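The token-by-token matching step can be sketched as a per-token divergence between the two output distributions. The example assumes a forward KL from teacher to student, averaged over tokens; the summary does not say which divergence Vision-OPD uses, and the logits below are invented. Sampling the answer from the student (the on-policy part) is omitted.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary at one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(teacher_seq, student_seq):
    """Mean per-token KL across the answer tokens."""
    pairs = list(zip(teacher_seq, student_seq))
    return sum(token_kl(t, s) for t, s in pairs) / len(pairs)


# Illustrative logits over a 3-word vocabulary for a 2-token answer.
teacher = [[4.0, 0.0, 0.0], [0.0, 3.0, 0.0]]      # same model, cropped image
student = [[1.0, 1.0, 1.0], [0.0, 3.0, 0.0]]      # same model, full image

loss = distill_loss(teacher, student)
loss_when_matched = distill_loss(teacher, teacher)
```

The loss is zero only when the full-image distribution already matches the cropped-image one, so minimizing it pushes the student toward the answer it would give if the relevant region had been cropped for it.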
What's nice about this approach:
- No bigger teacher model needed
- No labeled answers needed
- No reward model needed
- No extra tools at inference time
In tests on fine-grained visual benchmarks, Vision-OPD matches or beats much larger models — including closed-source ones and agentic "Thinking-with-Images" systems.