RoboBenchMart: Benchmarking Robots in Retail Environment

Gregorii Bukhtuev, Andrey Kuznetsov, Vlad Shakhuro
FusionBrain Lab, Robotics Group

RoboBenchMart in action — the Fetch robot operates in a realistic, cluttered retail environment

Abstract

Most existing robotic manipulation benchmarks focus on simplified tabletop scenarios, typically involving a stationary robotic arm interacting with various objects on a flat surface. To address this limitation, we introduce RoboBenchMart, a more challenging and realistic benchmark designed for dark store environments, where robots must perform complex manipulation tasks with diverse grocery items. This setting presents significant challenges, including dense object clutter and varied spatial configurations — with items positioned at different heights, depths, and in close proximity. By targeting the retail domain, our benchmark addresses a setting with strong potential for near-term automation impact. We demonstrate that current state-of-the-art generalist models struggle to solve even common retail tasks. To support further research, we release the RoboBenchMart suite, which includes a procedural store layout generator, a trajectory generation pipeline, evaluation tools and fine-tuned baseline models.

Robotics Simulation for Retail

Real-world evaluation of robotic policies is difficult to standardize: it often requires manual scene resets and suffers from environment variability. As a result, simulation-based benchmarks have become popular for their reproducibility and ease of use.

Existing benchmarks mostly focus on household tasks. However, retail and logistics scenarios — such as shelf picking or order packing — remain underexplored. Dedicated benchmarks for these domains are needed to advance robotic capabilities in retail environments.

RoboBenchMart addresses limitations of prior works by providing code to generate diverse store layouts and robotic trajectories, enabling the training and benchmarking of robotic policies in retail environments.

Comparison of the proposed retail robotics benchmark with other benchmarks and datasets

Contributions

  • Store Plan Generator: an open procedural pipeline for generating realistic and diverse store layouts and product arrangements. It enables scalable creation of retail environments for training and evaluating robotic policies.
  • Store Trajectories Sampler: a pipeline that automatically collects trajectories for common retail tasks using motion planning and reinforcement learning methods. We also release a dataset of synthetic trajectories generated for the Fetch robot embodiment.
  • Store Robotics Benchmark: the first open benchmark dedicated to evaluating robotic policies in retail environments. Using our benchmark, we demonstrate that current state-of-the-art models struggle to complete typical retail tasks.

Store Plan Generator

We simulate dark-store environments as warehouse-style spaces filled with shelving and refrigeration units in diverse, randomized layouts. To support domain randomization, we vary wall, floor, and ceiling textures and use multiple fixture designs. Product items are then placed on shelves in realistic, slightly perturbed poses to mimic natural variability.
Fixture Arrangement. We seed a rectangular floor plan with pallets, boxes, and freezers. Rejection sampling guarantees collision-free initial placement. Next, we compute a smooth tensor field from store boundaries and fixture polygons. This field encodes local “flow” directions that naturally align aisles and corridors. Shelving is then placed in two passes—horizontal first, vertical second—following the local field direction. Every placement enforces clearance and minimum passage widths for navigation. A small probabilistic skip adds variety between scenes. The result is a clean, navigable layout that still looks different from run to run.
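The collision-free seeding step can be sketched as plain rejection sampling over axis-aligned boxes. This is a minimal illustration, not the actual pipeline: the box representation, function names, and retry budget are our assumptions.

```python
import random

def aabb_overlap(a, b):
    """Axis-aligned overlap test; each box is (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def seed_fixtures(n, floor_w, floor_h, sizes, max_tries=1000):
    """Rejection sampling: draw a random pose, keep it only if it stays
    inside the floor plan and collides with no previously placed fixture."""
    placed = []
    for _ in range(max_tries):
        if len(placed) == n:
            break
        w, h = random.choice(sizes)
        box = (random.uniform(0, floor_w - w),
               random.uniform(0, floor_h - h), w, h)
        if not any(aabb_overlap(box, other) for other in placed):
            placed.append(box)
    return placed
```

Rejection sampling keeps the placement logic trivial at the cost of wasted draws, which is acceptable for the handful of pallets, boxes, and freezers seeded per scene.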
Examples of stores generated by our pipeline: the sampled tensor field, the resulting fixture layout, and the final generated store
Product Arrangement. We use scene_synthesizer to detect shelf surfaces suitable for item placement. Items are placed on a grid with slight pose jitters for realism. The module supports vertical stacking to match common retail patterns. It can also leave front-edge gaps via a Poisson process to mimic partial depletion over time.
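The grid-with-jitter placement and Poisson-process depletion can be sketched as follows. This is an illustrative sketch only: slot geometry, jitter magnitudes, and the depletion rate are assumptions, not the simulator's actual parameters.

```python
import math
import random

def poisson_sample(lam):
    """Draw from Poisson(lam) by multiplying uniforms (Knuth's method)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def arrange_row(shelf_width, item_width, depth_slots,
                jitter=0.01, depletion_rate=0.5):
    """Fill one shelf row on a grid with small pose jitter. The first few
    depth slots of each column are emptied, with the gap length drawn from
    a Poisson distribution to mimic front-edge depletion over time."""
    poses = []
    n_cols = int(shelf_width // item_width)
    for col in range(n_cols):
        gap = poisson_sample(depletion_rate)  # depleted slots at the front edge
        for slot in range(gap, depth_slots):
            # Assume a square item footprint so the same spacing works in depth.
            x = (col + 0.5) * item_width + random.uniform(-jitter, jitter)
            y = slot * item_width + random.uniform(-jitter, jitter)
            yaw = random.uniform(-0.05, 0.05)  # slight rotation for realism
            poses.append((x, y, yaw))
    return poses
```

Raising `depletion_rate` across successive "days" reproduces the progressive thinning shown below.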
Example of product arrangement and shelf depletion over time (days 1, 2, 4, and 8) produced by our simulator
Examples of collected product assets
Assets & Textures. Our asset set includes 3 shelf models, 2 refrigerator models, and 370 product items across 21 categories. We also use 26 floor, 17 wall, and 15 ceiling textures for visual diversity. All assets are normalized for orientation and scale using retail reference dimensions. This keeps proportions realistic across scenes. To maintain performance with hundreds of items, we run an automatic mesh-simplification pipeline (QuadriFlow, Marching Cubes, and shape-aware approximations via the Blender API). From Pareto candidates, we pick meshes that minimize geometry error while maximizing triangle reduction. All curated and optimized assets are released publicly.
Examples of ceiling, wall, and floor textures used in our store generation pipeline, illustrating just a subset of possible variations
Example of different geometry approximations for assets (original on the left). Numbers above indicate face count for each mesh
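The Pareto-based mesh selection can be illustrated as follows. The two objectives (geometry error and face count) come from the text above; the error tolerance and the tuple-based data layout are our own assumptions for the sketch.

```python
def pareto_front(candidates):
    """Keep candidates not dominated by any other. A mesh dominates another
    if it has both lower (or equal) geometry error and fewer (or equal)
    faces. Each candidate is a (geometry_error, face_count) pair."""
    return [c for c in candidates
            if not any(d[0] <= c[0] and d[1] <= c[1] and d != c
                       for d in candidates)]

def pick_mesh(candidates, max_error=0.05):
    """From the Pareto front, pick the fewest-face mesh whose geometry error
    stays under a tolerance; fall back to the lowest-error candidate if no
    mesh satisfies the tolerance."""
    front = pareto_front(candidates)
    within = [c for c in front if c[0] <= max_error]
    if within:
        return min(within, key=lambda c: c[1])
    return min(front, key=lambda c: c[0])
```

Restricting the choice to the Pareto front first guarantees that no selected mesh could be improved in one objective without worsening the other.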

Store Trajectories Sampler

We generate training trajectories for dark-store tasks using two sources: motion planning and reinforcement learning (RL). The resulting demos are suitable for imitation learning pre-training or fine-tuning.
Motion Planning. For each task, we define randomized anchor poses (start, intermediates, goal) to diversify demonstrations. The planner solves each segment between anchors in sequence. When only the arm moves, we first attempt a fast screw motion (with no obstacle awareness) and validate it for collisions. If invalid, we fall back to RRT-Connect with obstacle checks. If both fail, we reset the scene and resample. When mobile base motion is needed, we use task-specific safety-aware heuristics to route the base and execute the arm motion locally. Result: feasible trajectories in ~60% of attempts across tasks.
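The fallback cascade for arm-only segments can be sketched as the control flow below. The callables are hypothetical stand-ins for the real planning stack; only the ordering (screw motion, collision check, RRT-Connect, scene reset) follows the description above.

```python
def plan_segment(screw_motion, rrt_connect, is_collision_free, resample,
                 max_resets=3):
    """Fallback cascade for one anchor-to-anchor segment:
    1. try a fast screw motion (no obstacle awareness) and validate it;
    2. if invalid, fall back to obstacle-aware RRT-Connect;
    3. if both fail, reset the scene, resample anchors, and retry."""
    for _ in range(max_resets):
        path = screw_motion()
        if path is not None and is_collision_free(path):
            return path
        path = rrt_connect()
        if path is not None:
            return path
        resample()  # reset the scene and draw new anchor poses
    return None
```

Trying the cheap planner first keeps average planning time low, since most uncluttered segments never need the sampling-based fallback.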
Reinforcement Learning. We train per-task policies with privileged state and PPO, using hand-crafted rewards that balance target proximity, correct placement, and collision avoidance with shelf items. Result: feasible trajectories in ~60% of rollouts across tasks, providing complementary demos to the planner for imitation learning.
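A hand-crafted dense reward in the spirit described above might combine a reaching term, a placement term, and a per-contact penalty. Every term and weight below is an illustrative assumption, not the reward actually used for training.

```python
import math

def retail_task_reward(ee_to_target, item_to_goal, placed_upright,
                       n_collisions, w_reach=1.0, w_place=2.0,
                       w_upright=0.5, collision_penalty=0.1):
    """Illustrative dense reward balancing the three objectives named in
    the text: target proximity, correct placement, and collision avoidance
    with surrounding shelf items."""
    reach = w_reach * math.exp(-5.0 * ee_to_target)   # -> w_reach at the target
    place = w_place * math.exp(-5.0 * item_to_goal)   # -> w_place at the goal pose
    upright = w_upright if placed_upright else 0.0    # bonus for a stable pose
    return reach + place + upright - collision_penalty * n_collisions
```

Exponentially shaped distance terms give a smooth gradient far from the goal while saturating near it, a common choice for PPO with privileged state.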
Examples of heuristically generated anchor poses used in our motion planner

Store Robotics Benchmark

We evaluate state-of-the-art generalist policies, fine-tuned on trajectories from our sampler, in retail settings. The benchmark runs on ManiSkill3 for fast, realistic physics and ray-traced rendering. We use the Fetch mobile manipulator: a differential-drive base, a 7-DOF arm, and a prismatic torso for vertical reach. A parallel gripper enables reliable, generic grasping.
Testing Scenarios. We probe generalization along robot start pose, textures, store layouts, shelf arrangements, and item novelty. We report three tiers: In-Domain (start pose only), Unseen Scenes (start pose + textures + layouts), and Unseen Scenes & Items (plus OOD items from other tasks). Harder settings with completely unseen items and shelf layouts are supported but excluded, since current policies already fail at the easier tiers.
Tasks. Each task uses a text instruction with target item/fixture names. We check goal achievement and penalize unwanted collisions or scene disturbance.
Atomic tasks: pick to basket, pick from floor, from board to board, open fridge, close fridge.
Composite tasks: pick {N} items; pick from fridge (open → pick → close).
Generalist Baselines & Data. We fine-tune Octo and π₀ with imitation learning on trajectories from our sampler. To keep compute modest, we use 248 trajectories per (task, item, fixture)—2,480 demos total. To test cross-task generalization, we train on only 2–3 objects per task and keep shelves fully packed in train/test. Models are fine-tuned on atomic tasks; composites are executed as sequences of atomic instructions.
Evaluation Results. We report the mean success rate per task with 50 trials per (task, item, fixture) triplet. Octo fails across all scenarios. π₀ is modestly successful in-domain but degrades on Unseen Scenes and drops to near zero in Unseen Scenes & Items. π₀.₅ performs significantly better and is the only model that achieves non-zero success rates in the Unseen Scenes & Items scenario, though it remains far from reliable. Composite-task success is effectively zero for all models.
Average success rates (%) of generalist VLA models on atomic and composite retail tasks across different testing scenarios. Higher values indicate better performance. n/a indicates that the scenario is not applicable for the task
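The reported numbers correspond to a simple aggregation: for each task, average the binary trial outcomes over every (item, fixture) combination. A sketch, with a hypothetical data layout:

```python
from collections import defaultdict

def mean_success_per_task(trials):
    """Aggregate per-trial outcomes into per-task success rates (%).
    `trials` maps (task, item, fixture) -> list of 0/1 trial outcomes,
    e.g. 50 outcomes per triplet as in our evaluation protocol."""
    per_task = defaultdict(list)
    for (task, _item, _fixture), outcomes in trials.items():
        per_task[task].extend(outcomes)
    return {task: 100.0 * sum(o) / len(o) for task, o in per_task.items()}
```

Pooling all trials of a task (rather than averaging per-triplet means) weights each (item, fixture) combination by its trial count; with a fixed 50 trials per triplet the two conventions coincide.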

Results highlight key limitations of existing generalist models: fragility to minor scene changes (e.g., layouts, textures, object placements), poor generalization from limited demonstrations to novel object-task combinations, and inadequate support for long-horizon, compositional execution. Our findings suggest that existing pretrained models may be insufficient for effective application in the retail domain, and that targeted pretraining on retail-specific data may be necessary.

BibTeX

@article{soshin2025robobenchmart,
  title={RoboBenchMart: Benchmarking Robots in Retail Environment},
  author={Soshin, Konstantin and Krapukhin, Alexander and Spiridonov, Andrei and Shepelev, Denis and Bukhtuev, Gregorii and Kuznetsov, Andrey and Shakhuro, Vlad},
  journal={arXiv preprint arXiv:2511.10276},
  year={2025}
}