RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation

1AIRI, 2Lomonosov Moscow State University,
*Equal contribution, +Project leader

Examples of real and predicted user clicks for the interactive segmentation task. The upper row depicts real-user clicks (green) for a given target object (white contour); the middle and bottom rows visualize, respectively, the clicks and click distributions predicted by our clickability model. Purple points in the middle and bottom rows represent clicks generated by the baseline strategy.

Abstract

The emergence of Segment Anything (SAM) sparked research interest in the field of interactive segmentation, especially in the context of image editing tasks and speeding up data annotation. Unlike common semantic segmentation, interactive segmentation methods allow users to directly influence their output through prompts (e.g., clicks). However, click patterns in real-world interactive segmentation scenarios remain largely unexplored. Most methods rely on the assumption that users would click in the center of the largest erroneous area. Nevertheless, recent studies show that this is not always the case. Thus, methods may perform poorly in real-world deployment despite high metrics in a baseline benchmark. To accurately simulate real-user clicks, we conducted a large crowdsourcing study of click patterns in an interactive segmentation scenario and collected 475K real-user clicks. Drawing on ideas from saliency tasks, we develop a clickability model that enables sampling clicks that closely resemble actual user inputs. Using our model and dataset, we propose the RClicks benchmark for a comprehensive comparison of existing interactive segmentation methods on realistic clicks. Specifically, we evaluate not only the average quality of methods, but also their robustness w.r.t. click patterns. According to our benchmark, interactive segmentation models may perform worse in real-world usage than reported in the baseline benchmark, and most of the methods are not robust. We believe that RClicks is a significant step towards creating interactive segmentation methods that provide the best user experience in real-world cases.

Benchmarking Interactive Segmentation

The goal of interactive segmentation is to obtain high-quality masks from user inputs over multiple interaction rounds. Benchmarking interactive segmentation methods requires user inputs; however, gathering real-user data is impractical.

In practice, segmentation quality is assessed with a baseline clicking strategy, in which each round's click is placed at the center of the largest erroneous area. This strategy does not accurately model real user behavior, so segmentation methods may perform worse in real-world scenarios than a benchmark based on baseline clicking suggests.
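Below is a minimal sketch of this baseline strategy, assuming the ground-truth and predicted masks are boolean NumPy arrays; the function name and the padding/tie-breaking details are illustrative and not taken from any particular method's released code.

import numpy as np
from scipy import ndimage

def baseline_click(gt_mask, pred_mask):
    # Place the next click at the center of the largest erroneous region:
    # the error pixel farthest from the error boundary, taken over
    # false-negative (missed object) and false-positive (extra background) areas.
    fn = gt_mask & ~pred_mask          # object pixels missing from the prediction
    fp = ~gt_mask & pred_mask          # background pixels wrongly labelled as object
    best = None
    for error, is_positive in ((fn, True), (fp, False)):
        dist = ndimage.distance_transform_edt(error)   # distance to the region boundary
        r, c = np.unravel_index(dist.argmax(), dist.shape)
        if best is None or dist[r, c] > best[0]:
            best = (dist[r, c], (int(r), int(c), is_positive))
    # (row, col, is_positive); assumes there is at least one erroneous pixel
    return best[1]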

To enable more accurate evaluation of interactive segmentation methods, we propose a highly realistic simulator of user clicks.

Averaged masks predicted by various segmentation methods for given real first-round clicks. These examples illustrate the sensitivity of segmentation methods to click locations.

Contributions

  • A multi-round interaction dataset of 475,544 clicks and the methodology of its collection.
  • A novel clickability model for realistic click simulation.
  • RClicks — a benchmark for measuring the real-world annotation time and robustness of interactive segmentation methods.
  • Methodology to estimate the real-world segmentation difficulty score for each instance in a dataset.

Data Collection

Our dataset is based on DAVIS, GrabCut, COCO-MVal, Berkeley, and TETRIS. We collected user inputs via Toloka.AI on both PC and mobile devices. To obtain error masks for the subsequent rounds, we applied state-of-the-art interactive segmentation methods: SAM, SimpleClick, and RITM.
Data collection pipeline.

Clickability Model

Our model predicts a clickability map: a single-channel image in which the value of each pixel corresponds to the probability that the user will click on it.
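One straightforward way to simulate clicks from such a map, sketched below, is to normalize it into a probability distribution over pixels and sample click coordinates from it; the function name and signature are illustrative, not the interface of our released code.

import numpy as np

def sample_clicks(clickability_map, num_clicks=1, rng=None):
    # clickability_map: 2D array of non-negative scores; higher values mean
    # the pixel is more likely to be clicked by a real user.
    rng = np.random.default_rng() if rng is None else rng
    probs = clickability_map.astype(np.float64).ravel()
    probs /= probs.sum()                                    # normalize to a distribution
    flat_idx = rng.choice(probs.size, size=num_clicks, p=probs)
    rows, cols = np.unravel_index(flat_idx, clickability_map.shape)
    return np.stack([rows, cols], axis=1)                   # (num_clicks, 2) array of (row, col)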

Proposed clickability prediction pipeline.

Our model performs best compared with the uniform distribution (UD), distance transform (DT), and saliency map (SM) baselines.

Examples of considered clickability models: (a) visualizes the target object (white contour) and ground-truth clicks (green points); (b) – (d) depict the uniform distribution (UD), distance transform (DT), and saliency map (SM) respectively; (e) – our predicted clickability map.
Evaluation of various clickability models on real-user clicks from the TETRIS validation part. Our approach outperforms existing clicking strategies in terms of the proximity of samples to real-user clicks.

RClicks Benchmark

Using the clickability map, we obtain clicking groups $\{G_i\}_{i=1}^{10}$. Each clicking group corresponds to clicks from a probability interval that contains 10% of the total probability mass.
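A sketch of one way to form such groups is given below: pixels are sorted by predicted probability and split into bins that each hold roughly 10% of the total mass, with G_1 assumed to collect the most probable clicks and G_10 the least probable. This is an illustration of the idea under these assumptions, not the exact procedure from the paper.

import numpy as np

def clicking_groups(clickability_map, num_groups=10):
    # Assign every pixel a group id in 1..num_groups so that each group
    # carries ~1/num_groups of the total clickability mass
    # (group 1 = most probable clicks, group num_groups = least probable).
    probs = clickability_map.astype(np.float64).ravel()
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # most probable pixels first
    cum_mass = np.cumsum(probs[order])              # cumulative probability mass
    ids_sorted = np.minimum((cum_mass * num_groups).astype(int), num_groups - 1) + 1
    group_ids = np.empty_like(ids_sorted)
    group_ids[order] = ids_sorted
    return group_ids.reshape(clickability_map.shape)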

Using clicking groups, we calculate the following metrics:
  • Sample NoC — mean and standard deviation of the number of clicks (capped at 20) needed to reach 90% IoU.
  • ∆SB — relative increase in Sample NoC compared to the baseline strategy.
  • ∆GR — relative increase in annotation time between the G$_1$ and G$_{10}$ clicking groups.
  • IoU noise-to-signal ratio (NSR) — estimates real-world robustness on first-round collected clicks; higher NSR values indicate more challenging segmentation instances (see the sketch after this list).
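For illustration, the sketch below computes ∆SB and NSR under a straightforward reading of the definitions above: ∆SB as the relative increase of Sample NoC over the baseline NoC, and NSR as the per-instance ratio of IoU standard deviation to mean IoU (in %) over first-round real-user clicks, averaged per dataset. The exact formulas in the paper may differ in details.

import numpy as np

def delta_sb(sample_noc, baseline_noc):
    # Relative increase (%) of Sample NoC over the baseline-strategy NoC.
    return 100.0 * (sample_noc - baseline_noc) / baseline_noc

def instance_nsr(first_click_ious):
    # IoU noise-to-signal ratio (%) for one instance: spread of IoU across
    # different first-round real-user clicks relative to the mean IoU.
    ious = np.asarray(first_click_ious, dtype=np.float64)
    return 100.0 * ious.std() / ious.mean()

def dataset_nsr(per_instance_ious):
    # Dataset-level NSR: average of the per-instance values.
    return float(np.mean([instance_nsr(i) for i in per_instance_ious]))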
Spatial distribution of clicking groups obtained from our clickability model.
Evaluation results of state-of-the-art interactive segmentation methods. Statistics of NoC20@90 on clicking groups, averaged over datasets.
A scatter plot of the mean vs. standard deviation (STD) of IoU for the first real-user clicks. Each point represents one instance, with statistics averaged across all considered segmentation methods and real clicks. The average NSR for each dataset is given in brackets in the legend.

Results

  • The baseline strategy underestimates real-world annotation time by 5% to 29%.
  • The annotation time of users from different clicking groups varies by 3% to 79%.
  • There is currently no segmentation method that is optimal in terms of both performance and robustness on all datasets. Developers should select a method in accordance with their requirements.
  • With an NSR of 24.15, DAVIS stands as the hardest dataset to annotate.

BibTeX

@inproceedings{antonov2024rclicks,
      title={RClicks: Realistic Click Simulation for Benchmarking Interactive Segmentation},
      author={Antonov, Anton and Moskalenko, Andrey and Shepelev, Denis and Krapukhin, Alexander and Soshin, Konstantin and Konushin, Anton and Shakhuro, Vlad},
      booktitle={NeurIPS 2024},
      year={2024}
}