Modular Interactive Video Object Segmentation:
Interaction-to-Mask, Propagation and Difference-Aware Fusion
Video Results (Full Playlist)
(a) DAVIS interaction track results

ATNet results are obtained from its open-source code. Interactions are provided by the official DAVIS evaluation robot.

bmx-trees, DAVIS 2017 validation set.
kite-surf, DAVIS 2017 validation set.
pigs, DAVIS 2017 validation set.
scooter-black, DAVIS 2017 validation set.
soapbox, DAVIS 2017 validation set.

(b) Real user-interaction processes

These videos show the entire interaction process with our algorithm.

For objects with complex structure, users can combine different interaction techniques (clicks, scribbles, local control) to achieve accurate results.
Using clicks with f-BRS can be highly efficient for annotating objects with clear structure.
Scribbles and clicks can easily be used together.

(c) Real user-interaction results

We test the generalizability and robustness of our method on videos collected from the Internet. Our method works well even outside of the DAVIS dataset.

151 frames with 6 objects. User time: ~180s. The two fighters have close to mirrored appearance.
216 frames with 4 objects. User time: just under 6s.
130 frames with 2 objects. User time: ~60s. Thin structures, such as the legs of the chair, are captured well.
252 frames with 3 objects. User time: ~60s. Interaction between moving objects does not pose a major challenge to our method.
181 frames with 3 objects. User time: ~35s. All the pandas look alike, but our method still handles them efficiently.
168 frames with 1 object. User time: ~40s. Occlusion by objects with similar appearance is difficult for iVOS methods to handle.
132 frames with 3 objects. User time: ~20s.