Abstract
Multi-turn reasoning segmentation is essential for mimicking real-world clinical workflows, where anatomical structures are identified through step-by-step dialogue based on spatial, functional, or pathological descriptions. However, the lack of a dedicated benchmark in this area has limited progress. To address this gap, we introduce the first bilingual benchmark for multi-turn medical image segmentation, supporting both Chinese and English dialogues. The benchmark consists of 28,904 images, 113,963 segmentation masks, and 232,188 question–answer pairs, covering major organs and anatomical systems across CT and MRI modalities. Each dialogue requires the model to infer the segmentation target based on prior conversational turns and previously segmented regions. We evaluate several state-of-the-art models, including MedCLIP-SAM, LISA, and LISA++, and report three key findings: (1) existing models perform poorly on our benchmark, far below clinical usability standards; (2) performance degrades as dialogue turns increase, reflecting limited multi-turn reasoning capabilities; and (3) general-purpose models such as LISA can outperform medical-specific models, suggesting that further integration of domain knowledge is needed for specialized medical applications.
Research Motivation and Overview of the MTurn-Seg Benchmark
Dialogue Data Generation
Multi-Turn Dialogue Example
Single-Turn Reasoning Segmentation (CN)
Single-Turn Reasoning Segmentation (EN)
Multi-Turn Reasoning Segmentation
Video Presentation
BibTeX
@InProceedings{MULTISEGBIBM2025,
  title={MTurn-Seg: A Large-Scale Bilingual Medical Benchmark for Multi-Turn Reasoning Segmentation},
  author={Haitao Nie and Yimeng Zheng and Ying Ye and Bin Xie},
  booktitle={IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
  year={2025},
}