MTurn-Seg: A Large-Scale Bilingual Medical Benchmark for Multi-Turn Reasoning Segmentation

Central South University, Changsha, China
Artificial Intelligence and Robotics Laboratory (AIRLab)
IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2025

*Indicates Equal Contribution

Abstract

Multi-turn reasoning segmentation is essential for mimicking real-world clinical workflows, where anatomical structures are identified through step-by-step dialogue based on spatial, functional, or pathological descriptions. However, the lack of a dedicated benchmark in this area has limited progress. To address this gap, we introduce the first bilingual benchmark for multi-turn medical image segmentation, supporting both Chinese and English dialogues. The benchmark consists of 28,904 images, 113,963 segmentation masks, and 232,188 question–answer pairs, covering major organs and anatomical systems across CT and MRI modalities. Each dialogue requires the model to infer the segmentation target based on prior conversational turns and previously segmented regions. We evaluate several state-of-the-art models, including MedCLIP-SAM, LISA, and LISA++, and report three key findings: (1) existing models perform poorly on our benchmark, far below clinical usability standards; (2) performance degrades as dialogue turns increase, reflecting limited multi-turn reasoning capabilities; and (3) general-purpose models such as LISA can outperform medical-specific models, suggesting that further integration of domain knowledge is needed for specialized medical applications.

Research Motivation and Overview of the MTurn-Seg Benchmark

A. Research Motivation. B. Major organs covered by the benchmark. C. Distribution of dialogue turns. D. Distribution of imaging modalities. E. Bilingual word clouds. F. Distribution of QA pairs across human organ systems.

Dialogue Data Generation

We extracted object centroids and areas to compute positions and spatial relations. Clinicians wrote functional and pathological descriptions in Chinese and English, providing at least 20 variants per organ or lesion. This produced image–mask–text triplets. We then formed multi-turn samples by conditioning each round on previous segmentations, supporting both medical QA and new segmentations with cross-turn references. The data were built in three steps: writing templates, rewriting them with an LLM, and manually reviewing a sample for quality.
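The centroid-and-area step above can be sketched in a few lines. The following is a minimal illustration (not the authors' released code): it computes each mask's centroid and pixel area with NumPy, and derives a simple horizontal relation between two structures, the kind of spatial cue the dialogues build on. The function names are hypothetical.

```python
import numpy as np

def mask_stats(mask):
    """Return the centroid (row, col) and pixel area of a binary mask."""
    ys, xs = np.nonzero(mask)
    return (ys.mean(), xs.mean()), len(ys)

def horizontal_relation(mask_a, mask_b):
    """Describe mask_a's position relative to mask_b along the image x-axis."""
    (_, xa), _ = mask_stats(mask_a)
    (_, xb), _ = mask_stats(mask_b)
    return "left of" if xa < xb else "right of"

# Two toy 4x4 masks: object A occupies the left half, object B the right.
mask_a = np.zeros((4, 4), dtype=np.uint8)
mask_a[1:3, 0:2] = 1
mask_b = np.zeros((4, 4), dtype=np.uint8)
mask_b[1:3, 2:4] = 1
```

In practice such relations would be computed in anatomical (patient) coordinates rather than raw pixel coordinates, but the principle is the same.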

Multi-Turn Dialogue Example

The example figure shows bilingual multi-turn medical dialogues that use three types of information (spatial localization, functional information, and pathological information) to drive cross-turn reasoning-based segmentation.

Single-turn Reasoning Segmentation (CN)


Single-turn Reasoning Segmentation (EN)


Multi-turn Reasoning Segmentation


Video Presentation

BibTeX

@InProceedings{MULTISEGBIBM2025,
  title={MTurn-Seg: A Large-Scale Bilingual Medical Benchmark for Multi-Turn Reasoning Segmentation},
  author={Nie, Haitao and Zheng, Yimeng and Ye, Ying and Xie, Bin},
  booktitle={IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
  year={2025},
}