AI is increasingly applied across many fields, including medicine. In radiation oncology, organizations like MVision AI are contributing to the development of advanced solutions that support clinical practice.
A team of researchers from Karolinska University Hospital in Stockholm, Sweden, recently published a new study in the European Journal of Medical Physics (1). The study aimed to evaluate AI-based auto-contouring models that mimic different vendors’ commercial products. The researchers provided each vendor with the same dataset for training a new model, and the predictions of these custom models were compared with the clinic’s local reference contours.
We will further discuss the performance of the MVision AI model in this independent research project, as well as the factors that can influence the results of testing AI-based auto-contouring solutions in clinical settings.
How is auto-contouring software produced and tested?
Before reviewing the research results, we will briefly outline the general process of developing an AI-based auto-contouring solution to provide better context.
Not too long ago, atlas-based segmentation produced predefined results derived from previously collected data, relying on similarities to cases stored in reference “libraries.” Today’s algorithms, by contrast, can improve their performance through repeated comparison with reference data and fine-tuning.
The principles are similar to those in other medical fields where AI identifies structures of interest on images, such as radiology and pathology. Development starts with a dataset used to build a reference: a set of rules adapted to what needs to be identified. In radiotherapy, this primary set consists of images (usually CT or MRI scans) together with sets of contoured anatomical structures. Based on these labeled images, an algorithm is trained to analyze the visual information from a scan without pre-existing contours and to estimate the probability that a given pixel/voxel belongs to a given structure.
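To make this concrete, here is a minimal sketch of the prediction step, assuming a network that outputs one score (logit) per structure for every voxel; the array shapes and function names are illustrative and not details of any vendor’s implementation:

```python
import numpy as np

def voxel_probabilities(logits: np.ndarray) -> np.ndarray:
    """Turn per-structure scores into probabilities with a softmax.

    logits: array of shape (n_structures, depth, height, width),
    one score per candidate structure for every voxel.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# The predicted label map assigns each voxel to its most probable structure:
# probs = voxel_probabilities(logits_from_some_network)
# label_map = probs.argmax(axis=0)
```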
The trained algorithm is then tested on a separate set of scans whose structures were delineated according to the same contouring rules as the training set.
Based on the errors that occur, the algorithm fine-tunes its performance by adjusting its internal parameters, its “inner rules”. After this stage, it is usually tested again on a new set of images that were not used in the training process. This step is known as the validation stage. The developer analyzes the performance and, if necessary, repeats the process, sometimes adding new scans, changing the proportion of training and testing scans, or modifying the code.
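As a rough illustration of what adjusting the “inner rules” means in practice, the sketch below shows one parameter update for a generic 3D segmentation network trained with a soft Dice loss; `model`, the tensor shapes, and the loss choice are assumptions made for illustration, not details from the study or from MVision’s pipeline:

```python
import torch

def soft_dice_loss(probs: torch.Tensor, target: torch.Tensor,
                   eps: float = 1e-6) -> torch.Tensor:
    """1 - Dice overlap, averaged over structures; inputs are (N, C, D, H, W)."""
    dims = (0, 2, 3, 4)
    intersection = (probs * target).sum(dims)
    denominator = probs.sum(dims) + target.sum(dims)
    return 1.0 - ((2.0 * intersection + eps) / (denominator + eps)).mean()

def training_step(model, optimizer, scan, reference):
    """One parameter update: predict, measure the error, adjust the weights."""
    optimizer.zero_grad()
    probs = torch.softmax(model(scan), dim=1)  # voxel-wise structure probabilities
    loss = soft_dice_loss(probs, reference)    # error vs. the reference contours
    loss.backward()                            # gradients: how to reduce the error
    optimizer.step()                           # the "inner rules" (weights) change
    return loss.item()
```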
These internal evaluations by the company producing the auto-contouring models are followed by clinical testing. Radiation oncologists, medical physicists, and dosimetrists evaluate the model performance on scans from their own clinic, as part of research projects or as a preliminary stage of clinical implementation. The evaluation can rely on similarity measures such as the Dice score or the Hausdorff distance, which quantify how much the manual and automatically generated contours overlap, or how far their edges lie from each other. Quality scores can also be used, with varying numbers of categories (from poor to excellent) or an estimated need for modification (from “accepted as is” to “delete and start from scratch”). Measuring the time needed to edit the contours until they become clinically acceptable is another frequent approach to quality assessment.
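For readers who want to see what these similarity measures compute, here is a minimal sketch of the volumetric Dice score and the 95th-percentile Hausdorff distance for two binary masks, using standard distance transforms; it assumes non-empty masks and is not the evaluation code used in the study:

```python
import numpy as np
from scipy import ndimage

def dice_score(a: np.ndarray, b: np.ndarray) -> float:
    """Volumetric Dice: 2|A∩B| / (|A| + |B|) for two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def surface(mask: np.ndarray) -> np.ndarray:
    """Boundary voxels: mask voxels with at least one background neighbour."""
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def hd95(a: np.ndarray, b: np.ndarray, spacing) -> float:
    """95th-percentile Hausdorff distance (mm), given the voxel spacing in mm."""
    sa, sb = surface(a), surface(b)
    dist_to_sb = ndimage.distance_transform_edt(~sb, sampling=spacing)
    dist_to_sa = ndimage.distance_transform_edt(~sa, sampling=spacing)
    # Pool surface-to-surface distances in both directions, take the 95th percentile.
    return float(np.percentile(np.concatenate([dist_to_sb[sa], dist_to_sa[sb]]), 95))
```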
What factors influence the clinical performance of an AI-based auto-contouring solution?
First of all, we can assume that the clinical utility of auto-contouring software is directly linked to the quality of the structures it creates. If editing an auto-generated structure takes longer than manual contouring, it does not fulfill its purpose.
There are multiple factors that influence the output of auto-contouring software. The quality of the training data and the “smartness” of the algorithm are only two of them. The accuracy of the contours used for training, the technical properties of the scans, and the variability of the training scans are also important. Other details, such as the size of the training set, post-processing steps, scanner type, the use of intravenous contrast, metal artifacts, or heavily altered anatomy, can also influence the accuracy of predictions.
On the other hand, when clinically evaluating the quality of the contours, the reference contours, also known as the “ground truth”, play a major role in determining acceptability. Interobserver variation has been extensively documented. Despite the availability of expert consensus and contouring guidelines, national recommendations and local styles still influence clinical contours. Even among international-level experts there is no perfect consensus, and each case’s anatomy leaves room for personal interpretation.
How was the study conducted?
Of the seven vendors invited to take part in the study, MVision and two other companies agreed to participate. The researchers provided a curated set of 250 CT scans and structure sets from lung cancer patients who had already received radiotherapy. Each company trained a basic custom model on this data, similar to its commercial model. The study design can be compared to cooking the same dish from the same ingredients while following recipes from different restaurants.
Quantitative evaluation of these custom models was performed on 50 test patients. For a subset of 15 patients, a qualitative assessment was performed by a medical physicist supported by a radiation oncologist. Lastly, three clinicians independently contoured five cases to evaluate the impact on interobserver variation.
How did the MVision AI custom model perform in the Swedish evaluation?
The evaluation involved 16 thoracic volumes, several of them bilateral: heart and vessels (aorta, inferior and superior vena cava, pulmonary artery and veins), airways (trachea, main bronchi, intermediate bronchus), digestive tract structures (oesophagus, stomach), thoracic wall, and brachial plexus.
The researchers calculated the volumetric Dice Similarity Coefficient (DSC), the surface DSC (sDSC), the 95th percentile of the Hausdorff distance (HD95), and the average symmetric surface distance (ASSD, sometimes referred to as the average HD).
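The surface DSC differs from the volumetric DSC in that it scores agreement between the contours’ surfaces rather than their volumes: it is the fraction of both surfaces lying within a chosen tolerance of each other. A minimal sketch follows, reusing the surface-distance machinery from the earlier example; the 2 mm tolerance is illustrative, and the study defines its own:

```python
import numpy as np
from scipy import ndimage

def surface_dsc(a: np.ndarray, b: np.ndarray, spacing, tol_mm: float = 2.0) -> float:
    """Surface DSC: fraction of both surfaces within tol_mm of the other surface."""
    sa = a.astype(bool) & ~ndimage.binary_erosion(a.astype(bool))
    sb = b.astype(bool) & ~ndimage.binary_erosion(b.astype(bool))
    # Distance from every voxel to the nearest surface voxel of the other mask.
    dist_to_sb = ndimage.distance_transform_edt(~sb, sampling=spacing)
    dist_to_sa = ndimage.distance_transform_edt(~sa, sampling=spacing)
    within_tol = (dist_to_sb[sa] <= tol_mm).sum() + (dist_to_sa[sb] <= tol_mm).sum()
    return float(within_tol) / (sa.sum() + sb.sum())
```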
When compared to the clinic’s manual contours, the MVision custom-made auto-contouring model produced excellent-quality structures.
Details can be found in the table below.
- The DSC was above 0.9 for 10 out of 16 structures and above 0.8 for another three structures.
- The sDSC was above 0.9 for 5 structures and at least 0.8 for another 5.
- The ASSD and HD95 were less than or equal to 2 mm for 10 structures.
Interestingly, all vendors generally obtained lower scores for the same volumes, such as the brachial plexus. The authors attributed this finding to the additional challenge of identifying a relatively small, branching structure on low-contrast CT images. However, the models produced a more coherent 3D structure for the brachial plexus than the manual segmentations, which often lacked inter-slice connection.
Compared to the results of other vendors, MVision AI had top scores for 13 structures (DSC), 14 structures (sDSC), 11 structures (ASSD), and 10 structures (HD95) out of the total 16 structures.
Despite limitations such as the single-centre evaluation, possible variation in the ground truth data, and the use of custom models (rather than the commercially available ones), the study makes an important contribution to the current knowledge. Since its objective was not clinical implementation, the qualitative evaluation prioritized identifying systematic deviations between auto-segmented and ground truth contours.
Results of such evaluations benefit clinicians and vendors alike. Clinicians become more aware of the possible pitfalls of automatic contouring and can increase their vigilance for certain structures or error types. Vendors gain the opportunity to improve their models and to create new quality-improvement features, such as automatic outlier detection.
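As a purely hypothetical illustration of what automatic outlier detection could look like, a simple vendor-side check might flag structures whose volume deviates strongly from population statistics; the function, thresholds, and numbers below are invented for illustration:

```python
import numpy as np

def flag_volume_outliers(volumes_cc, population_mean_cc, population_sd_cc,
                         z_threshold=3.0):
    """Flag auto-contoured structures whose volume is unusually far from the
    population mean (hypothetical rule; thresholds are illustrative)."""
    z = (np.asarray(volumes_cc, dtype=float) - population_mean_cc) / population_sd_cc
    return np.abs(z) > z_threshold

# Example: a 900 cc "heart" is flagged against a 600 +/- 90 cc population:
# flag_volume_outliers([610, 900], population_mean_cc=600, population_sd_cc=90)
# -> array([False,  True])
```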
Table 1. Median (range) values from the quantitative evaluation of auto-contours produced by the custom MVision model developed for this study
Structure | DSC | sDSC | ASSD (mm) | HD95 (mm) |
---|---|---|---|---|
Aorta | 0.96 (0.74–0.97) | 0.93 (0.70–0.98) | 0.05 (0.03–7.57) | 0.00 (0.00–72.98) |
Brachial plexus left | 0.53 (0.39–0.65) | 0.69 (0.40–0.80) | 1.21 (0.53–2.83) | 6.27 (2.23–113.83) |
Brachial plexus right | 0.54 (0.35–0.64) | 0.69 (0.32–0.81) | 1.28 (0.52–2.79) | 7.38 (2.23–123.25) |
Heart | 0.97 (0.94–0.98) | 0.79 (0.42–0.91) | 0.05 (0.03–0.12) | 0.00 (0.00–2.18) |
Intermediate bronchus | 0.92 (0.75–0.96) | 0.90 (0.57–0.99) | 0.13 (0.03–1.18) | 2.00 (0.00–6.39) |
Main bronchus left | 0.93 (0.79–0.94) | 0.93 (0.79–0.98) | 0.11 (0.05–0.39) | 0.98 (0.00–5.30) |
Main bronchus right | 0.86 (0.52–0.94) | 0.83 (0.38–0.95) | 0.30 (0.08–1.24) | 3.00 (0.98–6.00) |
Oesophagus | 0.88 (0.33–0.92) | 0.89 (0.43–0.94) | 0.19 (0.12–26.95) | 1.38 (0.98–153.37) |
Pulmonary artery | 0.94 (0.85–0.95) | 0.89 (0.57–0.96) | 0.11 (0.06–2.04) | 0.98 (0.00–40.67) |
Pulmonary vein | 0.69 (0.54–0.84) | 0.63 (0.23–0.85) | 3.15 (0.36–22.56) | 31.83 (2.76–96.57) |
Stomach | 0.93 (0.74–0.97) | 0.78 (0.39–0.90) | 0.24 (0.05–9.38) | 2.88 (0.00–75.10) |
Thoracic wall left | 0.96 (0.92–0.97) | 0.91 (0.68–0.96) | 0.08 (0.03–0.24) | 0.98 (0.00–1.95) |
Thoracic wall right | 0.96 (0.93–0.97) | 0.90 (0.69–0.96) | 0.08 (0.04–0.32) | 0.98 (0.00–3.00) |
Trachea | 0.95 (0.89–0.97) | 0.88 (0.70–0.99) | 0.08 (0.02–0.27) | 0.98 (0.00–2.93) |
Vena cava inferior | 0.84 (0.73–0.91) | 0.69 (0.42–0.88) | 0.37 (0.11–3.60) | 3.52 (0.98–57.73) |
Vena cava superior | 0.92 (0.85–0.96) | 0.87 (0.51–0.99) | 0.13 (0.05–0.42) | 1.33 (0.98–3.09) |
DSC – Dice Similarity Coefficient, sDSC – surface DSC, ASSD – average symmetric surface distance, HD95 – the 95th percentile of the Hausdorff distance
Reference
1. Emin S, Rossi E, Hedman M, Giovenco M, Villegas F, Onjukka E. Performance of multi-vendor auto-segmentation models for thoracic organs at risk trained on a single dataset. Phys Med. Published online August 23, 2025. doi:10.1016/j.ejmp.2025.105089