Deep Learning & Radiomics for Extracranial Carotid Plaque Detection: A Meta-Analysis
Journal of Medical Internet Research (JMIR)
January 22, 2026

AI-Generated Summary
A systematic review and meta-analysis found that deep learning and radiomics algorithms effectively diagnose extracranial carotid plaque. Both methods demonstrated high diagnostic performance, with overall pooled sensitivity and specificity of 0.88 and 0.89, respectively. However, research design irregularities and a lack of multicenter studies and external validation limit the robustness of these findings.
Introduction
Extracranial carotid plaques are biomarkers of coronary artery disease and cerebral ischemic events, including ischemic heart disease and stroke. The global prevalence of carotid plaques among individuals aged 30‐79 years was estimated at 21.1% (n=815.76 million) in 2020. This high prevalence reflects a growing global burden of cardiovascular and cerebrovascular diseases, posing a significant challenge to public health systems [ ]. Early detection and management of carotid plaque can therefore potentially reduce the risk of stroke and cardiovascular events [ - ], so effective detection and classification technologies need to be prioritized.
Imaging methods for carotid plaque, such as ultrasound, computed tomography angiography (CTA), magnetic resonance imaging (MRI), and digital subtraction angiography, facilitate plaque detection, stenosis assessment, and composition analysis [ ]. Conventional ultrasound is the first-line screening method [ ]. Studies show that periapical radiographs (PRs) can serve as a supplementary screening tool, demonstrating 50% concordance with ultrasound or CTA [ - ]. Current imaging primarily identifies high-risk features, such as plaque neovascularity, lipid-rich necrotic cores, thin fibrous caps, intraplaque hemorrhage, and plaque ulceration [ , ]. Among these, contrast-enhanced ultrasound and superb microvascular imaging can accurately quantify neovascularization and correlate well with histopathology [ - ], offering rapid, noninvasive, and reliable quantification [ ]. CTA is proficient in vascular imaging and ulcer detection [ ], as well as stenosis assessment [ ], but it faces challenges with small lipid cores and thin fibrous caps [ ]. MRI remains the gold standard for assessing plaque composition, particularly for identifying lipid cores and intraplaque hemorrhage [ ]. Although digital subtraction angiography is the reference standard, its invasive nature limits its application. Notably, the accuracy of these diagnostic techniques largely relies on the expertise of imaging or clinical physicians, which causes inconsistencies in the assessment of carotid atherosclerotic plaques, particularly in measuring carotid intima-media thickness, characterizing intraplaque components, and evaluating fibrous cap integrity.
Radiomics algorithms and deep learning (DL) models have demonstrated significant potential in medical image analysis [ ]. Radiomics is a quantitative medical imaging analysis approach that transforms high-dimensional image features (such as texture heterogeneity, spatial topological relationships, and intensity distribution) into quantifiable digital biomarkers, thereby providing objective evidence to guide clinical decision-making. However, the feature dimensionality of radiomics data often far exceeds the sample size, which renders traditional statistical methods inadequate [ ]. Machine learning (ML) has the potential to process large-scale, high-dimensional data and to uncover deep correlations among these complex features [ ]. Combining radiomics with ML can therefore enhance diagnostic performance on large and complex datasets, exceeding that of models constructed with traditional statistical methods.
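To make this radiomics-plus-ML workflow concrete, the following minimal sketch (illustrative only, not taken from any of the included studies) pairs L1-penalized feature selection with a logistic regression classifier on a synthetic high-dimensional feature matrix; the sample size, feature count, and signal structure are assumptions.

```python
# Illustrative radiomics-to-ML pipeline on synthetic data (not from the reviewed studies).
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients, n_features = 120, 850                      # feature count >> sample size, as is typical in radiomics
X = rng.normal(size=(n_patients, n_features))          # stand-in for extracted radiomics features
signal = X[:, :5].sum(axis=1)                          # assume only the first 5 features carry signal
y = (signal + rng.normal(scale=0.5, size=n_patients) > 0).astype(int)   # eg, vulnerable vs stable plaque

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # L1-penalized selection shrinks hundreds of candidate features to the informative few
    ("select", SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=1.0))),
    ("classify", LogisticRegression(max_iter=1000)),
])

auc = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {auc.mean():.2f} (SD {auc.std():.2f})")
```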
DL is also an important subbranch of artificial intelligence; it automatically learns hierarchical feature representations from raw data without manually designed features, ultimately generating predictions via an output layer [ ]. DL-driven image generation techniques have demonstrated remarkable effectiveness in cross-modality imaging and in synthesis tasks across sequences within the same modality. With the rapid development of computer technology, ML models based on radiomics and DL models have become important tools for cardiovascular disease research. Current evidence suggests that these methods can significantly improve the quantitative assessment of atherosclerotic plaque progression and enhance the diagnosis and prediction of major adverse cardiovascular events [ - ]. In recent years, research applying these methods to plaque diagnosis, stability assessment, and symptomatic plaque identification has increased significantly. Although these advancements have improved the diagnosis of carotid plaques, variations in data dependency and imaging configurations among models create inconsistencies in diagnostic accuracy. Moreover, these models may become overly specialized to common imaging configurations, even when using radiomics data from identical sources. Systematic evaluations of their clinical validity remain limited.
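As a parallel illustration of the DL route, and of the transfer learning discussed later, the short sketch below fine-tunes a pretrained ResNet-18 from torchvision for a binary plaque classification task; the two-class head, frozen backbone, and dummy input batch are assumptions for illustration, not a reproduction of any reviewed model.

```python
# Illustrative transfer-learning setup for binary carotid plaque classification (assumed, simplified).
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")    # backbone pretrained on natural images
for param in model.parameters():
    param.requires_grad = False                     # freeze pretrained features (transfer learning)
model.fc = nn.Linear(model.fc.in_features, 2)       # new head: eg, plaque present vs absent

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

# One hypothetical training step on a dummy batch of preprocessed image patches
images = torch.randn(8, 3, 224, 224)                # stand-in for 8 ultrasound, CTA, or MRI patches
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(f"dummy-batch loss: {loss.item():.3f}")
```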
Therefore, this systematic review comprehensively assesses the application of ML models based on radiomics algorithms and DL models to carotid plaque diagnosis, while highlighting gray areas in the available literature.
Methods
Study Registration
The study was performed in line with the PRISMA-DTA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies) guidelines [ ] and PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) standards [ , ] and was registered on the International Prospective Register of Systematic Reviews (PROSPERO CRD42025638492).
Data Sources and Search Strategy
Relevant articles were searched on PubMed, Embase, Web of Science, Cochrane Library, and Institute of Electrical and Electronics Engineers (IEEE) databases, focusing on English-language articles published up to September 24, 2025. The literature search was based on the PIO (population, intervention, and outcomes) principles: “P” represents carotid artery disease, carotid plaques, or atherosclerosis populations; “I” represents radiomics or DL as interventions; and “O” represents the outcomes of diagnosis and their subordinates and other keywords. Furthermore, we manually analyzed the reference lists of all included articles to identify additional relevant publications. The complete search strategy is outlined in Table S1 in . The EndNote 20 software (Clarivate Analytics) was used to manage the included studies.
Eligibility Criteria
Inclusion Criteria
The inclusion criteria were as follows:
Studies on patients with extracranial carotid plaques that aimed to detect plaques or distinguish unstable or symptomatic plaques, among other targets.
Studies using radiomics algorithms or DL models based on medical imaging techniques, such as ultrasound, CTA, or MRI, to diagnose carotid plaques.
Studies that reported diagnostic performance metrics, including confusion matrices, 2×2 diagnostic tables, accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curves, F1-score, precision, and recall, among others.
Those that adopted the following designs: prospective or retrospective cohorts, diagnostic accuracy trials, model development or validation studies, and comparative studies (eg, AI models vs AI models combined with clinical features).
Only studies published in English and with extractable quantitative data were deemed eligible.
Exclusion Criteria
The exclusion criteria were as follows:
Studies involving nonhuman subjects (animal experiments or in vitro models), those that explored intracranial or coronary plaques, enrolled pediatric populations (<18 years), or reported only generalized atherosclerosis without plaque-specific criteria (focal intima-media thickness ≥1.5 mm) or specific diagnostic metrics;
Those that did not adopt well-defined deep learning models or radiomics algorithms, focused only on image segmentation or texture analysis without diagnostic validation, or reported predictive models without providing a clear diagnostic relevance.
Studies that lacked a validated reference standard.
Studies that did not report diagnostic performance.
Informal publication types (eg, reviews, letters to the editor, editorials, and conference abstracts).
Studies that did not report validation or test sets.
Screening of Articles and Data Extraction
In the initial screening, duplicates were excluded, titles and abstracts were screened, and the full texts of the remaining articles were read. Data were entered into a predefined extraction table, which included author surnames, data source, publication year, algorithm architecture, type of internal validation, availability of open access data, external validation status, reference standard, use of transfer learning, number of cases for training, testing, internal validation, or external validation, study design, sample size, mean or median age, inclusion criteria, and model evaluation metrics. Contingency tables were derived from the models explicitly identified by the original authors as the best performing. Data from external validation sets were prioritized; if no external validation set was reported, data from internal validation sets were used, and if neither was available, the contingency tables corresponding to the test sets were selected. This process was performed independently by two researchers (LJ and YG), and any differences were resolved through discussion with a third researcher (HG).
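The data set prioritization rule described above (external validation set first, then internal validation set, then test set) can be summarized, purely as an illustration with hypothetical field names, as follows:

```python
# Hypothetical helper mirroring the extraction rule described above; field names are assumptions.
def select_contingency_table(study: dict) -> tuple:
    """Return (2x2 table, data set label) for a study's best-performing model,
    preferring external validation, then internal validation, then the test set."""
    for dataset in ("external_validation", "internal_validation", "test"):
        table = study.get(dataset)          # expected as (TP, FP, FN, TN) or None
        if table is not None:
            return table, dataset
    raise ValueError("No usable validation or test set reported")

# Example: a study reporting only internal validation and test results
study = {"external_validation": None, "internal_validation": (42, 6, 5, 47), "test": (40, 8, 7, 45)}
print(select_contingency_table(study))      # -> ((42, 6, 5, 47), 'internal_validation')
```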
Quality Assessment
Two blinded investigators (LJ and YG) systematically assessed the quality of studies using the Quality Assessment of Diagnostic Accuracy Studies for Artificial Intelligence (QUADAS-AI) tool. Specifically, they evaluated the risk of bias and applicability concerns across 4 domains: flow and timing, reference standard, index test, and participant selection. Although the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) is extensively applied to assess the quality of diagnostic accuracy studies [ ], it does not address the specific methodological choices, result analyses, and measurements related to diagnostic studies using AI. To address this gap, QUADAS-AI was developed as a consensus-based tool to aid readers in systematically examining the risk of bias and the usability of AI-related diagnostic accuracy studies (Table S6 in ) [ ], thereby improving the quality assessment process [ , ]. Any evaluation discrepancies were resolved by a third investigator (HG).
Statistical Analysis
A meta-analysis was performed using STATA/MP software (version 17.0; Stata Corporation) with a bivariate random-effects model. For meta-analyses of the diagnostic accuracy of AI-based models, bivariate mixed-effects models can account for both within-study variability and between-study heterogeneity, ensuring the robustness of the pooled estimates [ ]. Contingency tables were generated from the included literature, from which metrics such as the number of cases, the Youden index, sensitivity, specificity, and recall were calculated. The diagnostic efficacy of radiomics algorithms and DL models in evaluating carotid plaque was determined using a summary receiver operating characteristic (SROC) curve and the area under the curve (AUC; 0.7≤AUC<0.8 fair; 0.8≤AUC<0.9 good; AUC≥0.9 excellent). Publication bias was explored using the Deeks funnel plot asymmetry test. A Fagan nomogram was developed to determine clinically pertinent posttest probabilities (P-post) and likelihood ratios (LRs). LRs were determined by comparing the probability of test results between diseased and nondiseased groups; the pretest probability was then adjusted based on the test result and the LR to obtain P-post [ ]. The Cochran Q test (P≤.05) and the I² statistic were used to explore heterogeneity among the included studies, and meta-regression analysis was conducted to assess sources of heterogeneity. I²≤50% indicated mild heterogeneity, 50%<I²≤75% indicated moderate heterogeneity, and I²>75% indicated high heterogeneity. A sensitivity analysis was performed by excluding studies with extreme diagnostic performance (AUC >0.95 or <0.7; n=11 studies).
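For intuition about the pooling step, and not as a substitute for the bivariate model fitted in STATA, the simplified sketch below pools sensitivity and specificity separately on the logit scale with a DerSimonian-Laird random-effects estimator and reports I²; the 2×2 tables are hypothetical.

```python
# Simplified univariate pooling sketch (hypothetical data); the paper itself fits a bivariate model in STATA.
import numpy as np

def pool_logit(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.
    events/totals: per-study counts, eg, TP and TP+FN for sensitivity.
    Returns the pooled proportion, its 95% CI, and the I2 statistic.
    A 0.5 continuity correction avoids division by zero."""
    e = np.asarray(events, float) + 0.5
    n = np.asarray(totals, float) + 1.0
    y = np.log(e / (n - e))                  # logit-transformed proportions
    v = 1 / e + 1 / (n - e)                  # approximate within-study variances
    w = 1 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fixed) ** 2)       # Cochran Q
    df = len(y) - 1
    tau2 = max(0.0, (q - df) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = 1 / (v + tau2)                    # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    expit = lambda x: 1 / (1 + np.exp(-x))   # back-transform to a proportion
    return expit(y_re), (expit(y_re - 1.96 * se), expit(y_re + 1.96 * se)), i2

# Hypothetical 2x2 tables from three studies: (TP, FN, FP, TN)
tables = [(80, 12, 10, 90), (45, 5, 8, 60), (120, 20, 15, 140)]
tp, fn, fp, tn = map(np.array, zip(*tables))
sens, sens_ci, sens_i2 = pool_logit(tp, tp + fn)
spec, spec_ci, spec_i2 = pool_logit(tn, tn + fp)
print(f"pooled sensitivity {sens:.2f} (95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f}), I2={sens_i2:.1f}%")
print(f"pooled specificity {spec:.2f} (95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f}), I2={spec_i2:.1f}%")
print(f"Youden index {sens + spec - 1:.2f}")
```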
Results
Study Selection
We retrieved 5834 records in the initial search, of which 1233 were excluded as duplicates or redundant. After screening titles and abstracts, 4507 publications were eliminated. After the full texts of the remaining 94 articles were read, 40 studies were eligible for inclusion in the systematic review. The PRISMA flow diagram showing the selection process is presented in .
Study Characteristics
Among the 40 studies that fulfilled the systematic review’s inclusion criteria, 34 provided sufficient quantitative data (contingency tables from validation or test sets) eligible for incorporation into the meta-analysis. The detailed characteristics of all 40 eligible studies are summarized in Tables S3 and S4 in , while all subsequent quantitative analyses were conducted based on the 34 studies with available quantitative data. Overall, 34 studies were included [ - ], among which 9 were multicenter studies [ , , , , , - , ], 3 used public databases [ , , ], and 13 provided open access to the data [ , , , - , , , , - ]. A total of 12 studies conducted internal validation [ , , , , , , , , , , , ] to confirm the reproducibility of the model development process and prevent overfitting. In addition, 7 studies conducted external validation [ , , , , , , ] to assess the model’s transportability and generalizability using previously unused datasets. Only 1 study conducted a comparative analysis of the diagnostic performance of DL models with that of clinicians [ ]. The medical imaging modalities included PRs (n=5), ultrasound (n=16), MRI (n=5), and CTA (n=8). The core features of the 34 studies are presented in and , with further details provided in Tables S2 and S3 in .
Meta-Analysis of Diagnostic Performance
Synthesized Results
The meta-analysis revealed pooled sensitivity, specificity, and an area under the SROC curve (SROC AUC) of 0.88 (95% CI 0.85‐0.91; I2=93.58%; P<.001; in [ - ]), 0.89 (95% CI 0.85‐0.92; I2=91.38%; P<.001; in [ - ]), and 0.95 (95% CI 0.92‐0.96) for all 34 studies ( ); 0.88 (95% CI 0.84‐0.92; I2=93.70%; P<.001; [ - ]), 0.91 (95% CI 0.86‐0.94; I2=95.55%; P<.001; [ - ]), and 0.95 (95% CI 0.93‐0.97) for all DL models ( ); 0.89 (95% CI 0.82‐0.93; I2=90.20%; P<.001; [ - ]), 0.83 (95% CI 0.76‐0.88; I2=78.92%; P<.001; [ - ]), and 0.92 (95% CI 0.89‐0.94) for all ML models based on radiomics algorithms ( ), respectively. Notably, some studies used multiple diagnostic models; however, the diagnostic accuracy of certain models was not thoroughly assessed.
Subgroup Analysis
Medical Imaging Modalities
The pooled sensitivity, specificity, and SROC AUC were 0.91 (95% CI 0.80‐0.96), 0.93 (95% CI 0.84‐0.97), and 0.97 (95% CI 0.95‐0.98) for the 5 studies using PRs (P<.001; with 5 contingency tables; ); 0.89 (95% CI 0.84‐0.93), 0.90 (95% CI 0.84‐0.94), and 0.95 (95% CI 0.93‐0.97) for the 16 studies using ultrasound images (P<.001; with 16 contingency tables; ); 0.87 (95% CI 0.87‐0.92), 0.87 (95% CI 0.76‐0.93), and 0.93 (95% CI 0.91‐0.95) for the 5 studies using MRI images (P<.001; with 5 contingency tables; ); and 0.83 (95% CI 0.76‐0.88), 0.83 (95% CI 0.75‐0.89), and 0.90 (95% CI 0.87‐0.92) for the 8 studies using CTA images (P<.001; with 8 contingency tables; ), respectively. In addition, we conducted subgroup analyses within each imaging modality, stratified by differentiation task. However, only the subgroups identifying the presence and the stability of plaques within the ultrasound modality had sufficient data to perform statistical analyses and obtain pooled diagnostic performance metrics (Table S5 in ). The pooled sensitivity, specificity, and SROC AUC were 0.88 (95% CI 0.72‐0.96), 0.91 (95% CI 0.80‐0.96), and 0.95 (95% CI 0.93‐0.97) for determining the presence of plaques (P<.001; with 5 contingency tables; ), and 0.90 (95% CI 0.84‐0.94), 0.92 (95% CI 0.83‐0.96), and 0.96 (95% CI 0.94‐0.97) for distinguishing the stability of plaques (P<.001; with 8 contingency tables; ).
Use of Transfer Learning
The pooled sensitivity, specificity, and SROC AUC were 0.92 (95% CI 0.87‐0.95), 0.93 (95% CI 0.88‐0.96), and 0.97 (95% CI 0.95‐0.96) for the 10 studies using transfer learning (P<.001; with 10 contingency tables; ) and 0.86 (95% CI 0.82‐0.90), 0.86 (95% CI 0.81‐0.90), and 0.93 (95% CI 0.90‐0.95) for the 24 studies without transfer learning (P<.001; with 24 contingency tables; ), respectively.
Carotid Plaque Type
The pooled sensitivity, specificity, and AUC were 0.89 (95% CI 0.81‐0.94), 0.91 (95% CI 0.86‐0.95), and 0.96 (95% CI 0.94‐0.97) for the 11 studies identifying the presence or absence of carotid plaques (P<.001; with 11 contingency tables; ); 0.90 (95% CI 0.85‐0.94), 0.91 (95% CI 0.85‐0.95), and 0.96 (95% CI 0.94‐0.97) for the 12 studies identifying stable or vulnerable carotid plaques (P<.001; with 12 contingency tables; ); and 0.86 (95% CI 0.78‐0.91), 0.81 (95% CI 0.74‐0.87), and 0.90 (95% CI 0.87‐0.92) for the 10 studies identifying symptomatic or asymptomatic plaques (P<.001; with 10 contingency tables; ), respectively.
Pure Artificial Intelligence Models Versus Models Constructed by Combining Clinical Features
The pooled sensitivity, specificity, and SROC AUC were 0.82 (95% CI 0.74‐0.88), 0.74 (95% CI 0.69‐0.79), and 0.77 (95% CI 0.73‐0.80) for the 7 studies involving pure artificial intelligence models meeting the inclusion criteria (P<.001; with 7 contingency tables; ) and 0.85 (95% CI 0.76‐0.92), 0.75 (95% CI 0.70‐0.80), and 0.77 (95% CI 0.73‐0.81) for models constructed by combining clinical features (P<.001; with 7 contingency tables; ), respectively.
Different Sets of Datasets
The pooled sensitivity, specificity, and AUC were 0.90 (95% CI 0.87‐0.93), 0.91 (95% CI 0.87‐0.93), and 0.96 (95% CI 0.94‐0.97) for testing sets (P<.001; with 27 contingency tables; ); 0.78 (95% CI 0.71‐0.83), 0.80 (95% CI 0.73‐0.86), and 0.86 (95% CI 0.82‐0.88) for external validation sets (P<.001; with 7 contingency tables; ), respectively.
Low and High or Unclear Risk of Bias Studies
The pooled sensitivity, specificity, and AUC were 0.80 (95% CI 0.73‐0.85), 0.80 (95% CI 0.71‐0.87), and 0.86 (95% CI 0.83‐0.89) for studies with a low risk of bias (P<.001; with 5 contingency tables; ), and 0.89 (95% CI 0.86‐0.92), 0.90 (95% CI 0.86‐0.93), and 0.95 (95% CI 0.93‐0.97) for studies with a high or unclear risk of bias (P<.001; with 29 contingency tables; ), respectively.
Different Sample Sizes of Model
The pooled sensitivity, specificity, and AUC were 0.91 (95% CI 0.86‐0.94), 0.92 (95% CI 0.87‐0.95), and 0.97 (95% CI 0.95‐0.98) for sample size≥200 (P<.001; with 14 contingency tables) ( ), and 0.85 (95% CI 0.80‐0.88), 0.86 (95% CI 0.80‐0.90), and 0.91 (95% CI 0.89‐0.94) for sample size<200 (P<.001; with 20 contingency tables; ), respectively.
Models With Different Research Designs (Multicenter Studies and Single-Center Studies)
The pooled sensitivity, specificity, and AUC were 0.84 (95% CI 0.77‐0.89), 0.87 (95% CI 0.81‐0.91), and 0.92 (95% CI 0.90‐0.94) for multicenter studies (P<.001; with 9 contingency tables; ), and 0.89 (95% CI 0.84‐0.92), 0.89 (95% CI 0.84‐0.93), and 0.95 (95% CI 0.93‐0.97) for single-center studies (P<.001; with 22 contingency tables; ), respectively.
Heterogeneity Analysis and Meta-Regression Analysis
The Cochran Q test was used to indicate the presence of heterogeneity among subgroups (significance level P≤.05) [ ]. The I² index was used to assess the extent of heterogeneity among studies [ ], revealing high heterogeneity in both sensitivity (I²=93.58%) and specificity (I²=91.38%; ). The Deeks funnel plot asymmetry test, with P=.21, indicated no apparent publication bias ( ). Subgroup analyses were performed using random-effects models to identify potential sources of heterogeneity, particularly when I² exceeded 50% [ ]. Results were as follows:
AI model for carotid plaques: Both ML models based on radiomics algorithms and DL models showed high heterogeneity in sensitivity (I2=90.20% and 93.70%, respectively) and specificity (I2=78.92% and 95.55%, respectively), indicating high diagnostic performance accompanied by significant heterogeneity ( [ - ]).
Medical imaging modalities: the sensitivity and specificity for PRs (sensitivity I2=82.28%; specificity I2=79.16%; [ - ]) and ultrasound (sensitivity I2=96.92%; specificity I2=94.98%; [ - ]) displayed high heterogeneity. The sensitivity and specificity for MRI (sensitivity I2=71.57%; specificity I2=73.21%; [ - ]) and the sensitivity for CTA (I2=56.80%) displayed moderate heterogeneity ( [ - ]). The heterogeneity in specificity for CTA (I2=83.79%) was high ( [ - ]). Within the ultrasound modality, heterogeneity in sensitivity and specificity was high both for determining the presence of plaques (sensitivity I2=96.78%; specificity I2=97.97%; [ - ]) and for distinguishing the stability of plaques (sensitivity I2=97.01%; specificity I2=94.43%; [ - ]).
Use of transfer learning: the specificity for models using transfer learning (specificity I2=74.85%; [ - ]) displayed moderate heterogeneity. The sensitivity for models using transfer learning (sensitivity I2=79.84%; [ - ]) and the sensitivity and specificity for the models without transfer learning (sensitivity I2=94.12%; specificity I2=87.35%; [ - ]) were high.
Carotid plaque type: all plaque types showed high heterogeneity in both sensitivity and specificity: presence or absence of plaques (sensitivity I2=94.08%; specificity I2=97.60%; part A in [ - ]), stable or vulnerable plaques (sensitivity I2=95.19%; specificity I2=91.29%; part B in [ - ]), and symptomatic or asymptomatic plaques (sensitivity I2=93.28%; specificity I2=84.67%; part C in [ - ]).
Pure AI models versus combined models: neither the pure AI models (sensitivity I2=62.97%; specificity I2=2.41%; part B in [ , , , , , , ]) nor the models combined with clinical features (sensitivity I2=69.77%; specificity I2=40.08%; part A in [ , , , , , , ]) exhibited high heterogeneity.
Different sets of datasets: both the testing sets (sensitivity I2=94.23%; specificity I2=93.45%; part A in [ - ]) and the external validation sets (specificity I2=84.42%; part B in [ - ]) showed high heterogeneity, except for the sensitivity of the external validation sets, which was moderate (I2=66.67%; part B in [ - ]).
Different risk of bias studies: the sensitivity and specificity for high or unclear risk of bias studies (sensitivity I2=94.61%; specificity I2=92.59%; part B in [ - ]) and the specificity for low risk of bias studies (I2=87.10%) were high (part A in [ - ]). The sensitivity for low risk of bias studies (I2=62.20%) was moderate (part A in [ - ]).
Different sample sizes of model: The sensitivity and specificity for sample size ≥200 (sensitivity I2=97.91%; specificity I2=97.40%; part A in [ - ]) and the specificity for sample size <200 (I2=78.02%; part B in [ - ]) were high. The sensitivity for sample size <200 (I2=60.64%) was moderate (part B in [ - ]).
Models with different research designs: The sensitivity and specificity for multicenter studies (sensitivity I2=81.36%; specificity I2=80.24%; part A in [ , , , - , - ]) and single-center studies (sensitivity I2=95.07 %; specificity I2=90.63%) were high (part B in [ , , , - , - ]).
The meta-regression did not identify the factors contributing to heterogeneity (parts A-I in [ - ]). The results of all subgroups are depicted in Table S4 in . The Fagan nomogram was used to evaluate the diagnostic performance of ML models based on radiomics algorithms and DL models for carotid plaques. The results showed a P-post of 89% and 12% for positive and negative test results, respectively ( ).
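As a rough check on how such posttest probabilities follow from the pooled estimates, the short calculation below applies standard likelihood ratio arithmetic to the overall pooled sensitivity (0.88) and specificity (0.89), assuming a 50% pretest probability (the pretest value is our assumption; it is not stated above). With these inputs, the computed posttest probabilities (about 89% for a positive test and 12% for a negative test) match the values reported above.

```python
# Likelihood ratio / posttest probability arithmetic (pretest probability of 0.50 is an assumption).
sens, spec, pretest = 0.88, 0.89, 0.50

lr_pos = sens / (1 - spec)                     # positive likelihood ratio, about 8.0
lr_neg = (1 - sens) / spec                     # negative likelihood ratio, about 0.13

pre_odds = pretest / (1 - pretest)
post_pos = (pre_odds * lr_pos) / (1 + pre_odds * lr_pos)   # about 0.89
post_neg = (pre_odds * lr_neg) / (1 + pre_odds * lr_neg)   # about 0.12

print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
print(f"posttest probability: positive test {post_pos:.0%}, negative test {post_neg:.0%}")
```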
Sensitivity Analysis
Excluding the specified studies did not significantly change the results (Tables S7 and S8 in ).
Quality Assessment
The quality of the 34 studies was evaluated using the QUADAS-AI tool ( ). QUADAS-AI specifically evaluates bias risk and applicability concerns in AI studies. We observed that most studies had significant bias or applicability concerns, particularly regarding patient selection and the index test. In the “patient selection” domain, 20 studies were classified as either high-risk or indeterminate due to reliance on closed-access data or failure to present the rationale for and breakdown of their training, validation, and test sets. Only the 7 externally validated studies were classified as low-risk in the “index test” domain, while the others showed elevated risk due to the lack of external validation. In the “reference standard” domain, the reference standard of all studies could correctly classify the target condition. In the “flow and timing” domain, 10 studies showed indeterminate risk due to insufficient justification of the interval between the index test and the reference standard. Additionally, 20 studies presented significant applicability concerns in the “patient selection” domain, receiving unclear ratings. In the “index test” domain, 7 studies were rated as having low applicability concerns, while all studies received low applicability-concern ratings in the “reference standard” domain.
Discussion
Principal Findings
This study represents the first systematic evaluation of ML models based on radiomics and DL models for the characterization of extracranial carotid plaques. Both approaches demonstrated robust diagnostic performance, with SROC AUC values of 0.95 and 0.92, respectively, highlighting their promising potential for clinical application in plaque detection and risk stratification.
First, the specificity and SROC AUC of DL models were higher than those of ML models based on radiomics (0.91 vs 0.83 and 0.95 vs 0.92, respectively), while their sensitivity was similar (0.88 vs 0.89). Moreover, we observed that radiomics and DL models used to identify the presence of plaques and stable plaques had similar diagnostic capabilities (SROC AUC 0.96, 95% CI 0.94‐0.97), and both were effective in identifying symptomatic plaques (SROC AUC 0.90, 95% CI 0.87‐0.92). Notably, these differences may not be simply due to model performance but could result from a combination of different clinical objectives (simple exclusion diagnosis vs differentiation of specific cases), imaging variations, and modeling techniques. By using knowledge gained from previous tasks, transfer learning enhances model performance on new datasets and minimizes data requirements. It has been successfully applied in various areas of cardiovascular disease to boost model performance [ , , ]. In our subgroup analyses, models using transfer learning performed better, suggesting that transfer learning can enhance performance in data-limited scenarios and help prevent overfitting. Large sample sizes can minimize sampling bias, decrease overfitting, and enhance the stability and reproducibility of models. Moreover, we performed more detailed subgroup analyses within the same imaging modality. Only the plaque-type subgroups within the ultrasound modality had sufficient data to perform statistical analysis and obtain summary diagnostic efficacy indicators. The results showed that ultrasound-based models demonstrated excellent and similar performance in detecting the presence of plaques and assessing their stability. Considering the differences in equipment characteristics, patient demographics, and study design, these findings should be interpreted with caution. Nevertheless, these results provide valuable insights into the efficacy of radiomics algorithms and DL models in the diagnosis of carotid plaque.
Analysis of the Main Aspects
This meta-analysis demonstrates that radiomics-based models and DL models can diagnose extracranial carotid plaque, but the advantages of DL models in specificity and SROC AUC should be interpreted with caution. A review of the included studies revealed that, among the 24 investigations using DL models, 20 primarily focused on plaque characterization (11 on the detection of plaques and 9 on plaque stability). Of these, 13 studies used ultrasound imaging to identify plaque-specific features such as echogenicity, morphology, and composition. In contrast, among the 10 studies using radiomics-based ML models, 6 were dedicated to identifying symptomatic plaques, predominantly using MRI (n=2) and CTA (n=3). The accuracy of symptomatic plaque identification was influenced not only by intrinsic imaging characteristics but also by clinical indicators, including plaque rupture, thrombus formation, and the occurrence of cerebral hypoperfusion. These tasks were more complex, and model training seemed to focus on reducing false negatives to lower the risk of adverse outcomes such as stroke. In addition, traditional ML algorithms may rely on manual preprocessing and struggle to capture subtle differences (such as the presence of tiny thrombi or fibrous cap thickness), which may introduce variability and additional costs. In contrast, DL models (particularly convolutional neural networks) do not rely on artificially designed features; instead, they can directly process raw medical images, automatically filter noise, and extract more meaningful image features (eg, slight echo attenuation behind plaques and differences in vascular wall elasticity) [ ]. DL models can also analyze manually extracted features, learn independently, and uncover latent patterns, thereby addressing the aforementioned challenges [ , ]. It is worth noting that the mismatch in the number of studies may also affect the interpretation of the results. Therefore, these differences may not be simply due to model performance but could also be caused by multiple factors, which need to be further investigated.
Besides, the “black box” nature of AI algorithms, particularly DL models, raises concerns about the transparency and reliability of decision-making. Of the 34 studies reviewed, only 2 used explainable DL models, achieving an accuracy of 98.2% [ , ]. The explainable AI (XAI) approach leverages visualization techniques, feature attribution analysis, and both global and local explanations to clarify how models derive predictions from input data. By enhancing transparency, XAI fosters greater trust among medical professionals, strengthens model reliability and accountability, and helps mitigate concerns related to opaque decision-making [ ]. The integration of XAI in medicine not only represents a technological advancement but also ensures safe, efficient, and robust medical decision-making, which needs to be further investigated. To realize this potential, a clinically oriented XAI implementation framework needs to be developed. First, the reporting criteria for interpretable techniques (including clinical applicability evaluation and operational guidelines) should be standardized to lower the threshold for physician use. Second, the design of algorithms should be optimized through collaborative efforts of medical professionals and engineers to improve the specificity of feature attribution methods based on real clinical needs. Further clinical validation studies are needed to evaluate the practical utility of XAI across diverse diagnostic settings—such as varying regions, hospital levels, and clinician experience—and to determine its true value in supporting clinical decision-making beyond algorithmic performance [ ]. Furthermore, incomplete disclosure of model development processes in reports, selective presentation of results by investigators, and heterogeneity in diagnostic standard implementation across practitioners with different levels of experience may decrease the reliability and generalizability of findings. Therefore, we recommend the formulation of standardized imaging protocols, reporting procedures, and quality control measures for carotid plaque assessment and advocate for the establishment of specialized AI reporting guidelines for cardiovascular diseases.
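To illustrate how such visualization-based explanations can be produced, the sketch below computes a Grad-CAM-style class activation heat map for a ResNet-18 classifier like the one in the earlier example; Grad-CAM is one widely used feature attribution technique and is not necessarily the method used in the explainable studies cited above.

```python
# Minimal Grad-CAM-style saliency sketch; Grad-CAM is one common XAI visualization technique,
# used here purely for illustration (not taken from the reviewed studies).
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

activations = {}
model.layer4[-1].register_forward_hook(              # capture the last convolutional feature map
    lambda module, inputs, output: activations.update(feat=output)
)

image = torch.randn(1, 3, 224, 224)                  # stand-in for a preprocessed plaque image
logits = model(image)
score = logits[0, logits.argmax()]                   # score of the predicted class
grads = torch.autograd.grad(score, activations["feat"])[0]

weights = grads.mean(dim=(2, 3), keepdim=True)       # global-average-pooled gradients per channel
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=(224, 224), mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized heat map over the input
print(cam.shape)                                     # torch.Size([1, 1, 224, 224])
```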
Advances in imaging technology have now largely met the diagnostic requirements of current clinical practice, and current guidelines rely heavily on imaging tests for carotid plaque assessment. Among the 34 included studies, 27 constructed diagnostic models based only on imaging data. However, this should not be interpreted as rendering other clinical parameters irrelevant. Multidimensional diagnostic models combining clinical features have been shown to achieve good diagnostic performance in identifying various diseases, such as pancreatic ductal adenocarcinoma [ ], hepatocellular carcinoma recurrence after liver transplantation [ ], hemorrhagic brain metastases [ ], malignant BI-RADS 4 breast masses [ ], and others. In our study, the diagnostic performance of combined models improved only slightly, which may be due to small sample sizes or to features that provided no additional diagnostic information (for example, Hu et al [ ] constructed a model relying only on indirect perivascular adipose tissue radiomic features and clinical features to identify symptomatic plaques, lacking direct imaging features). Considering this evidence, we strongly recommend that future research not only systematically incorporate laboratory tests, medical history, and other clinical parameters to develop multidimensional diagnostic models, but also summarize the most meaningful features for specific types of plaques. This could address the limitations of current studies regarding single imaging modalities and improve the precise classification of carotid plaques and personalized risk assessment.
This meta-analysis identified significant heterogeneity, but meta-regression and subgroup regression analyses did not identify its source, which is primarily attributable to the intrinsic difficulty of controlling all potential confounding factors. Different imaging techniques can affect model performance depending on the type of images used (static images vs dynamic videos), the equipment, and the operators. Guang et al [ ] used a contrast-enhanced ultrasound video-based DL model to evaluate the diagnostic efficacy of a new carotid network structure for assessing carotid plaques, whereas other ultrasound studies consistently used static images. The sequence of MRI scans also influences diagnostic outcomes. Zhang et al [ ] reported that a model incorporating a combination of T1-weighted, T2-weighted, dynamic contrast-enhanced, and postcontrast (POST) MRI sequences achieved a higher AUC for identifying high-risk carotid plaques compared to models using individual sequences or partial combinations. This enhanced performance is attributed to the complementary nature of these imaging sequences, each capturing distinct pathophysiological characteristics of the plaque, thereby improving diagnostic accuracy when used in combination. PRs have limited resolution, detecting only calcified components of carotid plaques and missing features such as lipid-rich necrotic cores and thin or ruptured fibrous caps. There are also notable differences in model architecture. Yoo et al [ ] found performance variations among different convolutional neural network architectures within the CACSNet framework on the same dataset. Gui et al [ ] compared multiple DL models (eg, 3D-DenseNet and 3D-SE-DenseNet) with 9 ML algorithms (including decision tree, random forest, and support vector machine) using identical datasets and found that DL models generally performed better on key metrics such as AUC and accuracy, with significant performance differences between and within the two model types. These findings suggest that scanning parameters, model architectures, image segmentation, and algorithms may explain the heterogeneity in the research results. However, the small number of studies limits our ability to perform comprehensive subgroup analyses, which need to be further investigated.
The use of AI has significantly advanced the diagnosis of carotid plaque; however, its application requires cautious evaluation. Only 9 studies were multicenter (most of which used external validation), and their diagnostic performance was lower than that of single-center studies. Most studies (n=29) had a high risk of bias owing to a lack of open-source data and external validation and a failure to present the rationale for and breakdown of their data set splits, which may lead to overestimation of the research results and affect the reproducibility and generalizability of the findings. Similar issues have been noted in previous reports, highlighting a broader deficiency in rigorous research standards within the field [ - ]. Furthermore, most contingency tables were derived from test sets. Although the test sets yielded the best diagnostic performance, this may reflect higher data quality, a data distribution similar to that of the training data, or overfitting to noise, resulting in overly optimistic performance estimates; conversely, strong regularization may reduce test performance, ultimately undermining clinical confidence in these models.
This study has certain clinical significance. We conducted an in-depth literature review and methodological quality evaluation, presenting the most current and comprehensive systematic review of AI-based diagnostic approaches for assessing carotid plaque. The findings reveal that AI technology shows considerable potential for diagnosing carotid plaque, but the findings need to be further validated by conducting more rigorous external validation using large-scale, high-quality independent datasets.
Limitations
This study has several limitations. First, the heterogeneity in model architectures and validation methods across studies prevents definitive conclusions regarding the most effective AI approaches. Second, many studies lack multicenter external validation, leading to a high risk of bias; model overfitting and clinical applicability therefore need to be carefully evaluated. Third, meta-regression and subgroup analyses did not identify the sources of the high heterogeneity present in most of the included studies. We hypothesize that this heterogeneity may be caused by scanning parameters, model architectures, image segmentation, and algorithms; however, the scattered distribution of subgroups due to the limited number of studies restricted more in-depth subgroup analyses. Finally, although the Deeks test did not show significant publication bias, negative results may have gone unreported in the included studies, and potentially relevant non-English literature was not included.
Future studies should use a more comprehensive analytical methodology based on the current model. Researchers should strictly follow regulatory norms and standardized operating procedures. Prospective and multicenter studies and additional external validation are warranted to enhance the robustness and generalizability of the existing models. In the future, researchers should perform independent systematic reviews on specific subtopics—such as imaging modalities, lesion types, or model architectures—to facilitate targeted evaluations of AI performance across distinct clinical scenarios. In addition, studies on imaging modalities such as CT and MRI are advocated to generate more data, conduct subgroup analyses, and clarify the optimal matching of modality, plaque type, and algorithm. Future efforts should focus on identifying more meaningful features and building and evaluating the diagnostic performance of multidimensional diagnostic models. In parallel, establishing clinically oriented XAI frameworks will be essential for enhancing transparency.
Conclusions
Current findings indicate that radiomics algorithms and DL models can effectively diagnose extracranial carotid plaque. However, the irregularities in research design and the lack of multicenter studies and external validation limit the robustness of the present findings. Future research should aim to reduce bias risk and enhance the generalizability and clinical orientation of the models.