The promise of Alzheimer’s disease (AD) biomarkers has led to their incorporation in new diagnostic criteria and in therapeutic trials; however, significant barriers exist to widespread use. Chief among these is the lack of internationally accepted standards for quantitative metrics. Hippocampal volumetry is the most widely studied quantitative magnetic resonance imaging (MRI) measure in AD and thus represents the most rational target for an initial effort at standardization.
Methods and Results
The authors of this position paper propose a path toward this goal. The steps include: 1) Establish and empower an oversight board to manage and assess the effort, 2) Adopt the standardized definition of anatomic hippocampal boundaries on MRI arising from the EADC-ADNI hippocampal harmonization effort as a Reference Standard, 3) Establish a scientifically appropriate, publicly available Reference Standard Dataset based on manual delineation of the hippocampus in an appropriate sample of subjects (ADNI), and 4) Define minimum technical and prognostic performance metrics for validation of new measurement techniques using the Reference Standard Dataset as a benchmark.
Although manual delineation of the hippocampus is the best available reference standard, practical application of hippocampal volumetry will require automated methods. Our intent is to establish a mechanism for credentialing automated software applications to achieve internationally recognized accuracy and prognostic performance standards that lead to the systematic evaluation and then widespread acceptance and use of hippocampal volumetry. The standardization and assay validation process outlined for hippocampal volumetry is envisioned as a template that could be applied to other imaging biomarkers.
- Although a position paper is a first step, the objective of standardizing hippocampal volumetry as an AD biomarker will require active participation by stakeholders in academia and industry. The authors’ objective is to see hippocampal volumetry evolve from its current state, a measure that is valid only in specific studies or a single institution, to a universally accepted biomarker with standardized units of measure. In some cases, this could simply involve having developers of automated measurement tools directly import the EADC-ADNI anatomic definition of the hippocampal boundaries into the atlas of the automated application.
- Standardizing single time point hippocampal volume as an AD biomarker is the most logical and readily achievable initial goal; however, the authors recognize that other more complex topographic structural MRI measures might be more specific, or ultimately more powerful. The major difficulty here is identifying an appropriate reference standard if an anatomically based classifier does not conform to the boundaries of a classically defined anatomic structure as the hippocampus does.
- Longitudinal change measures on structural MRI should be standardized using the approach outlined above as a template. This could include an extension of the EADC-ADNI effort to include expert manual tracing of serial hippocampi to create a longitudinal reference standard dataset using the same model as the single time point dataset proposed in this position paper.
- FDG PET, amyloid PET imaging, and possibly other MRI modalities (e.g., resting state functional connectivity, diffusion tensor imaging, and arterial spin labeled perfusion imaging) are also important imaging biomarkers for AD. Pursuing standardized quantitative metrics for these imaging modalities is a high priority. The efforts to standardize, validate and evaluate quantitative measures in these modalities could roughly follow the same approach outlined above for hippocampal volume.
- For all imaging biomarkers, future efforts will need to focus on developing a quantitative score that allows individual imaging biomarker measures to be assessed against well-developed norms that incorporate appropriate covariates, such as age, sex, and head size in the case of the hippocampus (91, 92).
- To optimize the use of biomarkers in new AD diagnostic criteria, future efforts will need to focus on establishing diagnostic cut points in the continuous range of quantitative values to identify normal, abnormal, and indeterminate levels in individual subjects. For use in clinical practice, quantitative metrics will need to be developed and then tested in clinically typical and representative populations. Diagnostic biomarkers in AD should function analogously to those in other diseases where, for example, cut points in the continuous range of blood pressure and fasting serum glucose are universally recognized as useful in aiding the diagnosis of hypertension and diabetes and standardized treatment protocols are based on these biomarker cut points. For the purposes of diagnosis in typical clinical settings, cut points should be derived from carefully characterized groups of subjects chosen in such a way that the results can be generalized to the overall population. For example, ADNI subjects were selected to represent a typical AD clinical trial, with specific inclusion/exclusion criteria. Thus the results from ADNI are not generalizable to the overall population and are not optimal to generate normative data for general diagnostic purposes. Selecting meaningful diagnostic cut points is complicated by the fact that many cognitively normal elderly subjects harbor significant AD pathology. Thus the definition of normal is not straightforward. Consensus guidelines have been established for evaluating and reporting the clinical utility of diagnostic biomarkers and should be followed in studies using the results of the assay validation steps described here. In clinical settings, the sensitivity of detecting AD should exceed 80% and specificity for distinguishing AD from other similar dementias also should exceed 80% (94).
Standardized reporting of results should follow STARD criteria (95) and for clinical settings additional reporting criteria to demonstrate pragmatic utility are needed (96).
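The cut-point and performance-target logic in the final item above can be illustrated with a minimal sketch. It assumes that each quantitative value has already been adjusted against norms (here a hypothetical score in which lower means more abnormal), that two cut points separate the abnormal, indeterminate, and normal ranges, and that a single diagnostic threshold is judged against the 80% sensitivity and specificity targets. The function names, cut-point values, and toy scores are illustrative assumptions and are not values proposed in this paper.

```python
"""Sketch: three-way cut points and an 80% sensitivity/specificity check.

All names, cut points, and scores below are hypothetical placeholders.
Scores are assumed to be norm-adjusted so that lower means more abnormal.
"""


def classify(score, abnormal_below, normal_at_or_above):
    """Map a continuous score to 'abnormal', 'indeterminate', or 'normal'."""
    if abnormal_below > normal_at_or_above:
        raise ValueError("abnormal_below must not exceed normal_at_or_above")
    if score < abnormal_below:
        return "abnormal"
    if score >= normal_at_or_above:
        return "normal"
    return "indeterminate"


def sensitivity_specificity(scores, is_ad, threshold):
    """Call a case positive when score < threshold; compare with diagnosis.

    Assumes at least one AD case and one non-AD case are present.
    """
    pairs = list(zip(scores, is_ad))
    tp = sum(1 for s, d in pairs if d and s < threshold)
    fn = sum(1 for s, d in pairs if d and s >= threshold)
    tn = sum(1 for s, d in pairs if not d and s >= threshold)
    fp = sum(1 for s, d in pairs if not d and s < threshold)
    return tp / (tp + fn), tn / (tn + fp)


def meets_targets(sensitivity, specificity, target=0.80):
    """Both values must exceed the target (80% in the text)."""
    return sensitivity > target and specificity > target


# Toy data (made up): six AD cases followed by six cases of other dementias.
scores = [-2.4, -2.0, -1.9, -1.6, -1.3, -0.2, -1.1, -0.6, -0.3, 0.1, 0.4, 0.8]
is_ad = [True] * 6 + [False] * 6

sens, spec = sensitivity_specificity(scores, is_ad, threshold=-1.0)
print(classify(-1.8, -1.5, -0.5), classify(-1.0, -1.5, -0.5))
print(round(sens, 3), round(spec, 3), meets_targets(sens, spec))
```

Keeping the three-way classification separate from the sensitivity and specificity check reflects the distinction drawn in the text: cut points label individual subjects, whereas the 80% targets describe how a method performs across a representative group.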
A biomarker is a physiological, biochemical, or anatomic parameter that can be objectively measured as an indicator of normal biologic processes, pathological processes, or responses to a therapeutic intervention (1). Biomarkers used in the Alzheimer’s disease (AD) field include both imaging measures and biofluid analytes. Biofluid analytes in this context can refer to proteins in any biofluid; however, cerebrospinal fluid (CSF) biomarkers are presently the most well developed (2). The five most widely studied biomarkers in AD can be divided into two major categories: 1) Biomarkers of cerebral Aβ amyloid accumulation - these are increased radiotracer retention on amyloid-tracer based positron emission tomography (PET) imaging and low CSF Aβ 1-42, and 2) Biomarkers of neuronal degeneration or injury - these are elevated CSF tau (both total and phosphorylated tau); decreased fluorodeoxyglucose (FDG) uptake on PET in the temporo-parietal cortex; and brain atrophy in the medial, basal and lateral temporal lobes and the medial and lateral parietal cortices determined from structural magnetic resonance imaging (MRI) or computed tomography (CT) (3). Three of these five major AD biomarkers are imaging measures and imaging is the primary focus of this position paper. Biomarkers are increasingly important in AD in two contexts: clinical diagnosis/prognosis and therapeutic trials.
Criteria for the clinical diagnosis of AD were established in 1984 (4). These criteria have been widely adopted, validated against neuropathological examination in many studies, and are still used today. A consensus now exists, however, that diagnostic criteria for AD should be updated to reflect the scientific advances of the past quarter of a century. One of the most important of these advances is the development of biomarkers for AD. This recognition has inspired recent efforts on several fronts to revise diagnostic criteria for AD. The two most well-known such efforts are those of Dubois et al (5, 6) and the National Institute on Aging (NIA)-Alzheimer’s Association (AA) (7-10). The NIA-AA commissioned three work groups to revise diagnostic criteria. Each was assigned the task of defining or revising criteria for one of three recognized phases of the disease: pre-clinical or asymptomatic AD, symptomatic pre-dementia or mild cognitive impairment (MCI), and the AD dementia phase (7-10). Biomarkers providing evidence of in situ AD pathophysiology are employed in the revised definitions of AD in all three phases of the disease by the NIA-AA and are also included in the criteria of Dubois et al (5, 6).
The second major use for biomarkers of AD is in clinical trials, where biomarkers can be employed for several distinct purposes. As an indicator of AD pathophysiological processes, AD biomarkers may be used for subject inclusion/exclusion – to ensure study subjects are appropriate for targeting of the therapeutic mechanism of action or as an enrichment strategy to improve efficiency of therapeutic trials (2, 11). Biomarkers also provide a biologically-based measure of disease severity. They can be used as a covariate in outcome analyses and as safety measures. Finally, an important application of AD biomarkers in clinical trials is as outcome measures, in which an effect on the biomarker is sought as evidence of modification of the underlying pathological AD process (12-21). However, since AD pathophysiology is increasingly being recognized to be very complex and multifaceted, effects of candidate drugs on some individual pathophysiological aspects of AD may not necessarily be of functional or cognitive relevance. Therefore, increasing efforts are being spent on developing biomarkers which could serve as surrogate endpoints in clinical trials, accurately predicting and reflecting clinically significant outcomes (2, 22). Biomarkers are more objective and reliable quantitative measures of AD pathophysiological processes than traditional cognitive and functional outcomes that are affected by subject motivation and extrinsic factors such as alertness, environmental stresses, and informant mood and distress.
The evaluation of the value of biomarkers is different for therapeutic trials than for clinical diagnosis, but the rationale and methods to standardize and validate the reliability of the measures are very similar. Moreover, if an imaging biomarker is used as an inclusion criterion for subjects participating in a clinical trial of a compound that subsequently achieves regulatory approval, then it is possible, some would say likely, that regulators will require that the same biomarker be approved as a diagnostic to identify patients who are suitable for treatment. This would then require that the biomarker, in our case imaging, be easily implementable in clinical imaging facilities world-wide. Therefore, although requirements in terms of precision and sensitivity to pathology may vary, issues pertaining to standardization of an imaging biomarker for use in clinical trials and for clinical diagnostics are inextricably interwoven.
The potential value of quantitative imaging biomarkers for both clinical diagnosis and clinical trials is clear, but major barriers exist to widespread acceptance and implementation. The most substantive barriers have been the lack of standardized methods for 1) image acquisition, 2) extraction of quantitative information from images, and 3) linking quantitative metrics to internationally recognized performance criteria. These in turn have impeded the establishment of cut points in the continuous range of quantitative values that can be used in diagnosis and evaluating change in clinical trials. Standardization of image acquisition for structural MRI and PET scans has been a major focus of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) project (23, 24) and ADNI acquisition protocols have become the de facto standard for clinical trials and could be applied clinically. On the other hand, little progress has been made in the standardization of techniques for quantitative image analysis, either in ADNI or in the field in general. This is particularly true for MRI where the lack of standardization has led to publication of values that are highly disparate across the literature. For example, greater than two-fold differences in hippocampal volume of cognitively normal elderly subjects have been reported from different centers (25). This is unlikely to have a basis in biology and is almost certainly due to inter-center differences in the measurement tools and the anatomical protocols for delineating the hippocampus. Likewise, a strong methodological dependence is evident in published rates of hippocampal atrophy. Three-fold differences in rates of hippocampal atrophy have been reported in elderly controls as well as wide variations in apparently similar cohorts of AD patients (26). 
For example, annualized rates of hippocampal atrophy in healthy elderly controls have been reported as 0.8%/yr by Du et al (27) (mean age 77), 1.4%/yr by Jack et al (28) (age 78), and 2.3%/yr by Wang et al (29) (mean age 73). This strong dependence upon the method used and its specific implementation undermines the credibility of the results. Both newly proposed diagnostic criteria explicitly point out that extensive work on imaging biomarker standardization is needed prior to widespread adoption for diagnostic purposes.
2. Why hippocampal volume?
Qualification or general acceptance of the validity of a biomarker in clinical trials must rest on a well-established body of evidence beginning with widespread agreement that there is clinical significance to the result of the biomarker and that it can be measured with appropriate accuracy and reproducibility. Quantitative measurement of hippocampal volume fulfills these basic criteria. The advantages of hippocampal volume as a target for an initial standardization and assay validation exercise are: 1) The hippocampus is an anatomically defined structure with boundaries that are visually definable in a properly acquired MRI scan. 2) The hippocampus is involved early and progressively with neuronal loss and neurofibrillary tangles, which is one of the primary hallmarks of AD pathology (30). 3) A large imaging and pathology literature provides evidence that loss of hippocampal volume is significant in AD. Numerous studies have shown the association of hippocampal atrophy with neurodegenerative pathology at autopsy (31-36), with clinical diagnoses of AD or MCI (37-43), and with the severity of cognitive disorders and episodic memory deficits due to AD pathophysiology (44, 45). In addition, longitudinal measures of change in hippocampal volume both predict future cognitive decline and correlate with contemporary indices of clinical decline (46, 47), and quantitative measures of the hippocampus predict progression from MCI to AD (48-63). 4) Fully automated software tools are now available that can measure hippocampal volume efficiently and reproducibly (21, 37, 58, 64-71). Visual rating (72-74), while convenient and currently used in some diagnostic settings, does not lend itself to detecting subtle size differences, lacks precision relative to quantitative methods, and does not take advantage of the power of current technology. Formal computer-aided manual tracing of the entire hippocampus was introduced over two decades ago to aid in seizure lateralization (75).
Although manual hippocampal tracing has been effective for research studies in different diseases, and still serves as the best available Reference Standard measure of the hippocampus on MRI (76), it is time consuming and requires highly trained operators. Thus it is not feasible in routine clinical practice and due to its expense it is impractical in clinical trials. Fully-automated hippocampal volumetry using standardized methods would be a practical alternative to manual methods. Automated hippocampal volumetry has successfully enabled the discovery of novel genes associated with hippocampal volume in over 7000 subjects scanned at multiple internationally distributed sites. This result supports the assertion that such methods can be efficiently and reproducibly applied on a worldwide scale (77). Furthermore, software methods that employ within-subject registration permit sensitive measures of volume change over time (51, 78). 5) While more complex MRI measures of disease-related atrophy consisting of combinations of multiple regions of interest (ROI) might have superior diagnostic properties compared to hippocampal volume (79-84), the analysis of hippocampal volume is less complex than multi-ROI approaches so a reference standard is easier to generate. Specifically, the hippocampus can be delineated by hand, but the disease signatures of more complex analytic methods are a result of training and machine learning methods that would present a further challenge to validate, and are likely to evolve over time.
Further supporting hippocampal volumetry as a target for initial AD imaging biomarker standardization and assay validation is the fact that clinical guidelines in many countries (85, 86) dictate that all patients investigated for cognitive impairment should undergo structural brain imaging to exclude treatable causes such as tumors and hematoma. An MRI acquisition sequence that would permit quantitative analysis of hippocampal volume is easy to include in a routine clinical MRI examination, only lengthens the exam by a few minutes, and is currently considered to be an essential part of a clinically diagnostic imaging protocol at some centers. Moreover, a significant effort has already been expended to standardize acquisition parameters for the high resolution 3D anatomical MR imaging sequence needed for quantitative volume measures across MRI vendors in the ADNI study (23). The ADNI 3D T1 anatomical sequence used for volumetric measurements can be performed in a standardized manner in an overwhelming majority of imaging centers worldwide. Finally, there is an ongoing international initiative led by one of the co-authors (GBF) to establish a Reference Standard in hand-drawn hippocampal volumes, which is the European Alzheimer’s Disease Centers (EADC) – ADNI Hippocampal Harmonization Effort (87, 88).
The issue of validating imaging biomarkers for AD has recently drawn the attention of non-profit organizations, including the Radiological Society of North America (RSNA) and the Coalition Against Major Diseases (CAMD). CAMD is part of the Critical Path Institute, a nonprofit public-private partnership dedicated to more efficient drug development. Qualification of hippocampal atrophy for use in clinical trial enrichment is being pursued by CAMD with the US Food and Drug Administration (FDA) and European Medicines Agency (EMA). At a meeting of the Radiological Society of North America Quantitative Imaging Biomarkers consortium in September 2010, a work group was convened to address the issue of standardizing quantitative imaging of AD. Among the candidate imaging modalities discussed, measures of hippocampal volume on structural MRI were identified as the most widely used in the context of multicenter clinical trials, and therefore were the most obvious candidates for an initial (exemplar) effort to standardize quantitative imaging biomarkers. This position paper follows from the recommendations of this RSNA work group.
3. Biomarker development
In general terms, three separate steps are required for biomarker development: 1) Assay validation (also called technical or analytical performance validity) to show that, when following defined standardized procedures, the biomarker can be measured precisely and accurately compared to a reference standard (89), 2) Clinical Validation to establish that the biomarker has value for a specific intended task and context of use, and 3) Qualification of the biomarker with the appropriate regulatory agencies based upon wide-spread consensus that the biomarker is “fit for purpose” for a particular use. Each proposed task (e.g., diagnostic, prognostic, outcome) needs to be considered separately. Qualification of a biomarker for clinical trials may be a stepping stone to a qualification for its use as a clinical diagnostic. However, the use of a biomarker in clinical diagnosis is distinct from its use in therapeutic trials, and development may focus on one or the other first. The use of a biomarker in clinical trials is at the discretion of the trial sponsor, but mechanisms have been introduced by which regulatory bodies (e.g., the US Food and Drug Administration Center for Drug Evaluation and Research, FDA CDER; or the European Medicines Agency EMA) qualify biomarkers for use in clinical trials. The use of a biomarker for clinical diagnosis requires regulatory approval in the relevant jurisdiction (e.g., approval by FDA Center for Devices and Radiological Health, CDRH, in the USA; or CE marking in Europe), and may separately also require approval from healthcare funders for reimbursement.
4. Steps to standardization and validation of hippocampal volumetry as a biomarker of AD
Below we outline the steps of a proposed work plan that would lead to standardization of quantitative (automated or manual) hippocampal volumetry as a biomarker for AD in evaluative studies in the context of clinical trials and for diagnosis.
- Establish an Oversight Board to manage the effort and empower this body with authority to make decisions necessary to assess the results as outlined below. The Oversight Board should have the following attributes: a) include all necessary areas of expertise, b) be unbiased, c) represent both academia and industry, and d) be international. All potential conflicts of interest must be fully disclosed. Our recommendation is that this oversight board be linked to the Alzheimer’s Association.
- Identify a standardized definition of anatomic hippocampal boundaries on MRI with the assistance of expert neuroanatomists for use as a Reference Standard. Anatomic boundary criteria should be acceptable to the international scientific community and consistent with use in all neuroscience disciplines. We recognize that for hippocampal volume measures to be widely used diagnostically in clinical practice and in clinical trials, automated techniques are essential. However, manual tracing of the hippocampus using a consensus-from-experts approach in accordance with a standardized definition provides the most effective Reference Standard to evaluate automated methods. Expert opinion is an accepted method to create a reference standard. This is preferable to the alternative, arbitrarily picking one automated method and anointing it as the Reference Standard, which would be problematic. Because most, if not all, automated techniques rely on some a priori anatomical notion of hippocampal boundaries, such an arbitrary approach would not reflect a consensus from the scientific community as a whole and would not result in a Reference Standard with broad-based support from all stakeholders. Since an international effort is currently in place with precisely this aim, leveraging the work of the EADC-ADNI Hippocampal Harmonization effort (87, 88) is the most logical and practical approach. The Reference Standard recommended by the authors of this position paper is therefore the set of manual hippocampal tracings of ADNI subjects that will be developed by the EADC-ADNI effort.
- Establish a Reference Standard Dataset based on manual delineation of the hippocampus in accordance with the standardized definition. The Reference Standard Dataset should have the following attributes:
- All subjects in the reference database must have given informed consent for public access under an ethics board-approved protocol. The data must comply with the privacy legislation of the jurisdiction where they were collected, and permission of a research ethics committee for use of the data should be obtained. In the US, the relevant guidelines are those of the Health Insurance Portability and Accountability Act (HIPAA); however, other jurisdictions will have different regulations.
- Access to the database must be straightforward, open, and readily available.
- Appropriate subjects, in clinical characteristics and number, must be included in the reference database – in this case, elderly cognitively normal control, MCI and AD subjects diagnosed according to internationally recognized diagnostic criteria.
- MRI scans must have been acquired with a standardized protocol that is amenable to widespread use.
- Appropriate clinical meta-data must be linked to the MRI scans and readily available to users – i.e., demographics, clinical diagnosis, basic neuropsychology, and longitudinal clinical course. The subjects, 3D volume T1-weighted images, and clinical data of ADNI represent a data set that meets these criteria. The authors recommend that the EADC-ADNI harmonization traces or masks of the 1.5T ADNI MPRAGE data serve as the hippocampal volume Reference Standard Dataset.
- Extend the Reference Standard Dataset to enable a thorough evaluation of the effects of technical aspects of MR acquisition on measurement performance. This includes the effects of MR vendor, receiver coil type, accelerated acquisition methods, and field strength. Although the EADC-ADNI harmonization plan focuses on 1.5T data, a significant portion of neuroimaging in the future will be performed at 3T, with acquisition acceleration, and with increasingly complex coil arrays. The potential effects of these technical advances on measurement standardization should be investigated (90).
- Split the complete sample of traced hippocampi into balanced training and test data sets for assessing the technical performance characteristics of new analysis methods. This would enable automated methods to be trained on a portion of the reference data and then tested against an independent subset of the reference data. Careful attention to the composition of these subsets is important so that age, gender or clinical variables are not inadvertently unbalanced.
- Develop standards for reporting measurement units including a standardized approach for normalization of raw hippocampal volume measures. This will include defining correct measures of head size through standardization of intracranial volume measures. In addition to disease severity, hippocampal volume is affected by other variables that are easily ascertained such as age, sex, and head size (taller people tend to have larger brains and thus larger intracranial volume) (91). Experience indicates that normalization of raw hippocampal volumes for these descriptive or confounding variables improves the performance of hippocampal volumetry in evaluation studies, and thus recommendations for standardized normalization procedures for adjusting raw hippocampal volumes (e.g., by head size, age, sex) in the reference data set will be necessary.
- Define minimum technical performance metrics as benchmarks to judge new analysis methods (89). At a minimum these metrics should include:
- Accuracy with respect to the manually traced Reference Standard Dataset. We note that automated techniques will likely not precisely match a manually traced Reference Standard. However, a straightforward mathematical transformation of the output of an accurate automated algorithm to match the reference standard should be possible. Criteria would need to be set as to how closely the automated method would have to match the manual tracing in order for it to be credentialed by the oversight board.
- Test/re-test precision. This would include not just numeric precision at the volume level, but also more exacting indices of area/pixel overlap such as Dice coefficients.
- Compliance with regulatory requirements (Good Clinical Practice (GCP), FDA 21 CFR part 11, EU GMP Annex 11 on Computerized Systems) for any computer systems running these algorithms.
- Define minimum prognostic performance metrics for new analysis methods based upon benchmarks established from Reference Standard Dataset: We recommend metrics that predict conversion from MCI to AD within 24 months, progression of dementia severity at 24 months in patients with AD, and maintenance of normal cognition at 24 months in cognitively normal subjects (sensitivity, specificity, positive and negative predictive value, ROC analysis). This will serve as further assay validation for new analysis methods.
- Empower the oversight board to oversee credentialing of applications for analysis methods. While the Reference Standard Dataset can be used to credential new manual tracers, its primary use is envisioned as a means of validating and credentialing automated hippocampal quantification methods for use in therapeutic trials and for new clinical diagnostic criteria. The board could also make context of use recommendations based on limitations identified during the evaluation of a particular method. In order for a potential hippocampal volume measurement application to be credentialed by the oversight board it would have to meet established technical and prognostic performance benchmarks using the reference data set described above.
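Three of the technical steps listed above lend themselves to a short illustration: splitting the traced sample into balanced training and test sets, measuring spatial overlap with a Dice coefficient, and fitting a simple transformation from automated to manually traced volumes. The sketch below is written under stated assumptions only. The stratification variables, the 50/50 split, the representation of a segmentation as a set of voxel coordinates, the choice of a least-squares linear fit as the "straightforward mathematical transformation", and all toy values are hypothetical and are not specified in this paper.

```python
"""Sketch: balanced split, Dice overlap, and linear calibration.

Subject records, voxel sets, and volumes are hypothetical placeholders.
"""
import random
from collections import defaultdict


def stratified_split(subjects, key, test_fraction=0.5, seed=0):
    """Divide each stratum (e.g. diagnosis and sex) between training and
    test sets in the same proportion, so the subsets stay balanced."""
    strata = defaultdict(list)
    for subject in subjects:
        strata[key(subject)].append(subject)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for stratum in sorted(strata):
        members = strata[stratum][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def dice(voxels_a, voxels_b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two sets of voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    if not a and not b:
        return 1.0  # two empty segmentations agree trivially
    return 2.0 * len(a & b) / (len(a) + len(b))


def fit_linear(automated, reference):
    """Least-squares slope and intercept mapping automated volumes onto
    the manually traced reference volumes."""
    n = len(automated)
    mean_x = sum(automated) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in automated)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(automated, reference))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy cohort: four subjects in each diagnosis-by-sex stratum (made up).
subjects = []
for dx in ("CN", "MCI", "AD"):
    for sex in ("F", "M"):
        for _ in range(4):
            subjects.append({"id": len(subjects), "dx": dx, "sex": sex})

train, test = stratified_split(subjects, key=lambda s: (s["dx"], s["sex"]))

manual = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
automatic = {(0, 0, 1), (0, 1, 0), (1, 1, 1)}
overlap = dice(manual, automatic)

# Toy volumes in cm3: the automated method overestimates systematically.
slope, intercept = fit_linear([2.0, 3.0, 4.0], [1.8, 2.6, 3.4])
print(len(train), len(test), round(overlap, 3),
      round(slope, 3), round(intercept, 3))
```

In practice the split would also need to balance continuous variables such as age, for example by binning, and the acceptable Dice values and calibration residuals would be set by the oversight board rather than chosen by the method developer.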
Ideally, the work plan would follow the timeline above where initial steps would focus on establishing the reference standard of manual hippocampus traces, generating a standardized approach to volume normalization and benchmark performance metrics. Once the reference standard is established, then the focus likely would be on evaluation studies and qualifying the reference standard with the FDA and EMA for diagnostic, prognostic and outcome use in clinical trials. Standardized acquisition of MRI scans suitable for hippocampal volumetry is already widely performed and support from the pharmaceutical industry is likely. Subsequently, we expect evaluation studies will be conducted to show the diagnostic value of hippocampal volumetry outside the context of clinical trials. We wish to emphasize that the intent of this position paper is not to stifle existing alternative methods or innovative development of new methods, but rather to facilitate the development of widely available implementations of automated hippocampal volumetry methods, and to serve as a template for an initial effort which can then be used for other imaging biomarkers.
As an example illustrating the approach discussed above we identified 373 ADNI subjects diagnosed as MCI at baseline who qualified for an analysis of time to progression to AD. Of the 397 ADNI subjects diagnosed as MCI at baseline, 16 had no follow-up visits, and 8 failed quality control, leaving 373 for this analysis (Table 1). A list of the ADNI subject ID numbers used in the example MCI analyses is included as a Supplement. All subjects had hippocampal volume measured in three ways, labeled Methods A, B and C here. In this exercise, we considered Method A to represent the Reference Standard Dataset, and assessed Methods B and C in two ways: technical performance accuracy relative to the Reference Standard Dataset and prognostic performance in predicting conversion from MCI to AD at 2 years post baseline. While the data presented below are real, and not hypothetical, the specific methods are left undefined because we do not wish to have this position paper misconstrued as evidence that the authors endorse a particular method for credentialing.
Of the 373 patients, 166 progressed from MCI to AD during follow-up and 8 progressed to non-AD dementia based upon clinical criteria. We also examined a subset of 313 subjects who either progressed to AD at or prior to the 24 month visit (n=135) or had available follow-up through the 24 month visit without progressing to AD (n=178) to evaluate differences in hippocampal volume between those who had progressed by 24 months and those who remained stable. Subjects who progressed to non-AD dementia at or before 24 months were excluded from this analysis.
Method B potentially meets two major criteria for credentialing – it is highly accurate in the group-wise and individual measurement of hippocampal volume relative to Method A as shown in the table and scatter plots, and it also has essentially identical performance in predicting conversion from MCI to AD (Fig. 1, Table 2). Method C has a similar prognostic performance in predicting conversion to AD as Method A as shown in the ROC analysis, but in its current form might not meet technical accuracy criteria relative to the reference standard dataset. This is how we would envision the credentialing process would proceed for most automated applications, with the EADC-ADNI harmonization data set of manually traced hippocampi serving as the Reference Standard Dataset and the oversight committee setting predetermined minimal benchmark criteria to judge the performance of individual methods.
Scatterplots of hippocampal volume (cm³) by method. Spearman correlations and p-values are shown for each pair.
ROC Curves Comparing Prognostic Performance of Methods A, B, and C for Progression from MCI to AD within two years
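As a minimal sketch of how the prognostic comparison could be computed, the code below estimates the area under the ROC curve for progression from MCI to AD, assuming that smaller hippocampal volume indicates higher risk. The volumes are made-up toy values, not ADNI data, and the function names are hypothetical. The second "method" is simply the first scaled by a constant, chosen to illustrate how a method that is biased in absolute volume can still rank subjects identically and therefore match the reference on prognostic performance while falling short on technical accuracy.

```python
"""Sketch: ROC area for predicting MCI-to-AD progression from volume.

Volumes are toy values in cm3 (not ADNI data); smaller volume is assumed
to indicate higher risk of progression.
"""


def auc_smaller_is_riskier(progressor_volumes, stable_volumes):
    """Probability that a randomly chosen progressor has a smaller volume
    than a randomly chosen stable subject, counting ties as one half.
    This pairwise estimate equals the area under the empirical ROC curve."""
    wins = 0.0
    for p in progressor_volumes:
        for s in stable_volumes:
            if p < s:
                wins += 1.0
            elif p == s:
                wins += 0.5
    return wins / (len(progressor_volumes) * len(stable_volumes))


def roc_points(progressor_volumes, stable_volumes):
    """(false positive rate, true positive rate) pairs obtained by calling
    a subject positive when volume <= threshold, for each observed value."""
    thresholds = sorted(set(progressor_volumes) | set(stable_volumes))
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(v <= t for v in progressor_volumes) / len(progressor_volumes)
        fpr = sum(v <= t for v in stable_volumes) / len(stable_volumes)
        points.append((fpr, tpr))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Reference-like method: toy volumes for progressors and stable subjects.
progressors_a = [2.0, 2.5, 3.0]
stable_a = [2.8, 3.2, 3.5]

# Second method: same subjects, volumes uniformly 10% larger (biased).
progressors_c = [v * 1.1 for v in progressors_a]
stable_c = [v * 1.1 for v in stable_a]

auc_a = auc_smaller_is_riskier(progressors_a, stable_a)
auc_c = auc_smaller_is_riskier(progressors_c, stable_c)
area_a = trapezoid_area(roc_points(progressors_a, stable_a))
print(round(auc_a, 3), round(auc_c, 3), round(area_a, 3))
```

Because the ROC area depends only on the ranking of subjects, it cannot by itself detect a systematic offset in measured volume, which is why technical accuracy and prognostic performance are assessed separately.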
One important feature of the process for critically evaluating automated hippocampal segmentation algorithms is the failure rate. For a variety of reasons, usually related to poor scan quality, automated algorithms will fail to produce a plausible result in some proportion of cases in a study. Taken to the extreme, imagine, for example, a method that produced perfect predictive results in cases that underwent successful hippocampal segmentation, but that failed in 99% of cases. The method would score quite well on prognostic metrics, but would not be practical. A fair and objective approach therefore is needed to penalize automated segmentation algorithms that fail in an unacceptably high proportion of cases.
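One possible way to make such a penalty concrete is sketched below. It is an assumption for illustration, not a rule proposed in this paper: every failed segmentation is counted as a wrong prediction, and the failure rate is reported alongside a hypothetical maximum acceptable value. The function name, the dictionary keys, and the 5% limit are placeholders.

```python
"""Sketch: penalizing segmentation failures when scoring a method.

The scoring rule (count each failed case as a wrong prediction) and the
maximum failure rate are assumptions for illustration only.
"""


def score_method(n_correct, n_incorrect, n_failed, max_failure_rate=0.05):
    """Summarize prognostic accuracy with and without failed cases.

    n_correct / n_incorrect: predictions on successfully segmented cases.
    n_failed: cases in which no plausible segmentation was produced.
    max_failure_rate: hypothetical limit; a real value would be set by
    the oversight board.
    """
    n_success = n_correct + n_incorrect
    n_total = n_success + n_failed
    if n_total == 0:
        raise ValueError("no cases were submitted")
    failure_rate = n_failed / n_total
    return {
        "failure_rate": failure_rate,
        # Accuracy restricted to successful cases ignores failures entirely.
        "accuracy_successes_only": (n_correct / n_success
                                    if n_success else None),
        # Penalized accuracy treats every failure as a wrong prediction.
        "accuracy_all_cases": n_correct / n_total,
        "within_failure_limit": failure_rate <= max_failure_rate,
    }


# The extreme case from the text: perfect on successes, 99% failures.
extreme = score_method(n_correct=1, n_incorrect=0, n_failed=99)
print(extreme)
```

Under this convention the extreme method described above keeps a perfect score on successful cases but a penalized score of only 1%, which makes its impracticality visible in a single number.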