Electronic Health Records
Improving Real-World Mortality Data Quality in Oncology Research: Augmenting Electronic Medical Records With Obituary, Social Security Death Index, and Commercial Claims Data
2Tulane University, New Orleans, LA
3Kennesaw State University, Kennesaw, GA
This study evaluated the relative improvements in mortality data capture of adding different external data to enriched electronic medical records (EMRs) for patients with melanoma.
An enriched EMR database, containing structured and unstructured data, was used to evaluate the incremental mortality data capture of the following external data sources: Social Security Administration (SSA), public obituary, and an administrative open-claims database for the claims data set. Overall survival (OS) was assessed for each data set and the composite data set using the Kaplan-Meier method.
A total of 3,882 patients were included in the study. The enriched EMR data set identified 1,085 patients with a death record. The SSA data set identified 213 patients (52 unique when combined with enriched EMR) with a death record, while the obituary data set identified 1,127 patients (241 unique). The administrative claims data set identified 378 patients (73 unique) with a death record; however, all these unique patients were already accounted for in the combined SSA and obituary data set. The composite data set yielded a median OS of 13.39 years, about 4 years shorter than the enriched EMR data set alone (17.63 years).
The 21st Century Cures Act, passed in 2016, requires the US Food and Drug Administration (FDA) to focus on using real-world evidence (RWE) derived from real-world data (RWD) in the regulatory approval process for new drugs and medical devices.1 RWD are produced outside of clinical trials, during routine clinical care or from other sources such as personal devices.2 In 2021, the FDA issued guidance that further outlined the appropriate use of RWD for conducting regulatory studies.3
To evaluate the capture of mortality data of sources external to electronic medical records (EMRs) to better inform future research design choices for real-world data (RWD)–based studies in oncology.
When enriched (ie, containing structured and unstructured curated data) EMR data were augmented, the additional data sources added varying degrees of improvement to the mortality data found in the EMR data. Additionally, we found that using mortality data from an EMR source alone overestimated the overall survival (OS) in patients with cutaneous melanoma.
The utilization of multiple data sources is necessary to accurately estimate OS when using RWD in the oncology setting.
In studies using RWD, complete and accurate mortality data are an important factor, particularly in studies evaluating patient survival or treatment effectiveness. In the oncology setting, a patient's date of death is a necessary variable for deriving end points such as overall survival (OS) as well as for censoring purposes for other end points. Accurate and timely mortality data are important in determining the effectiveness of treatments.4-7 In the context of RWE, mortality data are often documented by providers in structured (ie, discrete predetermined categories that become readily analyzed when tabulated) or unstructured data (ie, open-ended, free text, or written notes that otherwise require manual tabulation before analysis) in their electronic medical record (EMR) systems. EMRs are not designed to capture mortality for research purposes and have been shown to lack completeness.8,9 Mortality data may be missing in EMR data because patients receive care in multiple systems or lack follow-up care.
Several other data sets have been considered to augment EMR mortality data. The National Death Index (NDI) is often considered the gold standard for mortality data because of the coverage and accuracy of the data, but usage as a data source is challenging in practical terms because of the 12- to 24-month lag between death and reporting. The Social Security Administration (SSA) Death Master File (DMF) is another public use file that has a high sensitivity rate measured against NDI.10 However, in 2011, the SSA removed about 4.2 million (4.7%) death records from the DMF because of the removal of protected state death records.11 In addition, around 40% of new deaths are not recorded in DMF because of the exclusion of protected state death records,12,13 and previous studies have shown that the DMF should not be used in isolation for mortality data.14 The Veterans Administration Beneficiary Identification and Record Locator Subsystem similarly has high sensitivity compared with NDI, but it is not representative of the general population.15 Across data sets, efforts meant to ensure compliance with the Health Insurance Portability and Accountability Act of 1996 have led to suppression of exact dates of death to weekly or monthly levels, imparting challenges in the accuracy of mortality outcomes.
Considering these latency, completeness, and external validity challenges, other data products have been developed that incorporate multiple sources, such as SSA DMF, claims data, and/or public obituary data, to create a more comprehensive source of mortality data.16 These data products can augment EMR data sources by improving the completeness of mortality data through data linkage that captures deaths recorded outside of the EMR data systems. Previous studies have reported that augmenting EMR data with SSA and commercial death data improves sensitivity relative to the NDI.17-19 However, to our knowledge, no study has estimated to what extent obituary and claims data sources, along with SSA, augment enriched EMR data sources. Enriched EMR data are defined as including both extracted structured data and manually curated unstructured data. We evaluated the capture of mortality data of sources external to enriched EMR sources and the impact of augmenting enriched EMR-based data with the external data sources on OS in patients with cutaneous melanoma. These findings can better inform research design choices for future RWD-based studies in oncology.
This was a retrospective, observational cohort study using deidentified data from the ConcertAI, LLC EMR data set.20 The data set is a repository of EMR data available through data-sharing agreements with practices and other data providers. The enriched EMR data set includes data drawn from more than 100 principally nonacademic community-based oncology treatment sites. Practice sizes vary and include small community practices, major health systems, and large multisite oncology groups in both rural and urban settings throughout the United States. The data in the enriched EMR data set include mortality data from structured fields as well as fields curated by trained data curators from unstructured data (text and image documents, physician notes, etc). Any unique identifying patient information was fully removed from the analytical data set, and only deidentified data were used in the analysis.
Consistent with augmentation approaches used previously in the literature and industry,17-19 the enriched EMR data set was linked using patient-level data tokenization with external sources, including SSA, obituary, and an administrative open-claims data set that covers over 300 million patients and 98% of payers. The process of connecting patients across various data sources relied on a proprietary probabilistic matching algorithm provided by a third-party data vendor, which is designed as user-friendly software. The algorithm used patient identifiable information (PII) such as patient name, date of birth, and address to generate patient-level tokens that remained consistent across different data sources, as long as the underlying PII was identical. These matching tokens, with deidentified data, were then used to link a patient's records in one data set with records of the same patient in another set. The enriched EMR data set served as a reference point to compare the other external data sets.
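The idea of token-based linkage can be illustrated with a minimal sketch. It assumes a deterministic keyed hash over normalized PII, which is a deliberate simplification: the vendor algorithm used in the study is proprietary and probabilistic, and its details are not described here. The key, the field names, and the helpers `normalize`, `make_token`, and `link` are hypothetical, and the records are fabricated for illustration.

```python
import hashlib
import hmac

# Hypothetical secret shared by the parties that generate tokens (assumption).
SECRET_KEY = b"example-key-not-from-the-study"


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(value.lower().split())


def make_token(name: str, date_of_birth: str, address: str) -> str:
    """Return a keyed hash of normalized PII; identical PII gives an identical token."""
    message = "|".join(normalize(v) for v in (name, date_of_birth, address))
    return hmac.new(SECRET_KEY, message.encode("utf-8"), hashlib.sha256).hexdigest()


def link(emr_rows, external_rows):
    """Join deidentified rows from two sources on their tokens."""
    external_by_token = {row["token"]: row for row in external_rows}
    return [
        (row, external_by_token[row["token"]])
        for row in emr_rows
        if row["token"] in external_by_token
    ]


# Fabricated illustrative records, not study data.
emr = [{"token": make_token("Jane Doe", "1950-02-03", "1 Main St"), "emr_death": None}]
obituary = [
    {"token": make_token("jane  doe", "1950-02-03", "1 MAIN ST"), "obit_death": "2021-06-01"}
]
pairs = link(emr, obituary)
```

Only the tokens and deidentified fields are needed at linkage time, which is why the join itself never touches names or addresses. A real probabilistic matcher would also tolerate misspellings and address changes, which this exact-match sketch does not.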
The start of the study period was January 1, 2015, and the data cutoff for all patients was either March 31, 2022, or the end of the patient's record, whichever came first. All data sources were actively collecting data during the study period. The study was exempt from review by an institutional review board, and no waiver of authorization was required per US DHHS 45 CFR Part 46 because it only used deidentified secondary data.
The patients included were age 18 years or older at initial diagnosis of any-stage cutaneous melanoma, identified using diagnosis codes (International Classification of Diseases [ICD] 10: C43.xx, D03.xx; ICD 9: 172.xx). Patients were included if present in both the enriched EMR data set and the administrative claims data set.
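The inclusion criteria above can be sketched as a simple filter. This is an illustration only, assuming a flat patient record with hypothetical field names (`age_at_diagnosis`, `dx_code`, `icd_version`, `in_emr`, `in_claims`); the study's actual cohort-selection code is not described in the text.

```python
# Code families named in the text: ICD-10 C43.xx and D03.xx, ICD-9 172.xx.
ICD10_PREFIXES = ("C43", "D03")
ICD9_PREFIXES = ("172",)


def is_melanoma_code(code: str, version: int) -> bool:
    """Check whether a diagnosis code falls in the melanoma families listed above."""
    prefixes = ICD10_PREFIXES if version == 10 else ICD9_PREFIXES
    return code.strip().upper().startswith(prefixes)


def eligible(patient: dict) -> bool:
    """Apply the age, diagnosis, and data-presence criteria to one record."""
    return (
        patient["age_at_diagnosis"] >= 18
        and is_melanoma_code(patient["dx_code"], patient["icd_version"])
        and patient["in_emr"]
        and patient["in_claims"]
    )


# Fabricated example records, not study data.
patients = [
    {"age_at_diagnosis": 61, "dx_code": "C43.59", "icd_version": 10, "in_emr": True, "in_claims": True},
    {"age_at_diagnosis": 17, "dx_code": "C43.59", "icd_version": 10, "in_emr": True, "in_claims": True},
    {"age_at_diagnosis": 70, "dx_code": "172.5", "icd_version": 9, "in_emr": True, "in_claims": False},
]
cohort = [p for p in patients if eligible(p)]
```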
Patient characteristics and clinical outcomes were assessed descriptively for the overall sample population. Continuous variables were described using mean and standard deviation, while categorical variables were described using frequencies and percentages.
We generated a series of cross-tabulations for the number of death dates present in the following pairs of data: (1) SSA versus enriched EMR, (2) obituary versus enriched EMR, (3) claims versus enriched EMR, (4) SSA + obituary versus enriched EMR, and (5) claims versus SSA + obituary. Fisher's exact test was used to compare the groups. A Kaplan-Meier (KM) analysis was then conducted to measure OS for all patients with melanoma in each of the individual data sets as well as in each of their pairings with the EMR data set, all measured against a combined data set. OS was defined as the time from initial melanoma diagnosis date to death. If no death date was present, patients were censored on the date of last encounter in the EMR data. When a date of death was present in multiple data sources, the date used for the combined data set was selected in the following order of priority: enriched EMR data set (first), SSA data set (second), obituary data set (third), and claims data set (fourth). For example, if a patient had a date of death in one data source 2 years before another, the date from the higher-priority source was used regardless of which date was earlier. A log-rank test, a nonparametric test used to compare the survival distributions of two or more groups, was used to determine if there was a significant difference between the KM curves.
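The source-priority rule and the OS definition above can be sketched as follows. The priority order and the censoring rule come from the text; the dictionary layout, the function names, and the 365.25-day year used to convert days to years are assumptions made for illustration.

```python
from datetime import date

# Source priority as stated in the text: EMR, then SSA, then obituary, then claims.
PRIORITY = ("emr", "ssa", "obituary", "claims")


def composite_death_date(death_dates):
    """Return the death date from the highest-priority source that has one."""
    for source in PRIORITY:
        if death_dates.get(source) is not None:
            return death_dates[source]
    return None


def os_observation(diagnosis, death_dates, last_emr_encounter):
    """Return (follow-up in years, event flag) for one patient.

    Patients without any death date are censored at the last EMR encounter.
    """
    death = composite_death_date(death_dates)
    end = death if death is not None else last_emr_encounter
    years = (end - diagnosis).days / 365.25  # assumed day-to-year conversion
    return years, death is not None


# Fabricated example: obituary and claims disagree by 2 years; obituary outranks claims.
dates = {"emr": None, "ssa": None, "obituary": date(2021, 5, 7), "claims": date(2019, 5, 5)}
chosen = composite_death_date(dates)
years, died = os_observation(date(2016, 5, 7), dates, date(2018, 1, 1))
```

In this example the later obituary date is selected even though the claims date is earlier, because selection depends on source rank rather than on the dates themselves.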
A total of 3,882 patients were included in this study. The mean age of the patients at diagnosis was 60.8 years. A majority of the patients were male (58.8%), and a vast majority were White (91.4%). About one third of patients were initially diagnosed with stage III or stage IV melanoma, while about one third were diagnosed with stage 0-II and about one third had undocumented staging at initial diagnosis (Table 1).
The enriched EMR data set identified 1,085 patients with a date of death. The SSA data set identified 213 patients with a date of death. The obituary data set identified 1,127 patients with a date of death, and the administrative open claims data set identified 378 (Table 2, panels A-C; Fig 1).
Augmenting the enriched EMR with SSA data added 52 records for a combined 1,137 death records, increasing mortality capture by about 5% compared with the enriched EMR alone (Table 2, panel A; Table 3; Fig 1). Augmenting the enriched EMR with obituary data added 241 records for a combined 1,326 death records, increasing mortality capture by about 22% compared with the enriched EMR alone (Table 2, panel B; Fig 1). Supplementing the enriched EMR with claims data added 73 records for a combined 1,158 death records, increasing mortality capture by about 7% compared with the enriched EMR alone (Table 2, panel C; Fig 1). Joining the enriched EMR with a combined SSA and obituary data set added an incremental 251 records for a combined 1,336 death records, increasing mortality capture by about 23% (Table 2, panel D; Table 3). Finally, adding claims data to the combined SSA and obituary data set added no new death records (Table 2, panel E).
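The percentages above follow directly from the reported counts, as the short calculation below shows. The counts are taken from the text; the function and variable names are illustrative only.

```python
EMR_DEATHS = 1085  # deaths identified in the enriched EMR alone

# Deaths each source added beyond the enriched EMR, as reported in the text.
ADDED = {"SSA": 52, "obituary": 241, "claims": 73, "SSA + obituary": 251}


def incremental_capture(base, extra):
    """Return the combined death count and the percentage increase over the base."""
    return base + extra, 100 * extra / base


for source, extra in ADDED.items():
    combined, pct = incremental_capture(EMR_DEATHS, extra)
    print(f"{source}: {combined} deaths (+{pct:.1f}%)")
# SSA: 1137 deaths (+4.8%)
# obituary: 1326 deaths (+22.2%)
# claims: 1158 deaths (+6.7%)
# SSA + obituary: 1336 deaths (+23.1%)

# By inclusion-exclusion, deaths missing from the EMR but present in both SSA and obituary:
overlap = ADDED["SSA"] + ADDED["obituary"] - ADDED["SSA + obituary"]
```

The overlap works out to 42 deaths, which explains why SSA and obituary together add 251 records rather than 293.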
The composite data, composed of all data sets, had a mean (standard deviation [SD]) OS of 21.05 (0.855) years and a median (95% CI) OS of 13.4 (12.1 to 14.8) years. In ascending order of survival duration, obituary data alone yielded a mean (SD) OS of 24.14 (1.013) years and a median (95% CI) OS of 17.2 (15.2 to 19.2) years. The enriched EMR yielded a mean (SD) OS of 24.47 (1.025) years and a median (95% CI) OS of 17.6 (16.5 to 20.2) years. SSA alone yielded a mean (SD) OS of 33.60 (0.339) years, while open claims alone yielded a mean (SD) OS of 45.57 (1.248) years and a median (95% CI) OS of 62.7 years (Fig 2).
Combining the data sources shortened survival estimates. Enriched EMR combined with open claims yielded a mean (SD) OS of 23.42 (0.973) years and a median (95% CI) OS of 16.8 (15.1 to 17.8) years. Enriched EMR combined with SSA yielded a mean (SD) OS of 23.86 (0.998) years and a median (95% CI) OS of 17.2 (15.3 to 19.1) years (Fig 3). Since the claims data did not add any new death observations to the SSA and obituary combined data set, the fully combined data set composed of EMR, SSA, obituary, and claims data was identical to the combination of EMR, SSA, and obituary sources.
We found that augmenting the mortality values in an enriched EMR data set with mortality values from external sources increases the overall number of death records captured. The addition of these deaths from all three sources (ie, SSA, obituary, and claims) led to a statistically significant difference in the OS calculation for the patient population, with an overall decrease in estimated OS.
This research adds to the literature by demonstrating the extent to which the external data sources, including SSA, obituary, and claims data sets, enhance mortality data capture compared with enriched EMR data only. When used as a single source of augmentation, each of the external data sources added unique deaths to the enriched EMR data set. The obituary data set added the most value, followed by claims and then SSA. The obituary data set added the most death records to the enriched EMR data set and also provided the most incremental deaths in the composite data set. This may be because the obituary source compiles publicly available mortality information and may have the most current information. The administrative open claims data set captured over 70 deaths not identified in the enriched EMR. Deaths found in the claims data set but not in the EMR likely occurred in care settings beyond the visibility of the provider network. For example, about 80% of this network is in the community-based physician-office setting, so deaths occurring in a hospital that were not documented in the physician's notes would not show up in the EMR. Furthermore, deaths occurring in nononcology settings, such as the emergency room, may similarly not make their way back to the managing oncologist's notes. The SSA data set, although known to be highly sensitive compared with the NDI, does have limitations in the mortality data it provides, specifically data lag and limited state-reported deaths.10,12,13
Previous studies have found that augmenting EMR data with other data sources improves the completeness of mortality data when benchmarked against the NDI.17,18 However, to our knowledge, no study has extensively compared a manual chart review-enriched EMR data set with one that has been additionally augmented with commercially available mortality data sources, including claims data. These findings demonstrate the importance of augmenting EMR data in defining clinical end points in oncology research. In this setting, without augmentation, over 250 patient deaths would not have been captured, and the median OS would have been misreported as about 4 years longer. Other clinical oncology end points, such as progression-free survival, rely on dates of death for identifying progression events and for censoring purposes and are likely also affected by incomplete data.
The findings also illustrate that single sources of death data report misleading OS rates. Specifically, the SSA and claims data show much longer survival rates compared with the composite data. These apparently superior survival rates are a result of the incompleteness of the death data found in those sources. This result also highlights the need to incorporate multiple sources of death data when conducting real-world studies, even if EMR data are not being used for a particular study.
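The mechanism behind these inflated estimates can be shown with a small Kaplan-Meier sketch on fabricated data. It is a plain product-limit estimator written for illustration, not the study's analysis code, and it assumes that a patient whose death is missed is instead censored at the same time point (in the study, censoring occurred at the last EMR encounter).

```python
from fractions import Fraction


def kaplan_meier(observations):
    """Product-limit estimate from (time, event) pairs.

    Returns a list of (time, survival probability) at each time with a death.
    """
    at_risk = len(observations)
    survival = Fraction(1)
    curve = []
    for t in sorted({time for time, _ in observations}):
        deaths = sum(1 for time, event in observations if time == t and event)
        leaving = sum(1 for time, _ in observations if time == t)
        if deaths:
            survival *= Fraction(at_risk - deaths, at_risk)
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


def median_survival(curve):
    """First time at which survival drops to 50% or below, or None if never reached."""
    for t, s in curve:
        if s <= Fraction(1, 2):
            return t
    return None


# Fabricated cohort of six patients: (years of follow-up, death observed?).
complete = [(1, True), (2, True), (3, True), (4, True), (5, False), (6, False)]
# Same cohort, but the deaths at years 2 and 4 are missing from the single source.
incomplete = [(1, True), (2, False), (3, True), (4, False), (5, False), (6, False)]

median_complete = median_survival(kaplan_meier(complete))
median_incomplete = median_survival(kaplan_meier(incomplete))
```

With complete death capture the median is reached at year 3. With two deaths missing, the curve never falls to 50%, so the median is not reached and survival appears longer, which mirrors the direction of the effect reported for the SSA and claims data.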
There are several limitations to this study. Since these are RWD sources, the data found within all sources were not collected for research purposes. Some dates of death may not have been captured by any data source. An analysis was not conducted on the accuracy of the dates of death, but instead only the presence of the dates. For the KM analyses, the EMR date of death was assumed to be the most accurate since the data were not suppressed and the open-claims data were suppressed to the Sunday of the week the death occurred. Our reference group of enriched EMR data may understate the magnitude of mortality measure improvement, given that many EMR-based data rely only on structured data. It is possible that, owing to potentially different data refresh cadences, the SSA and obituary data may not be completely current. Our results are also particular to the melanoma setting with various stages at diagnosis and may not fully extend to other tumor types whose progression timing and patterns inform the relative gains in augmented mortality measures. However, although the magnitude of gains may vary, we believe that the impact of the augmentation of different data sources on mortality data capture may be directionally similar for other tumor types.18 Finally, data tokenization using probabilistic matching was used to safeguard patients' protected health information; however, tokenization makes it difficult to determine the accuracy of linkage. Any inaccuracies with that process would contaminate our results, which in turn may lead to misattributing a person who is alive as having a date of death or vice versa. Although this is possible, we have no indications of any specific issues related to this procedure.
In conclusion, when the enriched EMR data set was augmented with external data sets, every combination captured more mortality data than the enriched EMR data alone. The obituary data set added the most value to the enriched EMR data, followed by claims, and then SSA. Notably, claims did not add anything new once obituary and SSA sources had been included. Overall, augmenting the enriched EMR data with all data sources together had a significant impact on the OS results, which suggests RWD research benefits from using more than one data source for mortality data even when the EMR data have been pre-enriched with unstructured data.
Supported by ConcertAI, LLC.
ConcertAI, LLC does not make data sets publicly available because study data are used under license from source practices and other data providers. ConcertAI, LLC will consider requests to access study data sets on a case-by-case basis.
Conception and design: All authors
Administrative support: Ping Shao, Jon G. Tepsick, Herman E. Ray
Collection and assembly of data: Ping Shao, Jon G. Tepsick, Herman E. Ray
Data analysis and interpretation: All authors
Manuscript writing: All authors
Final approval of manuscript: All authors
Accountable for all aspects of the work: All authors
The following represents disclosure information provided by authors of this manuscript. All relationships are considered compensated unless otherwise noted. Relationships are self-held unless noted. I = Immediate Family Member, Inst = My Institution. Relationships may not relate to the subject matter of this manuscript. For more information about ASCO's conflict of interest policy, please refer to www.asco.org/rwc or ascopubs.org/cci/author-center.
Open Payments is a public database containing information reported by companies about payments made to US-licensed physicians.
Employment: UMass Memorial Hospital
Stock and Other Ownership Interests: Regeneron
Research Funding: ConcertAI, Regeneron
Travel, Accommodations, Expenses: Regeneron
Jon G. Tepsick
Research Funding: Merck (Inst), Merck KGaA (Inst), Janssen (Inst), Astellas Pharma (Inst), Lilly (Inst), Regeneron (Inst), Sandoz (Inst), AbbVie (Inst), Daiichi Sankyo (Inst), Bristol Myers Squibb (Inst), Gilead Sciences (Inst), EQRx (Inst)
Travel, Accommodations, Expenses: ConcertAI
Herman E. Ray
Consulting or Advisory Role: ConcertAI
No other potential conflicts of interest were reported.
The authors thank Mark S. Walker, Lincy S. Lal, Yana Natanzon, and Lukas Slipski (ConcertAI, LLC) for their contributions. The authors would also like to thank Dr Ravi Parikh (consultant to ConcertAI, LLC) for his critical review and medical insights.
1. US Food and Drug Administration: Framework for FDA's real-world evidence program. https://www.fda.gov/media/120060/download
2. Khozin S, Blumenthal GM, Pazdur R: Real-world data for clinical evidence generation in oncology. J Natl Cancer Inst 109:djx187, 2017
3. US Food and Drug Administration: Real-world data: Assessing electronic health records and medical claims data to support regulatory decision making for drug and biological products. https://www.fda.gov/media/152503/download
4. Banerjee R, Prasad V: Are observational, real-world studies suitable to make cancer treatment recommendations? JAMA Netw Open 3:e2012119, 2020
5. da Graca B, Filardo G, Nicewander D: Consequences for healthcare quality and research of the exclusion of records from the death master file. Circ Cardiovasc Qual Outcomes 6:124-128, 2013
6. Lasiter L, Tymejczyk O, Garrett-Mayer E, et al: Real-world overall survival using oncology electronic health record data: Friends of Cancer Research Pilot. Clin Pharmacol Ther 111:444-454, 2022
7. Rivera DR, Henk HJ, Garrett-Mayer E, et al: The Friends of Cancer Research Real-World Data Collaboration Pilot 2.0: Methodological recommendations from oncology case studies. Clin Pharmacol Ther 111:283-292, 2022
8. Weiskopf NG, Hripcsak G, Swaminathan S, et al: Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 46:830-836, 2013
9. Weiskopf NG, Rusanov A, Weng C: Sick patients have more data: The non-random completeness of electronic health records. AMIA Annu Symp Proc 2013:1472-1477, 2013
10. Sohn MW, Arnold N, Maynard C, et al: Accuracy and completeness of mortality data in the Department of Veterans Affairs. Popul Health Metr 4:2, 2006
11. US Department of Commerce: Change in public Death Master File records. https://ladmf.ntis.gov/docs/import-change-dmf.pdf
12. Blackstone EH: Demise of a vital resource. J Thorac Cardiovasc Surg 143:37-38, 2012
13. Maynard C: Changes in the completeness of the social security death master file: A case study. Internet J Epidemiol 11:2, 2013
14. Navar AM, Peterson ED, Steen DL, et al: Evaluation of mortality data from the Social Security Administration death master file for clinical research. JAMA Cardiol 4:375-379, 2019
15. Cowper DC, Kubal JD, Maynard C, et al: A primer and comparative review of major US mortality databases. Ann Epidemiol 12:462-468, 2002
16. Datavant: Mortality data in healthcare analytics: Sourcing robust data in a HIPAA-compliant manner. https://datavant.com/wp-content/uploads/2021/08/White-Paper_-Mortality-Data-in-Health-Care.pdf
17. Curtis MD, Griffith SD, Tucker M, et al: Development and validation of a high-quality composite real-world mortality endpoint. Health Serv Res 53:4460-4476, 2018
18. Zhang Q, Gossai A, Monroe S, et al: Validation analysis of a composite real-world mortality endpoint for patients with cancer in the United States. Health Serv Res 56:1281-1287, 2021
19. Lerman MH, Holmes B, St Hilaire D, et al: Validation of a mortality composite score in the real-world setting: Overcoming source-specific disparities and biases. JCO Clin Cancer Inform 5:401-413, 2021
20. ConcertAI: Products & services. https://www.concertai.com/products/