Privacy protection of sexually transmitted infections information from Chinese electronic medical records

Gong, Mengchun; Yu, Yue; Ouyang, Zihao; Shi, Wenzhao; Liu, Chao; Wang, Qilin; Nan, Jiale; Cai, Endi; Ding, Fen; Nie, Sheng

doi:10.1038/s41598-024-84658-9

Download PDF

Article
Open access
Published: 08 January 2025

Privacy protection of sexually transmitted infections information from Chinese electronic medical records

Mengchun Gong^1,2,3^na1,
Yue Yu²^na1,
Zihao Ouyang²^na1,
Wenzhao Shi²,
Chao Liu²,
Qilin Wang²,
Jiale Nan²,
Endi Cai²,
Fen Ding² &
…
Sheng Nie⁴

Scientific Reports volume 15, Article number: 1296 (2025) Cite this article

2137 Accesses
22 Altmetric
Metrics details

Subjects

Abstract

The comprehensive adoption of Electronic Medical Records (EMRs) offers numerous benefits but also introduces risks of privacy leakage, particularly for patients with Sexually Transmitted Infections (STI) who need protection from social secondary harm. Despite advancements in privacy protection research, the effectiveness of these strategies in real-world data remains debatable. The objective is to develop effective information extraction and privacy protection strategies to safeguard STI patients in the Chinese healthcare environment and prevent unnecessary privacy leakage during the data-sharing process of EMRs. The research was conducted at a national healthcare data center, where a committee of experts designed rule-based protocols utilizing natural language processing techniques to extract STI information. Extraction Protocol of Sexually Transmitted Infections Information (EPSTII), designed specifically for the Chinese EMRs system, enables accurate and complete identification and extraction of STI-related information, ensuring high protection performance. The protocol was refined multiple times based on the calculated precision and recall. Final protocol was applied to 5,000 randomly selected EMRs to calculate the success rate of privacy protection. A total of 3,233,174 patients were selected based on the inclusion criteria and a 50% entry ratio. Of these, 148,856 patients with sensitive STI information were identified from disease history. The identification frequency varied, with the diagnosis sub-dataset being the highest at 4.8%. Both the precision and recall rates have reached over 95%, demonstrating the effectiveness of our method. The success rate of privacy protection was 98.25%, ensuring the utmost privacy protection for patients with STI. Finding an effective method to protect privacy information in EMRs is meaningful. We demonstrated the feasibility of applying the EPSTII method to EMRs. Our protocol offers more comprehensive results compared to traditional methods of including STI information.

De-identification of electronic health record using neural network

Article Open access 29 October 2020

EMR usability and patient safety: a national survey of physicians

Article Open access 15 May 2025

Chinese medical named entity recognition integrating adversarial training and feature enhancement

Article Open access 28 April 2025

Background

The comprehensive adoption of EMRs offers both tangible and intangible benefits to hospitals¹, including enhanced quality of care, reduced medical errors, decreased costs², and unrestricted access to patient information across time and space³. However, medical information is generally considered highly sensitive, and any privacy breach can cause direct or indirect harm to patients⁴. The increasing reliance on EMRs correspondingly raises the potential negative impact of privacy breaches through EMRs, involving prescription records⁵, diagnostic codes⁶, genomic data with allele frequency⁷, and improperly published medical data. These breaches can lead to unintended losses for both hospitals and patients. Such privacy breaches can damage an individual’s social reputation and normal social behavior and can be nearly destructive for patients with STI.

Personal information related to STI is directly tied to personal privacy and personality rights. Protecting this information is essential for safeguarding individual human rights. Studies have shown that if hospitals lack adequate protective and management measures, these privacy concerns can become public knowledge through word of mouth, causing significant psychological harm to patients and deterring them from seeking standardized treatment^8,9. Moreover, due to the particular nature of STI, personal information also involves the legitimate interests of others and public health safety^10,11. International experience demonstrates that focusing unilaterally on either aspect can lead to mistakes¹². Overemphasizing public health safety and the unilateral and mandatory nature of STI information management may infringe upon the privacy rights of the individual, while overemphasizing the absolute protection of individual privacy rights may harm the legitimate interests of others (such as the health and life rights of their sexual partners) and public health safety. The correct approach is to properly balance the protection and management of personal information related to STI in all relevant fields, to achieve mutual reinforcement and promotion of health and human rights¹³.

The fear of disclosure and its associated consequences is particularly pronounced in certain cultural contexts. When deciding to access Human Immunodeficiency Virus (HIV) counseling, testing, and treatment services, clients often worry about the potential for their status to be disclosed and the negative consequences or risks that may follow. Lyimo et al. pointed out that these risks are associated with societal beliefs and perceptions of the disease¹⁴, as HIV/Acquired Immunodeficiency Syndrome (AIDS) is viewed as contagious, severe, life-threatening, and potentially caused by norm-violating behaviors, such as prostitution, homosexuality, and promiscuity. In Ghana, these risks are often culturally perceived as shameful (animguaseè), disrespectful (onnibuo), and dishonorable (onnianimuonyam). Consequences include divorce, rejection, ostracism, discrimination, and unemployment^15,16. Fear of stigmatization and its consequences negatively impacts potential clients seeking HIV testing, diagnosis, and treatment in sub-Saharan African countries, including Ghana. This situation poses a serious challenge for stakeholders committed to preventing the spread of HIV/AIDS and infections like that¹⁷.

Currently, due to the lack of robust management policies and technical safeguards for privacy protection in Chinese EMRs, and the absence of a unified industry standard, unauthorized use, leakage, and even illegal selling of medical data and information are becoming rampant¹⁸. With the continuous increase in patients’ self-protection awareness, medical disputes arising from inadequate protection of sensitive information have surged, significantly affecting the process of building an effective privacy protection mechanism. Therefore, addressing these issues has become one of the primary challenges in promoting the collection, transmission, and sharing of Chinese EMRs data.

Chinese health institutions across the country utilize a similar framework for EMRs, which includes sub-datasets such as diagnostic information, medical advice, laboratory test results, examination details, and surgical records¹⁹. Despite this standardization, challenges related to the continuity and integrity of EMRs documentation hinder effective multicenter data integration²⁰. Traditional methods for extracting information and ensuring privacy in Real-World Evidence (RWE) research often concentrate on fixed sub-datasets, like diagnosis and present illness history, along with specific data entities, such as confirmed records within patient EMRs. These fixed-___location identification strategies lead to significant shortcomings in data protection in RWE studies.

Currently, the mainstream strategies for protecting sensitive medical information internationally include Blockchain-Based Electronic Medical Record Exchanging²¹, with most also utilizing differential privacy and federated learning algorithms. These methods ensure that raw data is not shared, only statistical results are, which is suitable for multi-center research. However, this approach has the drawback of poor data availability and cannot perform operations that require direct access to the raw data. There are also data protection methods based on vector encryption algorithms²², which rely on the identification of sensitive information, followed by encryption for easier transmission and sharing^23,24. Additionally, as more data is stored in the cloud, privacy protection for keyword search and data queries in the cloud is becoming increasingly important.

The aforementioned privacy protection algorithms require substantial computational resources and suffer from poor encryption and sharing efficiency, as well as reduced data availability. Therefore, this study addresses the potential issue of unnecessary leakage of STI information during electronic health record data sharing by developing a desensitization protocol for STI information to fill this gap. Unlike existing desensitization methods that target specific fields, this protocol can scan the entire database, effectively blocking the risk of information leakage. Compared to other privacy protection strategies, this approach increases precision by reducing the false positive rate, preventing the desensitization of data that should not be altered, which could affect data usability. Moreover, it offers higher recognition efficiency, with minimal time and resource consumption during full database scans using regular expressions, allowing for rapid desensitization and reducing the impact on data usage efficiency.

This study will introduce the data we used and the corresponding privacy protection techniques in the Methods section. The Results section will present the data analysis outcomes and the effectiveness and practical impact of the privacy protection methods. The Discussion section will highlight the significance of the study, the areas that are lacking, and the research topics that we hope to include in future studies.

This study aims to automate and accurately extract information pertaining to STI from Chinese EMRs through advanced technology. Safeguarding this information is essential for maintaining patient privacy²⁵. Additionally, the research will investigate extraction techniques to develop suitable privacy protection strategies, thereby mitigating the risk of exposing sensitive STI information during data transmission. The research team has previously made strides in safeguarding pregnancy and gestation information²⁶. In this study, we will implement keyword-based techniques to improve the efficiency of information extraction and enhance the recall rate of privacy-related data.

Methods

Data source

This retrospective study utilized the Chinese Renal Disease Data System (CRDS) database, a comprehensive national EMRs repository. The CRDS comprises data from 19 tertiary referral hospitals across 10 provinces, representing China’s five geographical regions: North, Central, East, South, and Southwest. Complete EMRs from each hospital were transferred to the central database at Nanfang Hospital of Southern Medical University in Guangzhou. In this study we selected all patients with visit records from the database, specifically those whose visits occurred between the beginning of 2010 and the end of 2020, resulting in a total of 11,759,139 patients.

Structure of Chinese EMRs

According to Chinese national specifications for standard EMRs structure, Chinese EMRs consist of similar sub-datasets with minor differences in nomenclature, including diagnosis, test results, examine results, surgical sheets, medication, and medical record texts. The medical record texts are further divided into ten sections: course records, admission records, discharge records, referral records, consultation records, nursing records, death records, surgical notes, informed consent forms, and others. In the CRDS, the admission record texts have been preprocessed using Natural Language Processing (NLP) to include allergic history, chief complaint, disease history, tobacco and alcohol history, family history, marriage history, surgical history, and toxic exposure history. The general structure of Chinese EMRs is illustrated in Fig. 1.

Patient inclusion

This study utilized data from the CRDS database, encompassing patient records from January 1, 2010 to December 31, 2020. The statistical timeline included each patient’s last visit information, along with all previous medical histories. After removing duplicate entries, a random selection of 50% of the patients was made for statistical analysis. The admission record texts were preprocessed using NLP techniques within the CRDS to extract details and improve readability.

Enhance the dictionary of STI utilizing public corpora

This study aims to use regular expressions for extracting STI privacy information. To improve the accuracy of regular expression recognition, a public corpus will first be used to construct an STI-related corpus. The key term extraction in this paper utilizes the word2vec method from the field of NLP²⁷. Initially, the names of STI, along with their synonyms, are used as entry words for retrieval. Searches are conducted on internet diagnostic and treatment platforms, as well as online medical Q&A platforms, to acquire textual information about the diseases, covering dimensions such as diagnosis, related surgical procedures, relevant chief complaint descriptions, history of STI, and descriptions of STI.

The textual information for each disease is integrated into a single document, and “cut_all” word segmentation strategy is performed using the Jieba tool to incorporate as many word combinations as possible, while avoiding situations where words nested within a longer word cannot be recognized. For long terms (usually with more than 5 characters) that cannot be accurately segmented, a ___domain-specific dictionary composed of disease-related terms and vocabulary is formulated to assist Jieba segmentation seeing Fig. 2.

After obtaining the segmentation results, the word2vec method is used to vectorize the text representation. We adopt a grid search strategy in training the word2vec model to obtain the optimal information representation²⁸. Once training is completed, we use the model to predict each disease’s top 20 most relevant keywords to fulfill the dictionary, and the more index value closer to 1, the more relevant they are. The detail hyperparameter settings for the two methods are in the Supplementary Material 1. For the top 20 most relevant keywords, two professional healthcare practitioners were asked to conduct an independent review, excluding common terms and identifying synonyms for the disease to expand the dictionary, thereby enriching the regular expression.

Extract STI information using regular expressions

Common methods for de-identifying sensitive information typically involve indiscriminate anonymization of specific sub-datasets within Chinese EMRs, such as patient and doctor names, or using International Classification of Diseases (ICD) codes related to sensitive diseases to locate relevant data fields for de-identification. However, many privacy-sensitive details in EMRs are hidden within lengthy text records, which are often not subjected to de-identification. Therefore, we first need to filter out sections of these texts that contain sensitive information, followed by targeted de-identification to avoid inadvertently anonymizing relevant data that could compromise the quality of the research.

Given the challenges of extraction efficiency and complexity, we employed regular expressions (regex) to search for information related to STI throughout the entire electronic medical record, rather than relying solely on diagnostic codes in specific sub-datasets.

Building on the research team’s previous work²⁶, we initially enhanced our STI dictionary using public corpora and manually selected additional extraction keywords from the top 20 most relevant terms for each disease to improve the recall rate of sensitive information. Subsequently, we developed the Extraction Protocol of Sexually Transmitted Infections Information (EPSTII) based on expert input, textbooks, guidelines, and relevant literature, continuously refining the rules through multiple extraction results.

The EPSTII is a regular expression framework designed for STI. It aims to capture all relevant privacy information related to STI by expanding the regular expressions with synonyms for these diseases. This framework was further expanded through expert review and by referencing relevant guidelines and literature. It was also validated using some data, resulting in the regular expressions that were used in subsequent real-world data, which form the basis of the EPSTII.

Manual curation and verification

Initially, we utilized the EPSTII to extract patients with STI information from the disease history database comprising 1,634,877 patients. Next, we randomly selected 1,000 patients from those with STI information and another 1,000 patients from those without STI information, both drawn from the same disease history database. Under the guidance of two experts, we calculated the precision and recall rates using the formulas provided below.

$$\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{E}\text{P}\text{S}\text{T}\text{I}\text{I}=\frac{\text{T}\text{P}}{\text{T}\text{P}\:+\:\text{F}\text{P}}$$

$$\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{E}\text{P}\text{S}\text{T}\text{I}\text{I}=\frac{\text{T}\text{P}}{\text{T}\text{P}\:+\:\text{F}\text{N}}$$

$$\text{S}\text{u}\text{c}\text{c}\text{e}\text{s}\text{s}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{S}\text{T}\text{I}\:\text{i}\text{n}\text{f}\text{o}\text{r}\text{m}\text{a}\text{t}\text{i}\text{o}\text{n}\:\text{c}\text{o}\text{n}\text{c}\text{e}\text{a}\text{l}\text{i}\text{n}\text{g}=\frac{\text{M}\text{a}\text{n}\text{u}\text{a}\text{l}\text{l}\text{y}\:\text{v}\text{e}\text{r}\text{i}\text{f}\text{i}\text{e}\text{d}\:\text{m}\text{a}\text{s}\text{k}\text{e}\text{d}\:\text{E}\text{M}\text{R}\text{s}\:\text{r}\text{e}\text{c}\text{o}\text{r}\text{d}\text{s}}{\text{M}\text{a}\text{s}\text{k}\text{e}\text{d}\:\text{E}\text{M}\text{R}\text{s}\:\text{r}\text{e}\text{c}\text{o}\text{r}\text{d}\text{s}}$$

True positives (TP) were patients correctly identified as having STI information, while false positives (FP) were patients incorrectly identified as having STI information. True negatives (TN) were patients correctly identified as not having STI information, and false negatives (FN) were patients incorrectly identified as not having STI information. Precision, the ratio of TP to the total predicted positives (TP + FP), indicates the accuracy of positive predictions. Recall, the ratio of TP to all actual positives (TP + FN), measures the model’s ability to identify actual positive cases.

After continuously refining the EPSTII rules to achieve higher precision and recall rates, we applied the EPSTII rules to the remaining EMRs to extract patients with STI information.

Privacy protection strategies

Based on our analysis of the identified information, we have developed privacy protection strategies aimed at preventing the unnecessary and inadvertent exposure of sensitive data during the RWE analysis process. With input from experts, we decided to replace each identified keyword, along with the ten characters before and after it, with asterisks (*) to de-identify sensitive STI information, thus ensuring patient privacy is protected.

Reviewers attempted to identify patients with STI within the privacy-masked sample to evaluate the effectiveness of the privacy protection strategies, as indicated by the formula presented above. In instances where it is necessary to use STI-related information, we systematically estimated the frequency of patient identity and privacy information usage to assess the risk of unnecessary privacy exposure.

Ethical considerations

This study received approval from the Medical Ethics Committee of Nanfang Hospital, Southern Medical University (approval number: NFEC-2019-213), with a waiver for patient informed consent due to its retrospective design. Additionally, it was approved by the China Office of Human Genetic Resources for Data Preservation Application (approval number: 2021-BC0037). All methods in this study were carried out in accordance with the relevant ethical guidelines and regulations and all experimental procedures involving human subjects, animals, or other regulated materials adhered to the ethical standards and guidelines outlined by the relevant governing bodies.

Results

Patient inclusion

After removing duplicate entries and applying a 50% entry ratio, 3,233,174 patients from various age groups who visited between January 1, 2010 and December 31, 2020 were selected. Notably, removing duplicates reduced the sample size from 10,539,648 to 6,466,357 due to multiple diagnostic records for individual patients. Among the sample, 3,102,028 patients had diagnosis records and 1,735,799 patients had complete medical records. The detail distribution of the EMRs sub-datasets and the workflow of our study is shown in Fig. 3.

Regex search pattern

After using the grid search method to train the word2vec model, the following keywords are obtained (Table 1. Taking gonorrhea and syphilis as examples, more details can be found in the Supplementary Material 2).

Table 1 Top 20 keywords generated by word2vec. The 3 columns are English translation, original Chinese text, and its relativity index.

Full size table

Based on the above dictionary, we investigated Chinese EMRs and, incorporating input from experts, textbooks, guidelines, and relevant literature, developed the EPSTII. Given the unique characteristics of medical record writing in various hospitals, these rules underwent iterative refinement, adjustment, validation, and continuous improvement. All regular expression search patterns were developed through expert meetings and discussions.

To implement regex, we utilized R software (version 4.2.2) to extract STI privacy information from each sub-dataset. In the following example, “disease_history” represents the sub-dataset of preprocessed admission records, and “disease_name” is the field containing the name of the disease. STI_privacy has all the patients’ records that have the disease in the search patterns.

STI_privacy = disease_history[grep(‘软下疳|软疳|软性下疳|披衣菌|衣原体|淋病|淋菌|性病性淋巴肉芽肿|生殖支原体|人型支原体|解脲支原体|梅毒|硬下疳|非淋菌性尿道炎|非淋菌尿道炎|念珠菌|(乙|丙).{0,4}肝|三阳|单纯疱疹|生殖器疱疹|生殖器溃疡|艾滋|尖锐湿疣|软疣|水疣|阴虱|蟹虱|滴虫性阴道炎|阴道滴虫|毛滴虫’, disease_history$disease_name, fixed = FALSE)]

The regular expression rules enable us to determine the proportion of patients with a history of STI across various sub-datasets of the EMRs and to quantify the frequency of STI information.

Number of STI information

Following preliminary investigations, we applied the EPSTII to the EMRs of 3,233,174 patients. Table 2 presents the distribution of STI information in diagnosis records from the 19 hospitals.

Table 2 Distribution of patients with STI information in 19 hospitals. The 4 columns are city area, total number of patients, total patients with STI information, and their proportions.

Full size table

We also plot the distribution and proportion of patients with STI in each year, as shown in Figs. 4 and 5.

In Fig. 4, the number of total patients and the number of patients with STI share similar patterns across a 10-year time span, indicating a relative constant proportion relationship, which is supported by Fig. 5. The proportion shown in Table 2; Fig. 5 demonstrate the reliability of our sample data.

Table 3 lists the total number of patients, the volume of identified STI information, and their corresponding proportions in each EMRs sub-dataset.

Table 3 Patient identified by EPSTII. From left to right, the columns display the total number of patients, the number of identified patient number, and their corresponding proportions.

Full size table

The number of patients with STI identified through diagnosis was 148,856 cases, accounting for 4.8% of the patients in the diagnosis sub-dataset. The number of patients identified through disease history reached 39,586 individuals, representing 2.4% of the patients in the disease history sub-dataset. The number of patients with STI information leakage identified through chief complaints was 35 cases, making up 0.06% of the patients in the chief complaints sub-dataset.

Test and exam results contain information such as the concentration of Hepatitis B e Antigen or Treponema Pallidum Antibody, while they do not state whether patients have STI or not. Therefore, we treated them as not containing STI information.

Frequency of recognition

A single patient can generate multiple records in the EMRs system per visit. Consequently, individual EMRs were divided into separate records based on visits, reflecting the actual EMRs storage used in RWE studies. Table 4 shows the frequency of STI information identification across different sub-datasets of Chinese EMRs. As with any patient’s record, STI can be widely identified in each sub-dataset, primarily concentrating in diagnostic records and medical record texts.

Table 4 Frequency of STI information identification. From left to right, the columns display the total number of records, the frequency of identified STI, and their corresponding proportions.

Full size table

In terms of privacy leakage, a total number of 681,942 STI records had been indentified from the diagnosis sub-datset with an overall recognition rate of 1.3%. In disease history records, 73,725 STI information could be extracted from 16,742,579 records. In family history records, 13,136 STI information could be extracted from 7,400,955 records.

Evaluation matrix

We selected precision and recall rates as our primary measurement criteria. During the manual curation and verification process, 2,500 records from patients with STI information and 2,500 records from patients without STI information were randomly chosen from the disease history datset and reviewed by two independent medical experts to determine whether STI information had been correctly identified or ignored. After multiple trainings of our protocol, the precision rate of EPSTII is 98.3%, and the recall rate is 96.8%, which means that out of the 2,500 pieces of information extracted from STI patients, 42 do not contain STI-related information, while out of the 2,500 pieces of information extracted from non-STI patients, 80 contain STI-related information.

Privacy protection

After optimizing the performance of EPSTII to a satisfactory level, 2,000 complete long text entries were randomly extracted from the medical records. EPSTII was then used to identify keywords related to STI. Under the guidance of experts, we decided to replace each identified keyword, along with 10 characters before and after it, with asterisks (*) to mask sensitive STI information and protect patient privacy. This method minimized the risk of inferring patient STI information from the EMRs. Reviewers attempted to identify patients with STI in the privacy-masked samples, and we calculated the success rate of the privacy protection strategy based on their results. In the 2,000 records with masked STI information, 35 patient was identified as carrying STI, achieving a success rate of 98.25%.

Here is one example of identifying STI information from part of medical records.

入院诊断: 1.急性淋巴细胞白血病(B-ALL)2.乙型肝炎病毒携带者

诊疗经过: 血细胞分析(静脉血): 白细胞计数11.77 × 10^9/L, 血小板计数485 × 10^9/L, 中性粒细胞8.93 × 10^9/L。。。。。。乙肝三对半: 乙型肝炎病毒表面抗原阳性(+);乙型肝炎病毒e抗体阳性(+); 乙型肝炎病毒核心抗体阳性(+)。EB病毒、巨细胞病毒阴性。腹部B超示肝稍大。.

{Admission diagnosis: (1) Acute Lymphoblastic Leukemia (B-ALL) (2) Hepatitis B virus carrier.

Diagnostic course: blood cell analysis (venous blood): white blood cell count 11.77 × 10^9/L, platelet count 485 × 10^9/L, neutrophil count 8.93 × 10^9/L…… Hepatitis B markers: hepatitis B virus surface antigen positive (+); hepatitis B virus e antibody positive (+); hepatitis B virus core antibody positive (+). EB virus, cytomegalovirus negative. Abdominal B-Ultrasound shows slightly enlarged liver.}

入院诊断: 1.急性淋巴细胞白血病(B-AL***********带者

诊疗经过: 血细胞分析(静脉血): 白细胞计数11.77 × 10^9/L, 血小板计数485 × 10^9/L, 中性粒细胞8.93 × 10^9/L。。。。。。**************抗原阳性***********抗体阳性***********心抗体阳性(+)。EB病毒、巨细胞病毒阴性。腹部B超示肝稍大。

{Admission diagnosis: 1. Acute Lymphoblastic Leukemia (B-AL*********** carrier.

Diagnostic course: blood cell analysis (venous blood): white blood cell count 11.77 × 10^9/L, platelet count 485 × 10^9/L, neutrophil count 8.93 × 10^9/L…… ************** antigen positive *********** antibody positive *********** core antibody positive (+). EB virus, cytomegalovirus negative. Abdominal B-Ultrasound shows slightly enlarged liver.}

It should also be noted that during the manual verification process by reviewers, 120 records were found to have potentially masked key data that did not contain STI privacy information, which could reduce the usability of the data. The false positive rate is 6%.

Discussion

Principal findings

In analyzing the extracted STI information and the frequency of STI identification, we observed that the quantity and frequency of STI information extracted from patients’ chief complaints were significantly lower than those extracted from disease histories and family histories. This suggests that patients who are aware of their own diseases are often reluctant to disclose information about STI to their doctors due to embarrassment and fear of a negative social image. This highlights the importance of protecting the privacy of STI information. On the other hand, patients who are unaware of their condition may not instinctively associate discomfort in their reproductive area with an STI, indicating that the negative stigma surrounding STI is deeply ingrained in people’s minds, further emphasizing the subconscious aversion to STI. Although the extraction proportion is even smaller in the case of toxic history, this is due to a large number of records denying the source of toxins, which have lowered the overall proportion.

After multiple optimizations, EPSTII achieved over 95% precision and recall rates. This success is primarily attributed to our continuous refinement of the keyword list and the integration of expert opinions, online resources, and literature information in each iteration. Currently, EPSTII encompasses a total of 31 keywords related to STI, providing comprehensive coverage of these infections.

Our privacy protection approach has achieved a 98.25% success rate, ensuring that the privacy of patients with STI is safeguarded to the greatest extent possible. However, due to improvements of our protocol happening in admission records, which did not include medication information, we failed to consider drugs specifically targeting STI.

Additionally, we acknowledge the potential for false positives. This issue arises because we have adopted a uniform length for privacy protection policies across different diseases, which may result in the incorrect concealment of some information and thereby limit the application scope of our model. Moreover, the sample size of 5,000 is relatively small compared to the vast amount of data in the database, contributing to our high accuracy. This limitation is due to the need for manual verification by experts for each extracted data point, which restricts our ability to extract and test a larger number of samples. In future studies, we plan to adopt a more targeted privacy protection method, taking into account drugs for treating STI, to precisely protect private information without obscuring other crucial information about patients. Additionally, we will expand the number of validation datasets to obtain more comprehensive and realistic outcomes.

This study uses extraction frequency to quantify privacy risks, but relying solely on this metric does not provide a comprehensive assessment of the actual risk of privacy leakage. The dataset used in this study is based on the CRDS database, which includes patient data from 19 medical institutions. Many other related studies have been conducted on this dataset. To explore the reliability of our privacy protocol, we will conduct an analysis of STI sensitive information exposure in datasets used in other types of studies in the future. On one hand, we will assess the risk of STI information being included in the data, and on the other hand, we will evaluate the sensitivity of researchers to relevant STI information and their actual exposure to it through a questionnaire survey. This will provide a more accurate risk assessment.

This study is limited to the Chinese context and the protection of privacy for sexually transmitted diseases in data sharing between Chinese medical institutions. In future work, we will refer to the General Data Protection Regulation (GDPR) to explore the applicability of the strategy in international settings and the results of English-language corpora. For cross-linguistic data sharing, we typically refer to the Observational Health Data Sciences and Informatics (OHDSI) terminology system, and while converting terms, we will combine it with our privacy protocol to enable sharing. This will also be a key focus of our upcoming work.

To our knowledge, this study is the first large-scale investigation into privacy leakage and STI identification in Chinese EMRs. Our study highlights the varying levels of awareness regarding privacy protection among EMRs users. While reliable methods for safeguarding privacy are extensively discussed in the literature^29,30, their actual implementation within healthcare institutions remains crucial, particularly concerning sensitive STI information^31,32. The 2021 Personal Information Protection Law of the People’s Republic of China defines the use of personal privacy data; however, prior efforts to establish robust standards for patient privacy protection in RWE research have been limited³³. This study is the first in China to utilize a national-level EMRs database for a quantitative assessment of the risk of privacy exposure related to STI. Our goal is to enhance protection strategies and provide a foundation for future research and the improvement of privacy protocols.

Chinese EMRs present unique challenges in extracting STI-related privacy information due to specific terminology and data standards. Traditional methods relying on diagnostic codes are limited by incomplete records and inconsistent documentation. Physicians may not always record STI information as a diagnosis, leading to lower recall rates and potential bias. The complexity of coding systems in Chinese EMRs further complicates patient identification. The EPSTII method addresses these limitations by providing more precise results, significantly enhancing the extraction of STI information with high accuracy from both patient-based and visit-based records.

Our results show that traditional fixed-___location data masking procedures can lead to unnecessary exposure of privacy information. For example, patients’ nucleic acid test results, recorded in sub-datasets, are challenging to conceal, significantly impacting patients’ privacy. Accurate identification of STI information is crucial for comprehensive privacy protection³⁴. The EPSTII method is the first to identify STI information across the entire Chinese EMRs system. It employs practical data desensitization techniques, such as data invalidation, data offset, and symmetric encryption, to prevent the misuse of private data. We have determined the optimal length of additional concealed text to retain most medical information. Quantifying the identification frequency of STI information helps researchers use EMRs wisely to avoid unnecessary privacy leakage, highlighting the richness of privacy information in EMRs. Although identification frequency alone cannot fully determine privacy leakage risk, these results underscore the importance of robust data desensitization strategies.

Limitations

Overall, this study underscores the importance of assessing privacy leakage risks and serves as a reference for effective privacy protection in Chinese EMRs. However, several limitations must be acknowledged. Firstly, the privacy risk estimates in our case study are primarily derived from the EMRs of a renal disease database. Despite the existence of official guidelines for EMRs documentation in China, discrepancies arise between the CRDS and other data networks in terms of data structure and operational environment. Therefore, specific protocols and variables should be optimized for broader applicability.

Additionally, our privacy protection method occasionally masks excessive information, such as the hepatitis B antibody levels of healthy patients, leading to the concealment of harmless data as well. This limitation stems from our reliance on regular expressions to enhance processing speed rather than employing more complex NLP techniques. Future research will explore the application of machine learning methods to better safeguard patient privacy.

Conclusions

Identifying an effective and practical method for protecting privacy information in EMRs is both significant and beneficial. We have demonstrated the feasibility of applying the EPSTII method to EMRs from 19 hospitals across various regions. We believe that EPSTII can provide valuable insights for patient inclusion in any research related to infectious diseases using Chinese EMRs. Our protocols, specifically tailored for the Chinese EMRs system, enable accurate and comprehensive identification and extraction of STI-related data, ensuring effective protection. Compared to traditional methods for including STI information, the EPSTII method yields more comprehensive results.

Data availability

The data that support the findings of this study are available on request from the corresponding author. Due to the sensitivity of the hospital data, it cannot be made publicly available.

Abbreviations

EMR:: Electronic medical records
STI:: Sexually transmitted infections
NLP:: Natural language processing
HIV:: Human immunodeficiency virus
AIDS:: Acquired immunodeficiency syndrome
RWE:: Real-world evidence
CRDS:: Chinese renal disease data system
EPSTII:: Extraction protocol of sexually transmitted infections information
TP:: True positives
FP:: False positives
TN:: True negatives
FN:: False negatives

References

Kuo, K. M., Ma, C. C. & Alexander, J. W. How do patients respond to violation of their information privacy? Health Inf. Manag. 43(2), 23–33. https://doi.org/10.1177/183335831404300204 (2014). PMID: 24948663.
Anderson, C. L. & Agarwal, R. The digitization of healthcare: Boundary risks, emotion, and consumer willingness to disclose personal health information. Inf. Syst. Res. 22 (3), 469–490. https://doi.org/10.1287/isre.1100.0335 (2011).
Article MATH Google Scholar
Zhou, L. et al. The relationship between electronic health record use and quality of care over time. J. Am. Med. Inf. Assoc. 16 (4), 457–464. https://doi.org/10.1197/jamia.M3128 (2009). Epub 2009 Apr 23. PMID: 19390094; PMCID: PMC2705247.
Hung, P. Towards a Privacy Access Control Model for e-Healthcare Services. Conference on Privacy, Security and Trust. (2005).
Emam, K. E., Dankar, F. K., Vaillancourt, R., Roffey, T. & Lysyk, M. Evaluating the risk of re-identification of patients from hospital prescription records. Can. J. Hosp. Pharm. 62 (4), 307–319. https://doi.org/10.4212/cjhp.v62i4.812 (2009). PMID: 22478909; PMCID: PMC2826964.
Article PubMed PubMed Central Google Scholar
Loukides, G., Denny, J. C. & Malin, B. The disclosure of diagnosis codes can breach research participants’ privacy. J. Am. Med. Inf. Assoc. 17 (3), 322–327. https://doi.org/10.1136/jamia.2009.002725 (2010). PMID: 20442151; PMCID: PMC2995712.
von Thenen, N., Ayday, E. & Cicek, A. E. Re-identification of individuals in genomic data-sharing beacons via allele inference. Bioinformatics 35(3), 365–371. https://doi.org/10.1093/bioinformatics/bty643 (2019). PMID: 30052749.
Gao, J., Yang, C. & Jia, T. Privacy protection needs of sexually transmitted diseases outpatients: Analysis and countermeasures. Natl. Med. Front. China 5 (15), 95–96 (2010).
MATH Google Scholar
Mundie, A., Lazarou, M., Mullens, A. B., Gu, Z. & Dean, J. A. Sexual and reproductive health knowledge, attitudes and behaviours of Chinese international students studying abroad (in Australia, the UK and the US): A scoping review. Sex. Health 18(4), 294–302. https://doi.org/10.1071/SH21044 (2021). PMID: 34399883.
Xie, Z. & Duan, Z. Balancing public health and privacy rights: A mixed-methods study on disclosure obligations of people living with HIV to their partners in China. Harm. Reduct. J. 21 (1), 30. https://doi.org/10.1186/s12954-023-00920-9 (2024). PMID: 38311762; PMCID: PMC10840163.
Article PubMed PubMed Central MATH Google Scholar
Edelman, E. J. et al. Opportunities for improving partner notification for HIV: Results from a community-based participatory research study. AIDS Behav. 18(10), 1888-97. https://doi.org/10.1007/s10461-013-0692-9 (2014). PMID: 24469221.
Clayton, E. W., Embí, P. J. & Malin, B. A. Dobbs and the future of health data privacy for patients and healthcare organizations. J. Am. Med. Inf. Assoc. 30(1), 155–160. https://doi.org/10.1093/jamia/ocac155 (2022). Erratum in: J. Am. Med. Inf. Assoc. 30(1), 208. https://doi.org/10.1093/jamia/ocac183 (2022). PMID: 36048014; PMCID: PMC9748537.
Wang, X. Research on the legal protection of personal information of AIDS patients — An analysis based on Chinese legal texts. J. Yunnan Univ. Law Ed. 26 (05), 44–51 (2013).
MATH Google Scholar
Lyimo, R. A. et al. Stigma, disclosure, coping, and medication adherence among people living with HIV/AIDS in Northern Tanzania. AIDS Patient Care STDS 28(2), 98–105. https://doi.org/10.1089/apc.2013.0306 (2014). PMID: 24517541.
Kwansa, B. K. Safety in the Midst of Stigma. Experiencing HIV/AIDS in Two Ghanaian Communities (African Studies Centre, 2013).
Dapaah, J. M. HIV/AIDS Treatment in Two Ghanaian Hospitals: Experiences of Patients, Nurses and Doctors (African Studies Centre, 2012).
Mbonu, N. C., van den Borne, B. & De Vries, N. K. Stigma of people with HIV/AIDS in Sub-saharan Africa: A literature review. J. Trop. Med. 2009, 145891. https://doi.org/10.1155/2009/145891 (2009). Epub 2009 Aug 16. PMID: 20309417; PMCID: PMC2836916.
Article PubMed PubMed Central Google Scholar
Sher, M. L., Talley, P. C., Yang, C. W. & Kuo, K. M. Compliance with electronic medical records privacy policy: An empirical investigation of hospital information technology staff. INQUIRY: J. Health Care Organ. Provis. Financ. 54 https://doi.org/10.1177/0046958017711759 (2017).
National Health Commission of the People’s Republic of China. Technical specification of hospital information platform based on electronic medical record.: National Health Commission of the People’s Republic of China. http://www.nhc.gov.cn/wjw/s9497/201406/a2014514701f4e76b14f3446f6318937.shtml (2014).
Wang, Z. Data integration of electronic medical record under administrative decentralization of medical insurance and healthcare in China: A case study. Isr. J. Health Policy Res. 8 (1), 24. https://doi.org/10.1186/s13584-019-0293-9 (2019). PMID: 30929644; PMCID: PMC6442402.
Article PubMed PubMed Central MATH CAS Google Scholar
Wu, G., Wang, S., Ning, Z. & Zhu, B. Privacy-preserved electronic medical record exchanging and sharing: A blockchain-based smart healthcare system. IEEE J. Biomed. Health Inf. 26 (5), 1917–1927 (2022). Epub 2022 May 5. PMID: 34714757.
Article MATH Google Scholar
Miao, Y. et al. Efficient privacy-preserving spatial range query over outsourced encrypted data. IEEE Trans. Inf. Forensics Secur. 18, 3921–3933. https://doi.org/10.1109/TIFS.2023.3288453. (2023).
Miao, Y. et al. Time-controllable keyword search scheme with efficient revocation in mobile e-health cloud. IEEE Trans. Mob. Comput. 23 (5), 3650–3665. https://doi.org/10.1109/TMC.2023.3277702 (2024).
Miao, Y. et al. Efficient privacy-preserving spatial data query in cloud computing. IEEE Trans. Knowl. Data Eng. 36(1), 122–136. https://doi.org/10.1109/TKDE.2023.3283020 (2024).
Farland, L. V. et al. Endometriosis and risk of adverse pregnancy outcomes. Obstet. Gynecol. 134(3), 527–536. https://doi.org/10.1097/AOG.0000000000003410 (2019). PMID: 31403584.
Liu, C. et al. Effective privacy protection strategies for pregnancy and gestation information from electronic medical records: Retrospective study in a national health care data network in China. J. Med. Internet Res. 26, e46455 (2024).
Article PubMed PubMed Central Google Scholar
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient estimation of word representations in vector space. https://doi.org/10.48550/arxiv.1301.3781 (2013).
Feurer, M. & Hutter, F.Hyperparameter optimization. in AutoML: Methods, Systems, Challenges 3–38.
Aryanto, K. Y., Oudkerk, M. & van Ooijen, P. M. Free DICOM de-identification tools in clinical research: functioning and safety of patient privacy. Eur. Radiol. 25 (12), 3685–3695. https://doi.org/10.1007/s00330-015-3794-0 (2015). Epub 2015 Jun 3. PMID: 26037716; PMCID: PMC4636522.
Article PubMed PubMed Central MATH CAS Google Scholar
Moore, W. & Frye, S. Review of HIPAA, Part 1: history, protected Health Information, and privacy and security rules. J. Nucl. Med. Technol. 47 (4), 269–272. https://doi.org/10.2967/jnmt.119.227819 (2019). Epub 2019 Jun 10. PMID: 31182664.
Article PubMed MATH Google Scholar
Cervinski, M. A. et al. Qualitative point-of-care and over-the-counter urine hCG devices differentially detect the hCG variants of early pregnancy. Clin. Chim. Acta 406 (1–2), 81–85 (2009). Epub 2009 May 27. PMID: 19477170.
Article PubMed CAS Google Scholar
Montagnana, M., Trenti, T., Aloe, R., Cervellin, G. & Lippi, G. Human chorionic gonadotropin in pregnancy diagnostics. Clin. Chim. Acta 412 (17–18), 1515–1520. https://doi.org/10.1016/j.cca.2011.05.025 (2011). Epub 2011 May 25. PMID: 21635878.
Article PubMed CAS Google Scholar
Data Security Law of the People’s Republic of China. The National People’s Congress of the People’s Republic of China. http://www.npc.gov.cn/npc/c30834/202106/7c9af12f51334a73b56d7938f99a788a.shtml (2021).
Ewuoso, C. Addressing the conflict between partner notification and patient confidentiality in serodiscordant relationships: How can Ubuntu help? Dev. World Bioeth. 20 (2), 74–85. https://doi.org/10.1111/dewb.12232 (2020). Epub 2019 May 26. PMID: 31131522.
Article PubMed Google Scholar

Download references

Funding

This work was supported by the National Key Research and Development Program of China (2021YFC2500800) and the National Key Research and Development Program of China (2023YFC2706305). This work was supported by Nanfang Hospital, Southern Medical University. We did not use generative AI in any portion of the manuscript writing.

Author information

Mengchun Gong, Yue Yu and Zihao Ouyang have contributed equally to this work.

Authors and Affiliations

School of Biomedical Engineering, Guangdong Medical University, Dongguan, China
Mengchun Gong
Digital Health China Technologies Co., Ltd., Beijing, China
Mengchun Gong, Yue Yu, Zihao Ouyang, Wenzhao Shi, Chao Liu, Qilin Wang, Jiale Nan, Endi Cai & Fen Ding
National Clinical Medical Center for Geriatric Diseases, Fudan University Affliated Huashan Hospital, Shanghai, China
Mengchun Gong
Nanfang Hospital, Southern Medical University, Guangzhou , China
Sheng Nie

Authors

Mengchun Gong
View author publications
Search author on:PubMed Google Scholar
Yue Yu
View author publications
Search author on:PubMed Google Scholar
Zihao Ouyang
View author publications
Search author on:PubMed Google Scholar
Wenzhao Shi
View author publications
Search author on:PubMed Google Scholar
Chao Liu
View author publications
Search author on:PubMed Google Scholar
Qilin Wang
View author publications
Search author on:PubMed Google Scholar
Jiale Nan
View author publications
Search author on:PubMed Google Scholar
Endi Cai
View author publications
Search author on:PubMed Google Scholar
Fen Ding
View author publications
Search author on:PubMed Google Scholar
Sheng Nie
View author publications
Search author on:PubMed Google Scholar

Contributions

Mengchun Gong, Yu Yue and Zihao Ouyang wrote the main manuscript text and Chaoliu, Qilin Wang, Jiale Nan, Endi Cai and Fen Ding provided some technical support to facilitate our data extraction work and also verified and reviewed our data. Wenzhao Shi, Mengchun Gong and Sheng Nie guided the direction of the article. All authors reviewed the manscript.

Corresponding authors

Correspondence to Mengchun Gong or Sheng Nie.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Supplementary Material 2

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Gong, M., Yu, Y., Ouyang, Z. et al. Privacy protection of sexually transmitted infections information from Chinese electronic medical records. Sci Rep 15, 1296 (2025). https://doi.org/10.1038/s41598-024-84658-9

Download citation

Received: 07 November 2024
Accepted: 25 December 2024
Published: 08 January 2025
DOI: https://doi.org/10.1038/s41598-024-84658-9

Subjects

Abstract

Similar content being viewed by others

De-identification of electronic health record using neural network

EMR usability and patient safety: a national survey of physicians

Chinese medical named entity recognition integrating adversarial training and feature enhancement

Background

Methods

Data source

Structure of Chinese EMRs

Patient inclusion

Enhance the dictionary of STI utilizing public corpora

Extract STI information using regular expressions

Manual curation and verification

Privacy protection strategies

Ethical considerations

Results

Patient inclusion

Regex search pattern

Number of STI information

Frequency of recognition

Evaluation matrix

Privacy protection

Discussion

Principal findings

Limitations

Conclusions

Data availability

Abbreviations

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Competing interests

Additional information

Publisher’s note

Electronic supplementary material

Supplementary Material 1

Supplementary Material 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links