Background

The comprehensive adoption of EMRs offers both tangible and intangible benefits to hospitals1, including enhanced quality of care, reduced medical errors, decreased costs2, and unrestricted access to patient information across time and space3. However, medical information is generally considered highly sensitive, and any privacy breach can cause direct or indirect harm to patients4. The increasing reliance on EMRs correspondingly raises the potential negative impact of privacy breaches through EMRs, involving prescription records5, diagnostic codes6, genomic data with allele frequency7, and improperly published medical data. These breaches can lead to unintended losses for both hospitals and patients. Such privacy breaches can damage an individual’s social reputation and normal social behavior and can be nearly destructive for patients with STI.

Personal information related to STI is directly tied to personal privacy and personality rights. Protecting this information is essential for safeguarding individual human rights. Studies have shown that if hospitals lack adequate protective and management measures, these privacy concerns can become public knowledge through word of mouth, causing significant psychological harm to patients and deterring them from seeking standardized treatment8,9. Moreover, due to the particular nature of STI, personal information also involves the legitimate interests of others and public health safety10,11. International experience demonstrates that focusing unilaterally on either aspect can lead to mistakes12. Overemphasizing public health safety and the unilateral and mandatory nature of STI information management may infringe upon the privacy rights of the individual, while overemphasizing the absolute protection of individual privacy rights may harm the legitimate interests of others (such as the health and life rights of their sexual partners) and public health safety. The correct approach is to properly balance the protection and management of personal information related to STI in all relevant fields, to achieve mutual reinforcement and promotion of health and human rights13.

The fear of disclosure and its associated consequences is particularly pronounced in certain cultural contexts. When deciding to access Human Immunodeficiency Virus (HIV) counseling, testing, and treatment services, clients often worry about the potential for their status to be disclosed and the negative consequences or risks that may follow. Lyimo et al. pointed out that these risks are associated with societal beliefs and perceptions of the disease14, as HIV/Acquired Immunodeficiency Syndrome (AIDS) is viewed as contagious, severe, life-threatening, and potentially caused by norm-violating behaviors, such as prostitution, homosexuality, and promiscuity. In Ghana, these risks are often culturally perceived as shameful (animguaseè), disrespectful (onnibuo), and dishonorable (onnianimuonyam). Consequences include divorce, rejection, ostracism, discrimination, and unemployment15,16. Fear of stigmatization and its consequences negatively impacts potential clients seeking HIV testing, diagnosis, and treatment in sub-Saharan African countries, including Ghana. This situation poses a serious challenge for stakeholders committed to preventing the spread of HIV/AIDS and infections like that17.

Currently, due to the lack of robust management policies and technical safeguards for privacy protection in Chinese EMRs, and the absence of a unified industry standard, unauthorized use, leakage, and even illegal selling of medical data and information are becoming rampant18. With the continuous increase in patients’ self-protection awareness, medical disputes arising from inadequate protection of sensitive information have surged, significantly affecting the process of building an effective privacy protection mechanism. Therefore, addressing these issues has become one of the primary challenges in promoting the collection, transmission, and sharing of Chinese EMRs data.

Chinese health institutions across the country utilize a similar framework for EMRs, which includes sub-datasets such as diagnostic information, medical advice, laboratory test results, examination details, and surgical records19. Despite this standardization, challenges related to the continuity and integrity of EMRs documentation hinder effective multicenter data integration20. Traditional methods for extracting information and ensuring privacy in Real-World Evidence (RWE) research often concentrate on fixed sub-datasets, like diagnosis and present illness history, along with specific data entities, such as confirmed records within patient EMRs. These fixed-___location identification strategies lead to significant shortcomings in data protection in RWE studies.

Currently, the mainstream strategies for protecting sensitive medical information internationally include Blockchain-Based Electronic Medical Record Exchanging21, with most also utilizing differential privacy and federated learning algorithms. These methods ensure that raw data is not shared, only statistical results are, which is suitable for multi-center research. However, this approach has the drawback of poor data availability and cannot perform operations that require direct access to the raw data. There are also data protection methods based on vector encryption algorithms22, which rely on the identification of sensitive information, followed by encryption for easier transmission and sharing23,24. Additionally, as more data is stored in the cloud, privacy protection for keyword search and data queries in the cloud is becoming increasingly important.

The aforementioned privacy protection algorithms require substantial computational resources and suffer from poor encryption and sharing efficiency, as well as reduced data availability. Therefore, this study addresses the potential issue of unnecessary leakage of STI information during electronic health record data sharing by developing a desensitization protocol for STI information to fill this gap. Unlike existing desensitization methods that target specific fields, this protocol can scan the entire database, effectively blocking the risk of information leakage. Compared to other privacy protection strategies, this approach increases precision by reducing the false positive rate, preventing the desensitization of data that should not be altered, which could affect data usability. Moreover, it offers higher recognition efficiency, with minimal time and resource consumption during full database scans using regular expressions, allowing for rapid desensitization and reducing the impact on data usage efficiency.

This study will introduce the data we used and the corresponding privacy protection techniques in the Methods section. The Results section will present the data analysis outcomes and the effectiveness and practical impact of the privacy protection methods. The Discussion section will highlight the significance of the study, the areas that are lacking, and the research topics that we hope to include in future studies.

This study aims to automate and accurately extract information pertaining to STI from Chinese EMRs through advanced technology. Safeguarding this information is essential for maintaining patient privacy25. Additionally, the research will investigate extraction techniques to develop suitable privacy protection strategies, thereby mitigating the risk of exposing sensitive STI information during data transmission. The research team has previously made strides in safeguarding pregnancy and gestation information26. In this study, we will implement keyword-based techniques to improve the efficiency of information extraction and enhance the recall rate of privacy-related data.

Methods

Data source

This retrospective study utilized the Chinese Renal Disease Data System (CRDS) database, a comprehensive national EMRs repository. The CRDS comprises data from 19 tertiary referral hospitals across 10 provinces, representing China’s five geographical regions: North, Central, East, South, and Southwest. Complete EMRs from each hospital were transferred to the central database at Nanfang Hospital of Southern Medical University in Guangzhou. In this study we selected all patients with visit records from the database, specifically those whose visits occurred between the beginning of 2010 and the end of 2020, resulting in a total of 11,759,139 patients.

Structure of Chinese EMRs

According to Chinese national specifications for standard EMRs structure, Chinese EMRs consist of similar sub-datasets with minor differences in nomenclature, including diagnosis, test results, examine results, surgical sheets, medication, and medical record texts. The medical record texts are further divided into ten sections: course records, admission records, discharge records, referral records, consultation records, nursing records, death records, surgical notes, informed consent forms, and others. In the CRDS, the admission record texts have been preprocessed using Natural Language Processing (NLP) to include allergic history, chief complaint, disease history, tobacco and alcohol history, family history, marriage history, surgical history, and toxic exposure history. The general structure of Chinese EMRs is illustrated in Fig. 1.

Fig. 1
figure 1

Structure of Chinese EMRs. The EMRs consist of medical records, surgery, exam and test results, medication, and diagnosis. The admission records had been preprocessed by NLP. LIS: laboratory information system. HIS: hospital information system. CPOE: computerized physician order entry.

Patient inclusion

This study utilized data from the CRDS database, encompassing patient records from January 1, 2010 to December 31, 2020. The statistical timeline included each patient’s last visit information, along with all previous medical histories. After removing duplicate entries, a random selection of 50% of the patients was made for statistical analysis. The admission record texts were preprocessed using NLP techniques within the CRDS to extract details and improve readability.

Enhance the dictionary of STI utilizing public corpora

This study aims to use regular expressions for extracting STI privacy information. To improve the accuracy of regular expression recognition, a public corpus will first be used to construct an STI-related corpus. The key term extraction in this paper utilizes the word2vec method from the field of NLP27. Initially, the names of STI, along with their synonyms, are used as entry words for retrieval. Searches are conducted on internet diagnostic and treatment platforms, as well as online medical Q&A platforms, to acquire textual information about the diseases, covering dimensions such as diagnosis, related surgical procedures, relevant chief complaint descriptions, history of STI, and descriptions of STI.

The textual information for each disease is integrated into a single document, and “cut_all” word segmentation strategy is performed using the Jieba tool to incorporate as many word combinations as possible, while avoiding situations where words nested within a longer word cannot be recognized. For long terms (usually with more than 5 characters) that cannot be accurately segmented, a ___domain-specific dictionary composed of disease-related terms and vocabulary is formulated to assist Jieba segmentation seeing Fig. 2.

Fig. 2
figure 2

Flowchart of keywords generation by word2vec. Gonorrhea, including its synonyms, were used to search online to collect information about the disease. NLP method was used to analyze the text file and obtain 20 most relevant keywords.

After obtaining the segmentation results, the word2vec method is used to vectorize the text representation. We adopt a grid search strategy in training the word2vec model to obtain the optimal information representation28. Once training is completed, we use the model to predict each disease’s top 20 most relevant keywords to fulfill the dictionary, and the more index value closer to 1, the more relevant they are. The detail hyperparameter settings for the two methods are in the Supplementary Material 1. For the top 20 most relevant keywords, two professional healthcare practitioners were asked to conduct an independent review, excluding common terms and identifying synonyms for the disease to expand the dictionary, thereby enriching the regular expression.

Extract STI information using regular expressions

Common methods for de-identifying sensitive information typically involve indiscriminate anonymization of specific sub-datasets within Chinese EMRs, such as patient and doctor names, or using International Classification of Diseases (ICD) codes related to sensitive diseases to locate relevant data fields for de-identification. However, many privacy-sensitive details in EMRs are hidden within lengthy text records, which are often not subjected to de-identification. Therefore, we first need to filter out sections of these texts that contain sensitive information, followed by targeted de-identification to avoid inadvertently anonymizing relevant data that could compromise the quality of the research.

Given the challenges of extraction efficiency and complexity, we employed regular expressions (regex) to search for information related to STI throughout the entire electronic medical record, rather than relying solely on diagnostic codes in specific sub-datasets.

Building on the research team’s previous work26, we initially enhanced our STI dictionary using public corpora and manually selected additional extraction keywords from the top 20 most relevant terms for each disease to improve the recall rate of sensitive information. Subsequently, we developed the Extraction Protocol of Sexually Transmitted Infections Information (EPSTII) based on expert input, textbooks, guidelines, and relevant literature, continuously refining the rules through multiple extraction results.

The EPSTII is a regular expression framework designed for STI. It aims to capture all relevant privacy information related to STI by expanding the regular expressions with synonyms for these diseases. This framework was further expanded through expert review and by referencing relevant guidelines and literature. It was also validated using some data, resulting in the regular expressions that were used in subsequent real-world data, which form the basis of the EPSTII.

Manual curation and verification

Initially, we utilized the EPSTII to extract patients with STI information from the disease history database comprising 1,634,877 patients. Next, we randomly selected 1,000 patients from those with STI information and another 1,000 patients from those without STI information, both drawn from the same disease history database. Under the guidance of two experts, we calculated the precision and recall rates using the formulas provided below.

$$\text{P}\text{r}\text{e}\text{c}\text{i}\text{s}\text{i}\text{o}\text{n}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{E}\text{P}\text{S}\text{T}\text{I}\text{I}=\frac{\text{T}\text{P}}{\text{T}\text{P}\:+\:\text{F}\text{P}}$$
$$\text{R}\text{e}\text{c}\text{a}\text{l}\text{l}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{E}\text{P}\text{S}\text{T}\text{I}\text{I}=\frac{\text{T}\text{P}}{\text{T}\text{P}\:+\:\text{F}\text{N}}$$
$$\text{S}\text{u}\text{c}\text{c}\text{e}\text{s}\text{s}\:\text{r}\text{a}\text{t}\text{e}\:\text{o}\text{f}\:\text{S}\text{T}\text{I}\:\text{i}\text{n}\text{f}\text{o}\text{r}\text{m}\text{a}\text{t}\text{i}\text{o}\text{n}\:\text{c}\text{o}\text{n}\text{c}\text{e}\text{a}\text{l}\text{i}\text{n}\text{g}=\frac{\text{M}\text{a}\text{n}\text{u}\text{a}\text{l}\text{l}\text{y}\:\text{v}\text{e}\text{r}\text{i}\text{f}\text{i}\text{e}\text{d}\:\text{m}\text{a}\text{s}\text{k}\text{e}\text{d}\:\text{E}\text{M}\text{R}\text{s}\:\text{r}\text{e}\text{c}\text{o}\text{r}\text{d}\text{s}}{\text{M}\text{a}\text{s}\text{k}\text{e}\text{d}\:\text{E}\text{M}\text{R}\text{s}\:\text{r}\text{e}\text{c}\text{o}\text{r}\text{d}\text{s}}$$

True positives (TP) were patients correctly identified as having STI information, while false positives (FP) were patients incorrectly identified as having STI information. True negatives (TN) were patients correctly identified as not having STI information, and false negatives (FN) were patients incorrectly identified as not having STI information. Precision, the ratio of TP to the total predicted positives (TP + FP), indicates the accuracy of positive predictions. Recall, the ratio of TP to all actual positives (TP + FN), measures the model’s ability to identify actual positive cases.

After continuously refining the EPSTII rules to achieve higher precision and recall rates, we applied the EPSTII rules to the remaining EMRs to extract patients with STI information.

Privacy protection strategies

Based on our analysis of the identified information, we have developed privacy protection strategies aimed at preventing the unnecessary and inadvertent exposure of sensitive data during the RWE analysis process. With input from experts, we decided to replace each identified keyword, along with the ten characters before and after it, with asterisks (*) to de-identify sensitive STI information, thus ensuring patient privacy is protected.

Reviewers attempted to identify patients with STI within the privacy-masked sample to evaluate the effectiveness of the privacy protection strategies, as indicated by the formula presented above. In instances where it is necessary to use STI-related information, we systematically estimated the frequency of patient identity and privacy information usage to assess the risk of unnecessary privacy exposure.

Ethical considerations

This study received approval from the Medical Ethics Committee of Nanfang Hospital, Southern Medical University (approval number: NFEC-2019-213), with a waiver for patient informed consent due to its retrospective design. Additionally, it was approved by the China Office of Human Genetic Resources for Data Preservation Application (approval number: 2021-BC0037). All methods in this study were carried out in accordance with the relevant ethical guidelines and regulations and all experimental procedures involving human subjects, animals, or other regulated materials adhered to the ethical standards and guidelines outlined by the relevant governing bodies.

Results

Patient inclusion

After removing duplicate entries and applying a 50% entry ratio, 3,233,174 patients from various age groups who visited between January 1, 2010 and December 31, 2020 were selected. Notably, removing duplicates reduced the sample size from 10,539,648 to 6,466,357 due to multiple diagnostic records for individual patients. Among the sample, 3,102,028 patients had diagnosis records and 1,735,799 patients had complete medical records. The detail distribution of the EMRs sub-datasets and the workflow of our study is shown in Fig. 3.

Fig. 3
figure 3

Workflow chart. The initial database had 11,759,139 patients. After selecting time span, removing duplicates, and applying a 50% entry ratio, the final sample consisted of 3,233,174 observations. Admission records were preprocessed by NLP resulting in 8 sub-datasets. EPSTII was used to extract patients with STI information. 1000 patients with STI information and 1000 patients without STI information were randomly selected to calculate precision and recall rate. Expert guidance helped refine the EPSTII rules. Once the protocol performed satisfactorily, EPSTII was applyed to the 2000 randomly selected EMRs to calculate the success rate of privacy protection strategy.

Regex search pattern

After using the grid search method to train the word2vec model, the following keywords are obtained (Table 1. Taking gonorrhea and syphilis as examples, more details can be found in the Supplementary Material 2).

Table 1 Top 20 keywords generated by word2vec. The 3 columns are English translation, original Chinese text, and its relativity index.

Based on the above dictionary, we investigated Chinese EMRs and, incorporating input from experts, textbooks, guidelines, and relevant literature, developed the EPSTII. Given the unique characteristics of medical record writing in various hospitals, these rules underwent iterative refinement, adjustment, validation, and continuous improvement. All regular expression search patterns were developed through expert meetings and discussions.

To implement regex, we utilized R software (version 4.2.2) to extract STI privacy information from each sub-dataset. In the following example, “disease_history” represents the sub-dataset of preprocessed admission records, and “disease_name” is the field containing the name of the disease. STI_privacy has all the patients’ records that have the disease in the search patterns.

STI_privacy = disease_history[grep(‘软下疳|软疳|软性下疳|披衣菌|衣原体|淋病|淋菌|性病性淋巴肉芽肿|生殖支原体|人型支原体|解脲支原体|梅毒|硬下疳|非淋菌性尿道炎|非淋菌尿道炎|念珠菌|(乙|丙).{0,4}肝|三阳|单纯疱疹|生殖器疱疹|生殖器溃疡|艾滋|尖锐湿疣|软疣|水疣|阴虱|蟹虱|滴虫性阴道炎|阴道滴虫|毛滴虫’, disease_history$disease_name, fixed = FALSE)]

The regular expression rules enable us to determine the proportion of patients with a history of STI across various sub-datasets of the EMRs and to quantify the frequency of STI information.

Number of STI information

Following preliminary investigations, we applied the EPSTII to the EMRs of 3,233,174 patients. Table 2 presents the distribution of STI information in diagnosis records from the 19 hospitals.

Table 2 Distribution of patients with STI information in 19 hospitals. The 4 columns are city area, total number of patients, total patients with STI information, and their proportions.

We also plot the distribution and proportion of patients with STI in each year, as shown in Figs. 4 and 5.

Fig. 4
figure 4

Enrolled number of patients with STI (blue) and total patients (orange) from 2010 to 2020. Both data lines increased annually starting from 2010, peaking in 2018 and then declined.

Fig. 5
figure 5

The proportion of enrolled patient with STI in total enrolled patients from 2010 to 2020. The proportion remained in a fix range, from 3–5%.

In Fig. 4, the number of total patients and the number of patients with STI share similar patterns across a 10-year time span, indicating a relative constant proportion relationship, which is supported by Fig. 5. The proportion shown in Table 2; Fig. 5 demonstrate the reliability of our sample data.

Table 3 lists the total number of patients, the volume of identified STI information, and their corresponding proportions in each EMRs sub-dataset.

Table 3 Patient identified by EPSTII. From left to right, the columns display the total number of patients, the number of identified patient number, and their corresponding proportions.

The number of patients with STI identified through diagnosis was 148,856 cases, accounting for 4.8% of the patients in the diagnosis sub-dataset. The number of patients identified through disease history reached 39,586 individuals, representing 2.4% of the patients in the disease history sub-dataset. The number of patients with STI information leakage identified through chief complaints was 35 cases, making up 0.06% of the patients in the chief complaints sub-dataset.

Test and exam results contain information such as the concentration of Hepatitis B e Antigen or Treponema Pallidum Antibody, while they do not state whether patients have STI or not. Therefore, we treated them as not containing STI information.

Frequency of recognition

A single patient can generate multiple records in the EMRs system per visit. Consequently, individual EMRs were divided into separate records based on visits, reflecting the actual EMRs storage used in RWE studies. Table 4 shows the frequency of STI information identification across different sub-datasets of Chinese EMRs. As with any patient’s record, STI can be widely identified in each sub-dataset, primarily concentrating in diagnostic records and medical record texts.

Table 4 Frequency of STI information identification. From left to right, the columns display the total number of records, the frequency of identified STI, and their corresponding proportions.

In terms of privacy leakage, a total number of 681,942 STI records had been indentified from the diagnosis sub-datset with an overall recognition rate of 1.3%. In disease history records, 73,725 STI information could be extracted from 16,742,579 records. In family history records, 13,136 STI information could be extracted from 7,400,955 records.

Evaluation matrix

We selected precision and recall rates as our primary measurement criteria. During the manual curation and verification process, 2,500 records from patients with STI information and 2,500 records from patients without STI information were randomly chosen from the disease history datset and reviewed by two independent medical experts to determine whether STI information had been correctly identified or ignored. After multiple trainings of our protocol, the precision rate of EPSTII is 98.3%, and the recall rate is 96.8%, which means that out of the 2,500 pieces of information extracted from STI patients, 42 do not contain STI-related information, while out of the 2,500 pieces of information extracted from non-STI patients, 80 contain STI-related information.

Privacy protection

After optimizing the performance of EPSTII to a satisfactory level, 2,000 complete long text entries were randomly extracted from the medical records. EPSTII was then used to identify keywords related to STI. Under the guidance of experts, we decided to replace each identified keyword, along with 10 characters before and after it, with asterisks (*) to mask sensitive STI information and protect patient privacy. This method minimized the risk of inferring patient STI information from the EMRs. Reviewers attempted to identify patients with STI in the privacy-masked samples, and we calculated the success rate of the privacy protection strategy based on their results. In the 2,000 records with masked STI information, 35 patient was identified as carrying STI, achieving a success rate of 98.25%.

Here is one example of identifying STI information from part of medical records.

入院诊断: 1.急性淋巴细胞白血病(B-ALL)2.乙型肝炎病毒携带者

诊疗经过: 血细胞分析(静脉血): 白细胞计数11.77 × 10^9/L, 血小板计数485 × 10^9/L, 中性粒细胞8.93 × 10^9/L。。。。。。乙肝三对半: 乙型肝炎病毒表面抗原阳性(+);乙型肝炎病毒e抗体阳性(+); 乙型肝炎病毒核心抗体阳性(+)。EB病毒、巨细胞病毒阴性。腹部B超示肝稍大。.

{Admission diagnosis: (1) Acute Lymphoblastic Leukemia (B-ALL) (2) Hepatitis B virus carrier.

Diagnostic course: blood cell analysis (venous blood): white blood cell count 11.77 × 10^9/L, platelet count 485 × 10^9/L, neutrophil count 8.93 × 10^9/L…… Hepatitis B markers: hepatitis B virus surface antigen positive (+); hepatitis B virus e antibody positive (+); hepatitis B virus core antibody positive (+). EB virus, cytomegalovirus negative. Abdominal B-Ultrasound shows slightly enlarged liver.}

入院诊断: 1.急性淋巴细胞白血病(B-AL***********带者

诊疗经过: 血细胞分析(静脉血): 白细胞计数11.77 × 10^9/L, 血小板计数485 × 10^9/L, 中性粒细胞8.93 × 10^9/L。。。。。。**************抗原阳性***********抗体阳性***********心抗体阳性(+)。EB病毒、巨细胞病毒阴性。腹部B超示肝稍大。

{Admission diagnosis: 1. Acute Lymphoblastic Leukemia (B-AL*********** carrier.

Diagnostic course: blood cell analysis (venous blood): white blood cell count 11.77 × 10^9/L, platelet count 485 × 10^9/L, neutrophil count 8.93 × 10^9/L…… ************** antigen positive *********** antibody positive *********** core antibody positive (+). EB virus, cytomegalovirus negative. Abdominal B-Ultrasound shows slightly enlarged liver.}

It should also be noted that during the manual verification process by reviewers, 120 records were found to have potentially masked key data that did not contain STI privacy information, which could reduce the usability of the data. The false positive rate is 6%.

Discussion

Principal findings

In analyzing the extracted STI information and the frequency of STI identification, we observed that the quantity and frequency of STI information extracted from patients’ chief complaints were significantly lower than those extracted from disease histories and family histories. This suggests that patients who are aware of their own diseases are often reluctant to disclose information about STI to their doctors due to embarrassment and fear of a negative social image. This highlights the importance of protecting the privacy of STI information. On the other hand, patients who are unaware of their condition may not instinctively associate discomfort in their reproductive area with an STI, indicating that the negative stigma surrounding STI is deeply ingrained in people’s minds, further emphasizing the subconscious aversion to STI. Although the extraction proportion is even smaller in the case of toxic history, this is due to a large number of records denying the source of toxins, which have lowered the overall proportion.

After multiple optimizations, EPSTII achieved over 95% precision and recall rates. This success is primarily attributed to our continuous refinement of the keyword list and the integration of expert opinions, online resources, and literature information in each iteration. Currently, EPSTII encompasses a total of 31 keywords related to STI, providing comprehensive coverage of these infections.

Our privacy protection approach has achieved a 98.25% success rate, ensuring that the privacy of patients with STI is safeguarded to the greatest extent possible. However, due to improvements of our protocol happening in admission records, which did not include medication information, we failed to consider drugs specifically targeting STI.

Additionally, we acknowledge the potential for false positives. This issue arises because we have adopted a uniform length for privacy protection policies across different diseases, which may result in the incorrect concealment of some information and thereby limit the application scope of our model. Moreover, the sample size of 5,000 is relatively small compared to the vast amount of data in the database, contributing to our high accuracy. This limitation is due to the need for manual verification by experts for each extracted data point, which restricts our ability to extract and test a larger number of samples. In future studies, we plan to adopt a more targeted privacy protection method, taking into account drugs for treating STI, to precisely protect private information without obscuring other crucial information about patients. Additionally, we will expand the number of validation datasets to obtain more comprehensive and realistic outcomes.

This study uses extraction frequency to quantify privacy risks, but relying solely on this metric does not provide a comprehensive assessment of the actual risk of privacy leakage. The dataset used in this study is based on the CRDS database, which includes patient data from 19 medical institutions. Many other related studies have been conducted on this dataset. To explore the reliability of our privacy protocol, we will conduct an analysis of STI sensitive information exposure in datasets used in other types of studies in the future. On one hand, we will assess the risk of STI information being included in the data, and on the other hand, we will evaluate the sensitivity of researchers to relevant STI information and their actual exposure to it through a questionnaire survey. This will provide a more accurate risk assessment.

This study is limited to the Chinese context and the protection of privacy for sexually transmitted diseases in data sharing between Chinese medical institutions. In future work, we will refer to the General Data Protection Regulation (GDPR) to explore the applicability of the strategy in international settings and the results of English-language corpora. For cross-linguistic data sharing, we typically refer to the Observational Health Data Sciences and Informatics (OHDSI) terminology system, and while converting terms, we will combine it with our privacy protocol to enable sharing. This will also be a key focus of our upcoming work.

To our knowledge, this study is the first large-scale investigation into privacy leakage and STI identification in Chinese EMRs. Our study highlights the varying levels of awareness regarding privacy protection among EMRs users. While reliable methods for safeguarding privacy are extensively discussed in the literature29,30, their actual implementation within healthcare institutions remains crucial, particularly concerning sensitive STI information31,32. The 2021 Personal Information Protection Law of the People’s Republic of China defines the use of personal privacy data; however, prior efforts to establish robust standards for patient privacy protection in RWE research have been limited33. This study is the first in China to utilize a national-level EMRs database for a quantitative assessment of the risk of privacy exposure related to STI. Our goal is to enhance protection strategies and provide a foundation for future research and the improvement of privacy protocols.

Chinese EMRs present unique challenges in extracting STI-related privacy information due to specific terminology and data standards. Traditional methods relying on diagnostic codes are limited by incomplete records and inconsistent documentation. Physicians may not always record STI information as a diagnosis, leading to lower recall rates and potential bias. The complexity of coding systems in Chinese EMRs further complicates patient identification. The EPSTII method addresses these limitations by providing more precise results, significantly enhancing the extraction of STI information with high accuracy from both patient-based and visit-based records.

Our results show that traditional fixed-___location data masking procedures can lead to unnecessary exposure of privacy information. For example, patients’ nucleic acid test results, recorded in sub-datasets, are challenging to conceal, significantly impacting patients’ privacy. Accurate identification of STI information is crucial for comprehensive privacy protection34. The EPSTII method is the first to identify STI information across the entire Chinese EMRs system. It employs practical data desensitization techniques, such as data invalidation, data offset, and symmetric encryption, to prevent the misuse of private data. We have determined the optimal length of additional concealed text to retain most medical information. Quantifying the identification frequency of STI information helps researchers use EMRs wisely to avoid unnecessary privacy leakage, highlighting the richness of privacy information in EMRs. Although identification frequency alone cannot fully determine privacy leakage risk, these results underscore the importance of robust data desensitization strategies.

Limitations

Overall, this study underscores the importance of assessing privacy leakage risks and serves as a reference for effective privacy protection in Chinese EMRs. However, several limitations must be acknowledged. Firstly, the privacy risk estimates in our case study are primarily derived from the EMRs of a renal disease database. Despite the existence of official guidelines for EMRs documentation in China, discrepancies arise between the CRDS and other data networks in terms of data structure and operational environment. Therefore, specific protocols and variables should be optimized for broader applicability.

Additionally, our privacy protection method occasionally masks excessive information, such as the hepatitis B antibody levels of healthy patients, leading to the concealment of harmless data as well. This limitation stems from our reliance on regular expressions to enhance processing speed rather than employing more complex NLP techniques. Future research will explore the application of machine learning methods to better safeguard patient privacy.

Conclusions

Identifying an effective and practical method for protecting privacy information in EMRs is both significant and beneficial. We have demonstrated the feasibility of applying the EPSTII method to EMRs from 19 hospitals across various regions. We believe that EPSTII can provide valuable insights for patient inclusion in any research related to infectious diseases using Chinese EMRs. Our protocols, specifically tailored for the Chinese EMRs system, enable accurate and comprehensive identification and extraction of STI-related data, ensuring effective protection. Compared to traditional methods for including STI information, the EPSTII method yields more comprehensive results.