Abstract
Recent advancements in Large Language Models (LLMs) suggest imminent commercial applications of such AI systems, which serve as gateways for interacting with technology and the accumulated body of human knowledge. However, whether and how ChatGPT exhibits bias in behavior detection, and thus raises ethical concerns in decision-making, deserves to be explored. This study therefore aims to evaluate and compare the detection accuracy and gender bias of ChatGPT and traditional machine learning (ML) methods such as Naïve Bayes, SVM, Random Forest, and XGBoost. The experimental results demonstrate that, despite its lower detection accuracy, ChatGPT exhibits less gender bias. Furthermore, we observed that removing gender label features led to an overall reduction in bias for traditional ML methods, whereas for ChatGPT the gender bias decreased when labels were provided. These findings shed new light on fairness enhancement and bias mitigation in management decisions when using ChatGPT for behavior detection.
Introduction
In recent years, chatbots have become increasingly popular as they provide a convenient and efficient means to interact with computer systems. One prominent example is ChatGPT, an advanced language model developed by OpenAI. While ChatGPT possesses remarkable capabilities in generating coherent and contextually relevant responses, does it exhibit gender bias when used in decision-making processes? The impact of bias in chatbots like ChatGPT is a matter of concern, as these systems are increasingly being integrated into applications that detect behavior, including customer support, information retrieval, spam detection, and resume filtering. Biased responses from such systems can perpetuate stereotypes, reinforce societal prejudices, and contribute to the marginalization of certain groups. Therefore, it is crucial to identify, understand, and address gender bias in ChatGPT to ensure fairness, inclusivity, and ethical use of these powerful language models.
In the realm of decision-making, fairness means that there is no bias or favoritism based on inherent or acquired characteristics of individuals or groups during the decision-making process. However, similar to humans, algorithms are also prone to bias, thereby rendering their decisions susceptible to perceptions of “unfairness.” A prominent illustration of bias in algorithmic decision-making is evident in the case of “Correctional Offender Management Profiling for Alternative Sanctions” (COMPAS), a tool employed within the United States judicial system to aid in parole decisions. COMPAS utilizes software algorithms to assess the risk of individuals reoffending, thereby providing judges with insights to determine whether an offender should be released or incarcerated. Subsequent investigations into the functionality of COMPAS have unearthed biases against African Americans, with the software being more inclined to assign higher risk scores to individuals of African American descent when compared to Caucasian counterparts possessing similar characteristics.
In the field of machine learning, a mainstream task is text classification. Extant studies focus on the examination of biases against groups explicitly mentioned within the text content. For instance, studies such as Dixon et al. (2018), Park et al. (2018), and Zhang et al. (2020) delve into the discriminatory behavior exhibited by text classification models when encountering texts containing demographic terms such as “gay” or “Muslim.” In these instances, the biases observed are directly linked to the demographic attributes of individuals, which are overtly present within the text, a phenomenon commonly referred to as explicit biases. Studies of explicit biases in text classification have garnered significant attention within the research community, and the topic has been extensively explored and analyzed.
In addition to explicit biases, text classification models have the potential to exhibit biases towards texts authored by individuals belonging to specific population groups. These biases, which we term implicit biases, have garnered less attention and thus remain relatively understudied (Liu et al., 2021). However, it is essential to recognize that biases in text can manifest in subtler and more covert manners. Even in instances where the text does not explicitly reference any particular groups or individuals, its content may inadvertently provide insights into the author’s demographic characteristics. Studies conducted by Coulmas (2013) and Preoţiuc-Pietro and Ungar (2018) shed light on this phenomenon, highlighting the strong correlation between the language style employed in a text (including word choice and tone) and the author’s demographic attributes such as age, gender, and race. As text classifiers are exposed to such data, they can learn to associate the content with the author’s demographic information, potentially leading to biased decisions that disadvantage certain groups.
In this study, our primary focus lies in investigating the presence and impact of implicit gender biases within ChatGPT and traditional machine learning models, thereby contributing to a deeper understanding of the implicit gender biases that can emerge within text-based models. Specifically, we conducted experiments using two publicly available real-world text datasets, namely hate speech and toxic comments datasets. Then we compared ChatGPT with traditional ML methods, including Naïve Bayes, SVM, Random Forest, and XGBoost. The experimental results demonstrate that despite its shortcomings in detection accuracy, ChatGPT exhibits a lower degree of gender bias. Furthermore, we observed that removing demographic gender label features led to an overall reduction in bias levels for traditional ML methods, while for ChatGPT, the gender bias levels decreased when labels were provided. Addressing the gender bias issue in ChatGPT, this study provides in-depth empirical analysis and comparisons, offering valuable insights for further enhancing the fairness of NLP models and mitigating biases.
Methods
Fairness depends on context; thus, a large variety of fairness metrics exists. Within the realm of computer science research, scholars have introduced over 20 distinct metrics of fairness, highlighting the complexity and multidimensionality of this concept (Verma and Rubin, 2018; Žliobaitė, 2017). However, it is important to note that there is no single fairness metric that is universally applicable across all scenarios and contexts (Foster et al., 2016; Verma and Rubin, 2018). Given this inherent challenge, and in order to provide a more comprehensive understanding of the experimental results, we present the outcomes using two prominent fairness metrics: “equalized odds” and “error rate equality difference”. These metrics have gained considerable recognition within the research community and are frequently employed in assessing and quantifying fairness in algorithmic decision-making systems. By reporting on these two metrics, we aim to provide a meaningful evaluation of the fairness considerations within our experimental framework.
Error rate equality difference
The equalized odds criterion inspires the error rate equality difference metric, which uses the variation in error rates across groups to measure the extent of unintended bias in the model, similar to the equality gap metric used by Beutel et al. (2017).
Using the test set, we calculate the false positive rate (FPR) and false negative rate (FNR) on the entire test set, as well as the same metrics on each subset of the data belonging to a specific group t, \({{FPR}}_{t}\) and \({{FNR}}_{t}\). A fairer model will have similar values across all groups, approaching the equalized odds ideal, where \({FPR}={{FPR}}_{t}\) and \({FNR}={{FNR}}_{t}\) for all groups t. Wide variation among these values across groups indicates high unintended bias.
Error rate equality difference quantifies the extent of the per-group variation (and therefore the extent of unintended bias) as the sum of the differences between the overall false positive or negative rate and the per-group values, as shown in Equations 1 and 2.
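Written out with t indexing the groups defined above, these take the form

\[{FPED}=\sum _{t\in T}\left|{FPR}-{{FPR}}_{t}\right|\qquad (1)\]

\[{FNED}=\sum _{t\in T}\left|{FNR}-{{FNR}}_{t}\right|\qquad (2)\]

where T denotes the set of demographic groups (here, the gender groups).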
Evaluation metrics
We use Accuracy, Precision, Recall, and F1 Score to measure overall performance. To evaluate group fairness, we measure the equality differences (ED) of the false positive/negative rates (Dixon et al., 2018). Existing studies show that FP/FN-ED is an ideal choice for evaluating fairness in classification tasks (Czarnowska et al., 2021). Taking the FPR as an example, we calculate the equality difference as \({FPED}=\sum _{d\in G}\left|{{FPR}}_{d}-{FPR}\right|\), where G is the set of gender groups and d is a gender group (e.g., female). We report the sum of the \({FPED}\) and \({FNED}\) scores and denote it as “SUM-ED”. This metric sums up the differences between the rates within specific gender groups and the overall rates.
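As an illustrative sketch (not the study's original code), the equality differences can be computed from binary labels, binary predictions, and a per-example gender attribute as follows; the function and variable names are our own:

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

def equality_differences(y_true, y_pred, groups):
    """Compute FPED, FNED, and SUM-ED across demographic groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    overall_fpr, overall_fnr = error_rates(y_true, y_pred)
    fped = fned = 0.0
    for g in np.unique(groups):
        mask = groups == g
        fpr_g, fnr_g = error_rates(y_true[mask], y_pred[mask])
        fped += abs(fpr_g - overall_fpr)  # contribution of group g to FPED
        fned += abs(fnr_g - overall_fnr)  # contribution of group g to FNED
    return fped, fned, fped + fned  # last value is SUM-ED
```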
Results
Data
We investigated various text classification tasks and datasets that included different author demographic populations to analyze whether ChatGPT exhibits implicit gender biases. Specifically, we used two publicly available real-world datasets, namely the Multilingual Twitter Corpus (MTC) introduced by Huang et al. (2020) and the Jigsaw Unintended Bias in Toxicity Classification dataset released on Kaggle.
The MTC dataset (The hate speech dataset) consists of multilingual tweets used for hate speech detection tasks. Each tweet is annotated as either “hate speech” or “non-hate speech” and is associated with four demographic attributes of the authors: race, gender, age, and country. We utilized the English corpus with gender attributes in this dataset, which consists of two categories: male and female.
The Jigsaw dataset (The toxic comments dataset) contains text from personal comments that could be perceived as toxic (offensive, vulgar, or abusive). The text of individual comments is located in the comment_text column. Each comment in the dataset is labeled with a toxicity target (0/1), and the model is expected to predict the target toxicity. Additionally, the dataset also includes identity information of the text authors, especially gender attribute labels.
Table 1 shows descriptive statistics for the two datasets; as can be seen, the gender distribution in both is well balanced.
Experiment
In this study, we utilize the ChatGPT API, specifically the GPT-3.5-turbo and GPT-4 models, to develop an automated inquiry program (the prompt is in the following format: “Determine if the following paragraphs contain hate speech (only answer ‘1’ or ‘0’, where 1 indicates hate speech and 0 indicates no hate speech): <the specific paragraph in the dataset>.”). The primary objective of this program is to assess hate speech and toxic comments. Each comment containing potentially offensive content is presented to ChatGPT as input, and we prompt ChatGPT to determine whether it is hate speech or a toxic comment. The result is simplified into a binary representation, with 0 denoting the absence of hate speech or a toxic comment, and 1 indicating its presence. Subsequently, we record and store the outcomes for analysis.
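A minimal sketch of such an inquiry loop is shown below, using the openai Python client. The default model identifier ("gpt-3.5-turbo") is one of the two models evaluated, and the omission of retry and rate-limit handling is a simplification on our part rather than a detail reported in the study.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("Determine if the following paragraphs contain hate speech "
          "(only answer '1' or '0', where 1 indicates hate speech and "
          "0 indicates no hate speech): {text}")

def classify_with_chatgpt(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Send one comment to the chat model and map its reply to a binary label."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep answers as deterministic as possible
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    answer = resp.choices[0].message.content.strip()
    return 1 if answer.startswith("1") else 0  # 1 = hate speech/toxic, 0 = not
```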
Data preprocessing
We considered both under-sampling of the majority class and over-sampling of the minority class to create a more balanced dataset. This approach helps ensure that the conclusions are not affected by dataset imbalance and that the model evaluation is more reliable. We used random sampling for both datasets, ensuring that the proportion of positive and negative samples was consistent. Specifically, we randomly sampled 4000 positive and 4000 negative samples from each dataset for the experiments.
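A minimal sketch of this balancing step, assuming the comments and binary labels are held in a pandas DataFrame with a column named label (the column name and random seed are illustrative):

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, label_col: str = "label",
                    n_per_class: int = 4000, seed: int = 42) -> pd.DataFrame:
    """Randomly draw an equal number of positive and negative examples."""
    pos = df[df[label_col] == 1].sample(n=n_per_class, random_state=seed)
    neg = df[df[label_col] == 0].sample(n=n_per_class, random_state=seed)
    # Concatenate and shuffle so positives and negatives are interleaved
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed).reset_index(drop=True)
```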
To establish a comparative framework, we also employ traditional machine learning techniques (including Naïve Bayes, SVM, Random Forest, and XGBoost) as benchmarks. Initially, documents are lowercased and tokenized using NLTK (Bird and Loper, 2004); we then randomly partition the dataset into distinct training and testing sets. The training set is used to train each machine learning model, enabling it to learn patterns and features associated with hate speech and toxic comments. Following the training phase, the model’s predictive capabilities are evaluated on the testing set.
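A sketch of this baseline pipeline is given below. The study does not specify the feature representation, so TF-IDF features over NLTK tokens, the 80/20 split, and the default hyperparameters are assumptions on our part.

```python
import nltk
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from xgboost import XGBClassifier

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

def tokenize(text: str):
    """Lowercase and tokenize a document with NLTK."""
    return nltk.word_tokenize(text.lower())

def train_baselines(texts, labels, seed=42):
    """Train the four baseline classifiers on TF-IDF features; return test labels and predictions."""
    X_train_txt, X_test_txt, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels)
    vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
    X_train = vectorizer.fit_transform(X_train_txt)
    X_test = vectorizer.transform(X_test_txt)
    models = {
        "Naive Bayes": MultinomialNB(),
        "SVM": LinearSVC(),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "XGBoost": XGBClassifier(random_state=seed),
    }
    predictions = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions[name] = model.predict(X_test)
    return y_test, predictions
```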
To ensure a systematic assessment, we further categorized the experiments into two distinct types, namely “Yes_label” and “No_label”. Within the “Yes_label” category, we intentionally furnished ChatGPT with the gender labels of the text authors as additional input (the prompt is in the following format: “Determine if the following paragraphs contain hate speech (only answer ‘1’ or ‘0’, where 1 indicates hate speech and 0 indicates no hate speech): The <man/woman> said that, <the specific paragraph in the dataset>.”), while the traditional machine learning models were trained with the gender labels of the text authors included as a feature. Conversely, in the “No_label” type, neither ChatGPT nor the traditional machine learning models were provided with any information regarding the gender labels associated with the text authors. This segregation allows for a comparative analysis of the performance of the two approaches under controlled conditions, with and without the availability of gender label information, as sketched below.
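The two prompt variants can be constructed along the lines of the following sketch, built directly around the templates quoted above; the mapping of the dataset’s gender attribute onto the words "man"/"woman" is our own illustrative choice.

```python
from typing import Optional

PROMPT_HEAD = ("Determine if the following paragraphs contain hate speech "
               "(only answer '1' or '0', where 1 indicates hate speech and "
               "0 indicates no hate speech): ")

def build_prompt(text: str, gender: Optional[str] = None) -> str:
    """Return the 'Yes_label' prompt when a gender label is supplied, else the 'No_label' prompt."""
    if gender is None:  # "No_label" setting: no author information is revealed
        return PROMPT_HEAD + text
    term = "man" if gender.lower() in ("male", "man", "m") else "woman"
    return PROMPT_HEAD + f"The {term} said that, {text}"  # "Yes_label" setting
```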
Outcome
Firstly, we conducted experiments on Dataset 1 (the hate speech detection task). Figure 1 presents the average experimental outcomes over multiple runs of both ChatGPT and the traditional machine learning methods. We measured Accuracy, Precision, Recall, and F1 Score to assess prediction accuracy, as well as fairness evaluation metrics including the false positive and false negative rates, FPED, FNED, and SUM-ED; the detailed results are presented in Fig. 2 and Table 2.
Based on the experimental results on the MTC dataset (the hate speech dataset), we can draw the following findings (see Fig. 1, Table 2, and Fig. 2). Firstly, in classifying English hate speech, ChatGPT performs lower than Naive Bayes, SVM, Random Forest, and XGBoost in terms of Accuracy, Recall, and F1 Score, but it exhibits relatively higher Precision. Several studies have pointed out that ChatGPT may adopt a conservative approach when performing detection tasks, particularly those related to detecting harmful content. For instance, some studies have shown that ChatGPT may display certain biases when detecting harmful content, especially in cases involving politically sensitive topics or comments from specific demographic groups (Zhu et al., 2023; Li et al., 2024; Deshpande et al., 2023; Clews, 2024; Zhang, 2024). Additionally, due to the model’s training data and methods, some biases may be introduced unintentionally, causing the model to behave more conservatively in certain situations (Hou et al., 2024). Secondly, in terms of bias evaluation metrics such as FPED, FNED, and SUM-ED, ChatGPT demonstrates relatively lower gender bias compared with Naive Bayes, SVM, Random Forest, and XGBoost. Finally, when the gender label feature is removed, Naive Bayes (SUM-ED: 0.0819 to 0.0721), SVM (SUM-ED: 0.0726 to 0.0687), Random Forest (SUM-ED: 0.0723 to 0.0721), and XGBoost (SUM-ED: 0.0691 to 0.0682) generally show a decrease in bias level. However, GPT-4 (SUM-ED: 0.0135 to 0.0553) and GPT-3.5 (SUM-ED: 0.0175 to 0.0650) show an increase in bias level when gender attributes are not provided.
Similarly, we conducted the same experiment on the Jigsaw dataset (the toxic comments dataset) and reached similar conclusions (see Fig. 3, Table 3, and Fig. 4). Firstly, in classifying toxic English comments, ChatGPT performs lower than Naive Bayes, SVM, Random Forest, and XGBoost in terms of Accuracy, Precision, Recall, and F1 Score. Secondly, in terms of discrimination evaluation metrics such as FPED and FNED, ChatGPT demonstrates relatively lower gender bias compared with Naive Bayes, SVM, and XGBoost (but not Random Forest). Finally, when the gender label feature is removed, Naive Bayes (SUM-ED: 0.3186 to 0.2377), SVM (SUM-ED: 0.1472 to 0.1282), Random Forest (SUM-ED: 0.1028 to 0.0860), and XGBoost (SUM-ED: 0.1632 to 0.1407) generally show a decrease in bias level, while GPT-4 (SUM-ED: 0.1025 to 0.1323) and GPT-3.5 (SUM-ED: 0.1280 to 0.1640) show an increase in bias level when gender attributes are not provided.
In general, ChatGPT exhibits lower accuracy levels than its traditional machine learning counterparts; however, an aspect that warrants attention is the relatively low degree of bias demonstrated by ChatGPT, particularly when provided with demographic attribute feature labels. We further endeavored to provide a plausible explanation for these results. Regarding accuracy, ChatGPT’s lower recognition accuracy can be attributed to a lack of sufficient learning on hate speech/toxic comment datasets. For traditional machine learning, numerous experiments have indicated that a viable approach to reducing bias is delabeling (Mehrabi et al., 2022; Corbett-Davies et al., 2023). However, for ChatGPT, no research to date has explored the impact of demographic gender labels on its performance. In this experiment, the results demonstrate that when ChatGPT is provided with accurate demographic gender labels and subsequently tasked with determining whether a statement qualifies as hate speech or a toxic comment, the degree of bias decreases. One hypothesis is that ChatGPT incorporates a “built-in resistance” to sensitive information such as gender within its design, thereby “consciously” mitigating the influence of this bias. We asked ChatGPT about this, and it confirmed that its algorithms actively counteract gender bias, which could explain the gap between the known and unknown gender-attribute settings. Some studies indicate that ChatGPT demonstrates built-in resistance when processing and generating text, striving to avoid the generation and dissemination of gender bias (Fang et al., 2024). Moreover, we tend to believe that this “built-in resistance” may be related to ChatGPT’s robustness. Wang et al. (2023) conducted a thorough evaluation of the robustness of ChatGPT from the adversarial and out-of-distribution (OOD) perspectives, and the results indicate that ChatGPT shows consistent advantages on most adversarial and OOD classification and translation tasks. However, this built-in resistance cannot eliminate gender bias completely. For example, some studies using artificially constructed test cases found that ChatGPT falls short in terms of gender equality and exhibits consistency issues across different versions (Geiger et al., 2024; Fang et al., 2024).
Discussion
In this study, a series of experiments were meticulously conducted on diverse datasets, enabling a comprehensive evaluation of the gender bias issue of ChatGPT and traditional machine learning methods.
The findings of our study unveil an intriguing insight: ChatGPT exhibits lower detection accuracy than its traditional machine learning counterparts; however, an aspect that warrants attention is the relatively low degree of gender bias demonstrated by ChatGPT, particularly when provided with demographic attribute feature labels. This observation holds significant implications for the practical implementation of ChatGPT within enterprise settings. For instance, in scenarios where enterprise software intends to employ ChatGPT as an auxiliary tool for the identification of hate speech and toxic comments, it can be reasonably posited that ChatGPT possesses a comparably low degree of gender bias. Furthermore, to optimize the performance of ChatGPT in terms of gender bias reduction, it is advisable to furnish the model with demographic attribute feature labels, thereby enabling it to make more informed decisions with lower bias.
This research sheds light on the potential of leveraging ChatGPT to reduce gender bias within enterprise applications while simultaneously emphasizing the importance of augmenting its capabilities through the provision of relevant demographic attribute information. We recommend that OpenAI further enhance the model by improving its ability to automatically detect and mitigate biases related to sensitive attributes such as gender or race, and introduce external regulatory audits that could help maintain objectivity in assessing fairness and bias mitigation. In addition, optimizing model architectures and fine-tuning parameters to further mitigate biases, integrating real-time feedback mechanisms in enterprise applications, and exploring multimodal data integration can enhance the robustness of bias detection. Future research should focus on expanding the scope of bias analysis beyond gender to include race, age, and other demographic attributes, employing larger and more diverse datasets to validate findings across different real-world scenarios. Additionally, we encourage researchers to conduct broader studies on other LLMs, such as LLaMA, Gemini, or Mistral, focusing on issues of discrimination related to gender, race, and other factors.
Data availability
Public datasets are used in this study and the links are: https://github.com/xiaoleihuang/Multilingual_Fairness_LREC. https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification.
References
Beutel A, Chen J, Zhao Z, Chi EH (2017) Data decisions and theoretical implications when adversarially learning fair representations. arXiv Preprint arXiv:1707.00075
Bird S, Loper E (2004) NLTK: the natural language toolkit. The 42nd Annual Meeting of the Association for Computational Linguistics
Clews R (2024) Understanding how ChatGPT’s political bias impacts hate speech and offensive language detection: A content analysis. Young Res. 8(1):178–191
Corbett-Davies S, Gaebler JD, Nilforoshan H, Shroff R, Goel S (2023) The measure and mismeasure of fairness. J. Mach. Learn. Res. 24(1):14730–14846
Coulmas F (2013) Sociolinguistics: The study of speakers’ choices. Cambridge University Press
Czarnowska P, Vyas Y, Shah K (2021) Quantifying social biases in NLP: A generalization and empirical comparison of extrinsic fairness metrics. Trans. Assoc. Comput. Linguist. 9:1249–1267
Deshpande A, Murahari V, Rajpurohit T, Kalyan A, Narasimhan K (2023) Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv Preprint arXiv:2304.05335
Dixon L, Li J, Sorensen J, Thain N, Vasserman L (2018) Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
Fang X, Che S, Mao M, Zhang H, Zhao M, Zhao X (2024) Bias of AI-generated content: an examination of news produced by large language models. Sci Rep 14(1):5224
Foster I, Ghani R, Jarmin R S, Kreuter F, Lane J (2016) Big data and social science: A practical guide to methods and tools. Chapman and Hall/CRC
Geiger RS, O’Sullivan F, Wang E, Lo J (2024) Asking an AI for salary negotiation advice is a matter of concern: Controlled experimental perturbation of ChatGPT for protected and non-protected group discrimination on a contextual task with no clear ground truth answers. arXiv Preprint arXiv:2409.15567
Hou H, Leach K, Huang Y (2024) ChatGPT Giving Relationship Advice–How Reliable Is It? Proceedings of the International AAAI Conference on Web and Social Media, 610–623
Huang X, Xing L, Dernoncourt F, Paul MJ (2020) Multilingual twitter corpus and baselines for evaluating demographic bias in hate speech recognition. arXiv Preprint arXiv:2002.10361
Li L, Fan L, Atreja S, Hemphill L (2024) “HOT” ChatGPT: the promise of ChatGPT in detecting and discriminating hateful, offensive, and toxic comments on social media. ACM Trans. Web 18(2):1–36
Liu H, Jin W, Karimi H, Liu Z, Tang J (2021) The authors matter: understanding and mitigating implicit bias in deep text classification. arXiv Preprint arXiv:2105.02778
Mehrabi N, Morstatter F, Saxena N, Lerman K, Galstyan A (2022) A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 54(6):1–35
Park JH, Shin J, Fung P (2018) Reducing gender bias in abusive language detection. arXiv Preprint arXiv:1808.07231
Preoţiuc-Pietro D, Ungar L (2018) User-level race and ethnicity predictors from Twitter text. In Proceedings of the 27th International Conference on Computational Linguistics, 1534–1545
Verma S, Rubin J (2018) Fairness definitions explained. Proceedings of the International Workshop on Software Fairness, 1–7. https://doi.org/10.1145/3194770.3194776
Wang J, Hu X, Hou W, Chen H, Zheng R, Wang Y, Yang L, Huang H, Ye W, Geng X (2023) On the robustness of chatgpt: An adversarial and out-of-distribution perspective. arXiv Preprint arXiv:2302.12095
Zhang G, Bai B, Zhang J, Bai K, Zhu C, Zhao T (2020) Demographics should not be the reason of toxicity: mitigating discrimination in text classifications with instance weighting. arXiv Preprint arXiv:2004.14088
Zhang J (2024) ChatGPT as the Marketplace of Ideas: Should Truth-Seeking Be the Goal of AI Content Governance? arXiv Preprint arXiv:2405.18636
Zhu Y, Zhang P, Haq EU, Hui P, Tyson G (2023) Can chatgpt reproduce human-generated labels? A study of social computing tasks. arXiv Preprint arXiv:2304.10145
Žliobaitė I (2017) Measuring discrimination in algorithmic decision making. Data Min. Knowl. Discov. 31(4):1060–1089
Acknowledgements
This research was supported by the National Natural Science Foundation of China (72322020, 72374226), and the Guangdong Basic and Applied Basic Research Foundation (2020B1515020031, 2023B1515020073).
Author information
Contributions
Conceptualization: Ji Wu and Doris Chenguang Wu; methodology: Ji Wu and Yaokang Song; formal analysis: Yaokang Song; investigation: Ji Wu and Doris Chenguang Wu; visualization: Yaokang Song; writing-original draft preparation: Yaokang Song; writing-review and editing: Ji Wu and Doris Chenguang Wu; project administration: Ji Wu and Doris Chenguang Wu. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Informed consent
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.