Language and cultural bias in AI: comparing the performance of large language models developed in different countries on Traditional Chinese Medicine highlights the need for localized models

Zhu, Lingxuan; Mou, Weiming; Lai, Yancheng; Lin, Junda; Luo, Peng

doi:10.1186/s12967-024-05128-4

Letter to the Editor
Open access
Published: 29 March 2024

Lingxuan Zhu, Weiming Mou, Yancheng Lai, Junda Lin & Peng Luo (ORCID: orcid.org/0000-0002-8215-2045)

Journal of Translational Medicine volume 22, Article number: 319 (2024)

To the Editor,

Large language models (LLMs) are AI systems trained on vast amounts of text data to understand human language and interact with humans in natural language. ChatGPT, a well-known LLM developed by OpenAI, has demonstrated powerful capabilities across many domains, including passing standardized medical qualification exams such as the United States Medical Licensing Examination (USMLE) [1]. Studies suggest its potential to enhance medical education and aid clinical decision-making [2]. Although ChatGPT can answer questions in multiple languages, the vast majority of its training data, like that of many Western-developed LLMs, is in English [3], and many studies of its capabilities in the medical field are based on a Western medical context. Traditional Chinese Medicine (TCM), rooted in millennia of Chinese wisdom, differs fundamentally from Western medicine. Because Chinese makes up a relatively small proportion of the training corpus, models developed in Western countries, such as ChatGPT, may lack sufficient exposure to TCM concepts and terminology, casting doubt on their applicability to TCM. Chinese companies have developed their own LLMs, such as Baidu’s Ernie Bot series, ZHIPU AI’s GLM-4, and Alibaba’s Qwen-max, which leveraged extensive Chinese language data in their training [4]. We compared the accuracy of LLMs developed in China and in the West in answering TCM-related questions to explore the capabilities of LLMs in understanding and applying domain-specific knowledge across different languages and cultural backgrounds.

The National Medical Licensing Examination for Traditional Chinese Medicine is the qualification entrance exam for TCM practitioners in China and consists of multiple-choice questions divided into four units. We used 140 questions from the first unit of the 2022 examination, which assesses fundamental TCM subjects such as Fundamentals of TCM, Diagnostics, Chinese Materia Medica, and Herbal Formulas. Questions related to Health Laws and Regulations were excluded as they are unrelated to TCM. Eight prominent LLMs were included in our study: four from Chinese companies (Ernie Bot [Baidu], Ernie Bot-4 [Baidu], Qwen-max [Alibaba], and GLM-4 [ZHIPU AI]) and four from Western companies (ChatGPT-3.5 [OpenAI], ChatGPT-4 [OpenAI], Claude-2 [Anthropic], and Gemini-pro [Google]). To ensure that the models correctly identified the questions as TCM-related, we added “In traditional Chinese medicine theory” to the beginning of each question and submitted the questions and options together to the LLMs. Responses were collected on January 2, 2024 via API with default parameters. All interactions were conducted in Chinese. To mitigate the impact of inherent randomness in LLMs on the evaluation, each model generated three answers per question. A question was scored as correct if the model answered it correctly in at least two of the three attempts.
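The prompting and scoring rule described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the prompt layout, the function names, and the toy answers are assumptions, and the study submitted its prompts in Chinese rather than with the English prefix shown here.

```python
# Hypothetical sketch of the prompt construction and the "correct twice or more" rule.
# The prefix is the English rendering quoted in the letter; the study ran in Chinese.
TCM_PREFIX = "In traditional Chinese medicine theory"
MIN_CORRECT = 2  # a question counts as correct if at least 2 of 3 answers match the key


def build_prompt(question: str, options: dict[str, str]) -> str:
    """Prepend the TCM prefix and list the options under the question (assumed layout)."""
    lines = [f"{TCM_PREFIX}, {question}"]
    lines += [f"{label}. {text}" for label, text in sorted(options.items())]
    return "\n".join(lines)


def is_correct(answers: list[str], key: str, min_correct: int = MIN_CORRECT) -> bool:
    """Return True if enough of the repeated answers match the answer key."""
    return sum(answer == key for answer in answers) >= min_correct


def accuracy(responses: dict[str, list[str]], answer_key: dict[str, str]) -> float:
    """Fraction of questions scored as correct under the repeated-answer rule."""
    scored = [is_correct(responses[qid], answer_key[qid]) for qid in answer_key]
    return sum(scored) / len(scored)


# Toy data (invented): three questions, three answers each.
toy_key = {"q1": "A", "q2": "C", "q3": "B"}
toy_responses = {
    "q1": ["A", "A", "B"],  # 2 of 3 correct, counts as correct
    "q2": ["C", "D", "E"],  # 1 of 3 correct, counts as incorrect
    "q3": ["B", "B", "B"],  # 3 of 3 correct, counts as correct
}
toy_accuracy = accuracy(toy_responses, toy_key)
```

Raising `min_correct` to 3 gives the stricter criterion under which all three answers must be correct.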

The performance of the eight LLMs on the TCM exam is shown in Fig. 1A. The average accuracy of LLMs developed by Chinese companies (78.4%) was higher than that of those developed by Western companies (35.9%; Wilcoxon test, p < 0.05). Among the Chinese models, Qwen-max achieved the highest accuracy (86.4%), followed by GLM-4 (80.7%) and Ernie Bot-4 (78.6%). Most of the LLMs developed in China outperformed the LLMs developed in the West (Fig. 1B; McNemar's test with Bonferroni correction). All Chinese models passed the exam with accuracy rates above 60%, whereas all Western models failed. We further evaluated whether the models were better at answering questions from a particular subject, and the results showed no significant difference for most models. However, for Ernie Bot and Qwen-max, there was a significant difference when requiring at least 2 or 3 correct responses (Additional file 1: Table S1).
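For readers unfamiliar with the pairwise comparison, a minimal sketch of an exact McNemar test with Bonferroni adjustment is given below. It assumes the exact (binomial) form of the test and a caller-supplied number of comparisons; the letter does not state which variant of the test or how many comparisons were used, and the function names and toy data are hypothetical.

```python
from math import comb


def discordant_counts(a_correct: list[bool], b_correct: list[bool]) -> tuple[int, int]:
    """Count questions where exactly one of two models was correct.

    Returns (b, c): b = model A right and model B wrong, c = the reverse.
    Both lists must score the same questions in the same order.
    """
    b = sum(x and not y for x, y in zip(a_correct, b_correct))
    c = sum(y and not x for x, y in zip(a_correct, b_correct))
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni(p: float, n_comparisons: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)


# Toy example (invented): per-question correctness for two models on four questions.
model_a = [True, True, False, True]
model_b = [False, True, False, False]
b, c = discordant_counts(model_a, model_b)
p_raw = mcnemar_exact(b, c)
p_adj = bonferroni(p_raw, n_comparisons=28)  # 28 = all pairs of 8 models, an assumption
```

McNemar's test suits this design because every model answers the same questions, so only the questions on which two models disagree carry information about their difference.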

Our study highlights the impact of language and cultural biases on LLMs’ performance in the context of TCM. The difference in performance may stem from Western LLMs being primarily trained on English datasets, lacking deep familiarity with Chinese culture, language nuances, and TCM concepts. This challenge extends beyond the Chinese context, as evidenced by ChatGPT-3.5's failure on the Japanese medical licensing examination [5]. To address these limitations, we suggest developing localized LLMs trained on local data or prioritizing multilingual and cross-cultural training data. In China, domestically developed LLMs can provide more accurate interpretations of medical knowledge based on the characteristics of the Chinese population and current clinical practice in China. Such models can also understand the linguistic nuances, idioms, colloquialisms, and cultural aspects of Chinese, offering services that better meet user needs within China's sociocultural context. Moreover, independently developed LLMs can improve data security and privacy protection while helping to preserve and transmit Chinese culture.

In conclusion, our exploration underscores the importance of developing LLMs tailored to specific languages and cultural contexts, particularly in the field of TCM. With continuous localization optimizations and advancements in AI, China's LLMs could make significant contributions across various fields, unlocking endless possibilities.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

AI: Artificial intelligence
LLM: Large language model
TCM: Traditional Chinese Medicine
USMLE: United States Medical Licensing Examination

References

[1] Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2:e0000198. https://doi.org/10.1371/journal.pdig.0000198.
[2] Zhu L, Mou W, Chen R. Can the ChatGPT and other large language models with internet-connected database solve the questions and concerns of patient with prostate cancer and help democratize medical knowledge? J Transl Med. 2023;21:269. https://doi.org/10.1186/s12967-023-04123-5.
[3] gpt-3/dataset_statistics/languages_by_word_count.csv at master openai/gpt-3. In: GitHub. https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv. Accessed 24 Nov 2023.
[4] ERNIE Bot: Baidu’s knowledge-enhanced large language model built on full AI stack technology. http://research.baidu.com/Blog/index-view?id=183. Accessed 3 Jan 2024.
[5] Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ. 2023;9:e48002. https://doi.org/10.2196/48002.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Lingxuan Zhu, Weiming Mou and Yancheng Lai have contributed equally to this work and share first authorship.

Authors and Affiliations

Department of Oncology, Zhujiang Hospital, Southern Medical University, 253 Industrial Avenue, Guangzhou, 510282, China
Lingxuan Zhu, Yancheng Lai, Junda Lin & Peng Luo
Department of Urology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, 100 Haining Road, Hongkou District, Shanghai, China
Weiming Mou

Contributions

Lingxuan Zhu, Weiming Mou: Conceptualization, investigation, writing—original draft, methodology, literature review, writing—review and editing, visualization. Yancheng Lai: methodology, literature review, writing—review and editing, visualization. Junda Lin: conceptualization, investigation, writing—original draft, literature review. Peng Luo: conceptualization, literature review, project administration, supervision, resources, writing—review and editing, funding acquisition. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Peng Luo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential competing interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1.

Performance of Ernie Bot, Ernie Bot-4, Qwen-max, GLM-4, ChatGPT-3.5, ChatGPT-4, Claude-2 and Gemini-pro on the National Medical Licensing Examination for Traditional Chinese Medicine (TCM). Statistical significance was assessed using Fisher’s exact test. If statistical differences were observed, subgroup analyses were performed using Fisher’s exact test with Bonferroni correction for multiple comparisons to evaluate the accuracy rate of one subject against the combined accuracy rate of the remaining three subjects within the same model.
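The subject-versus-rest comparison described for Table S1 could be computed along the following lines. This is a sketch under stated assumptions: it uses a two-sided Fisher's exact test that sums all tables no more probable than the observed one, it multiplies by the number of subjects for the Bonferroni step, and the per-subject counts are invented placeholders, not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def subject_vs_rest(counts: dict[str, tuple[int, int]], subject: str) -> float:
    """Bonferroni-adjusted p-value for one subject against the other subjects combined.

    counts maps subject name -> (number correct, number of questions).
    """
    correct, total = counts[subject]
    rest_correct = sum(c for name, (c, _) in counts.items() if name != subject)
    rest_total = sum(t for name, (_, t) in counts.items() if name != subject)
    p = fisher_exact_two_sided(
        correct, total - correct, rest_correct, rest_total - rest_correct
    )
    return min(1.0, p * len(counts))  # one comparison per subject (assumption)


# Invented placeholder counts for a single model; NOT the study's data.
toy_counts = {
    "Fundamentals of TCM": (30, 35),
    "Diagnostics": (28, 35),
    "Chinese Materia Medica": (20, 35),
    "Herbal Formulas": (29, 35),
}
p_materia_medica = subject_vs_rest(toy_counts, "Chinese Materia Medica")
```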

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhu, L., Mou, W., Lai, Y. et al. Language and cultural bias in AI: comparing the performance of large language models developed in different countries on Traditional Chinese Medicine highlights the need for localized models. J Transl Med 22, 319 (2024). https://doi.org/10.1186/s12967-024-05128-4

Received: 18 March 2024
Accepted: 23 March 2024
Published: 29 March 2024
DOI: https://doi.org/10.1186/s12967-024-05128-4
