Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML)

Data and Resources

README.txt (txt)
dataset_test.json (json)
dataset_train.json (json)

Interoperability

RDF/XML (DCAT-AP) (application/rdf+xml)

Additional Info

Field	Value
Identifier	http://hdl.handle.net/10261/389309
Author	Patrick Styll; Leonardo Campillos-Llanos
Project
Name	Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML)
Description	This dataset was created by gathering human-authored corpora from several public health sites and generating additional data with three LLMs: GPT-4o, Mistral-7B and Llama3-1. It includes biomedical texts in English, Spanish, German and French. The current version contains 50% AI-generated and 50% human-written texts. We used the following sources:
Cochrane Library: a database of meta-analyses and systematic reviews of updated results of clinical studies. We used abstracts of systematic reviews in all four languages.
European Clinical Trials (EUCT) and European Medicines Agency (EMA): the EMA is the agency that supervises and evaluates pharmaceutical products of the European Union (EU). We downloaded parallel data from public assessment reports (EPARs) for 12 new medicinal products, and data from clinical trial protocols and eligibility criteria. We ensured the data were published only from January 2025 to date. The goal was to gather data that might not have been used to train the LLMs in our experiments.
European Food Safety Authority (EFSA): this website provides a comprehensive range of data about food consumption and chemical/biological monitoring. We selected only the topics relevant to our goals, 51 in total. Processing: we manually split articles with a word count above 1350 and manually checked their correctness and alignment in all languages.
European Vaccination Information Portal (EVIP): it provides up-to-date information on vaccines and vaccination.
The factsheets are available in all languages and consist of 20 texts each.
Immunize: Immunize.org (formerly known as the Immunization Action Coalition) is a U.S.-based organization dedicated to providing comprehensive immunization resources for healthcare professionals and the public. Vaccine Information Statements (VISs) have been translated into several languages, but not every language has all VISs. They are provided as PDFs, with 25 in Spanish, French and English, but only 21 in German. Only PDFs available in all four languages were used.
Migration und Gesundheit - German Ministry of Health (BFG): this portal provides multilingual health information tailored for migrants and refugees. Gesundheit für alle is a PDF guide to the German healthcare system, available in Spanish, English and German. Processing: two topics shorter than 100 words were merged with the following topic to preserve context.
Orphadata (INSERM): a comprehensive knowledge base about rare diseases and orphan drugs, in reusable and high-quality formats, released in 12 official EU languages. We gathered definitions, signs and symptoms, and phenotypes for 4389 rare diseases in English, German, Spanish and French. Processing: since the definitions are roughly the same size and similar in format, we grouped 5 definitions together to make the text per topic longer.
PubMed (National Library of Medicine): we downloaded abstracts available in English, Spanish, French and German.
Wikipedia: a free, web-based, collaborative multilingual encyclopedia project; we selected (bio)medical contents available in English, German, Spanish and French. To ensure that the texts were not automatically generated, we used only articles dating from before the release of ChatGPT, i.e. before 30 November 2022.
Processing: some data cleaning was necessary; we also removed all topics with fewer than 5 words and split those with more than 9 sentences into equally long parts. We made sure that the resulting parts contain a minimum of 100 words, and we kept only those contents or topics that exist in all three languages.
[Description of methods used for collection/generation of data]
The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge Fernández-García, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content". Under review.
[Methods for processing the data]
- Web scraping of HTML content and PDF files from the health websites.
- Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks), and homogenization of text length.
- Generation of corresponding contents with three large language models: GPT-4o, Mistral-7B and Llama3-1.
- Formatting of contents as JSON.
[Files]
1) JSON files: these are separated into TRAIN and TEST. Each file contains a list of hashes (key-value objects), one per text, and each hash contains the following fields:
• text: the textual content.
• data_source: the source repository of the text.
• filename: the name of the original file from which the data were sourced.
• source: label indicating whether the text is human-written (HUMAN) or which LLM generated it ("gpt4o", "mistral" or "llama").
• language: the language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
• target: a binary label coding whether the text was written by humans ("0") or AI ("1").
• ratio: the proportion of the text that was created with AI: "0.5" for AI-generated texts, and "null" for human texts.
The corpus is made up of 13292 comparable and parallel texts in four languages: German, English, Spanish and French.
The total token count is 3795449. This resource is intended for training and evaluating models that detect medical texts created with generative artificial intelligence.
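As a minimal sketch of how the JSON files described above might be read and sanity-checked, the following Python snippet assumes each file is a JSON array of objects carrying the seven listed fields. The helper names (`load_records`, `validate_record`, `summarize`) and the local file path are hypothetical, and whether `target` is stored as a string or an integer is an assumption, so the checks normalize it with `str()`.

```python
import json
from collections import Counter
from pathlib import Path

# Field names and language codes as listed in the dataset description.
EXPECTED_FIELDS = {"text", "data_source", "filename", "source",
                   "language", "target", "ratio"}
LANGUAGES = {"de", "en", "es", "fr"}


def load_records(path):
    """Read one split; assumes the file holds a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    missing = EXPECTED_FIELDS - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if record.get("language") not in LANGUAGES:
        problems.append(f"unexpected language: {record.get('language')!r}")
    # str() so that both "0"/"1" and 0/1 encodings are accepted (assumption).
    if str(record.get("target")) not in {"0", "1"}:
        problems.append(f"unexpected target: {record.get('target')!r}")
    return problems


def summarize(records):
    """Count records per target label and per language."""
    return {
        "target": Counter(str(r.get("target")) for r in records),
        "language": Counter(r.get("language") for r in records),
    }


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the file was downloaded.
    train_path = Path("dataset_train.json")
    if train_path.exists():
        records = load_records(train_path)
        flagged = [r for r in records if validate_record(r)]
        print(f"{len(records)} records, {len(flagged)} with problems")
        print(summarize(records))
```

The per-target counts returned by `summarize` give a quick way to compare a downloaded split against the stated 50% AI-generated and 50% human-written balance.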
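The length-homogenization steps described above (dropping very short topics, splitting long ones into equally long parts, and requiring a minimum word count) could be sketched roughly as follows. This is not the authors' code: the naive regex sentence splitter, the function names, the way the number of parts is chosen, and the decision to apply the 100-word minimum to unsplit topics as well are all assumptions. Only the thresholds (5 words, 9 sentences, 100 words) come from the description.

```python
import math
import re

MIN_TOPIC_WORDS = 5    # topics with fewer words are dropped
MAX_SENTENCES = 9      # topics with more sentences are split
MIN_PART_WORDS = 100   # resulting texts shorter than this are discarded


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (assumption)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def equal_parts(sentences, max_sentences=MAX_SENTENCES):
    """Split into the fewest near-equal parts of at most max_sentences each."""
    if not sentences:
        return []
    n_parts = math.ceil(len(sentences) / max_sentences)
    base, extra = divmod(len(sentences), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts take one additional sentence.
        size = base + (1 if i < extra else 0)
        parts.append(sentences[start:start + size])
        start += size
    return parts


def process_topic(text):
    """Apply the three length rules to one topic and return the kept texts."""
    if len(text.split()) < MIN_TOPIC_WORDS:
        return []
    sentences = split_sentences(text)
    if len(sentences) > MAX_SENTENCES:
        chunks = [" ".join(part) for part in equal_parts(sentences)]
    else:
        chunks = [text.strip()]
    # Assumption: the minimum word count applies to every kept text.
    return [c for c in chunks if len(c.split()) >= MIN_PART_WORDS]
```

For example, a topic of 20 sentences would be divided into three parts of 7, 7 and 6 sentences, and each part is kept only if it reaches 100 words.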
Themes	Science and technology; Healthcare
Tags	AI-generated Text; Generative AI; Biomedical natural language processing; Biomedical corpus
Creation date	2025-05-14T00:00:00
Last updated	2025-09-09T07:15:08
Refresh rate
Languages	Spanish; English; French; German
Geographic coverage
Geographic coverage (International)	Europe
Time coverage	From 2025-03-15 to 2025-03-15
Effective resource
Related resources	https://doi.org/10.1007/978-3-032-04354-2_5 https://github.com/Padraig20/MedAID-ML https://doi.org/10.20350/digitalCSIC/17276
Normative
Institute	Instituto de Lengua, Literatura y Antropología (ILLA), CSIC
Publisher	Publicador - Digital.CSIC
Observations	Recommended citation: Styll, Patrick; Campillos-Llanos, Leonardo; 2025; Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML) [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/17276