Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML)
Additional Info
Field | Value |
---|---|
Identifier | http://hdl.handle.net/10261/389309 |
Author | |
Project | |
Name | Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML) |
Description |
This dataset was created by gathering human-authored corpora from several public health sites and generating additional data with three different LLMs: GPT-4o, Mistral-7B and Llama3-1. We included biomedical texts in English, Spanish, German and French. The current version contains 50% AI-generated and 50% human-written texts. The following are the data we used:
[Description of methods used for collection/generation of data]
The corpus statistics and methods are explained in the following article: Patrick Styll, Leonardo Campillos-Llanos, Jorge Fernández-García, Isabel Segura-Bedmar (2025) "MedAID-ML: A Multilingual Dataset of Biomedical Texts for Detecting AI-Generated Content". Under review.
[Methods for processing the data]
- Web scraping of data from HTML content and PDF files available on the health websites.
- Postprocessing and cleaning of data (e.g., removal of redundant white spaces or line breaks), and homogenization of text length.
- Generation of corresponding contents by means of generative AI using three large language models: GPT-4o, Mistral-7B and Llama3-1.
- Formatting of contents into JSON format.
[Files]
1) JSON files: These are separated into TRAIN and TEST. Each file contains a list of hashes, one per text, and each hash contains the following fields:
• "text": the textual content.
• "data_source": the source repository of the text.
• "filename": the name of the original file from which the data were sourced.
• "source": a label indicating whether the text is human-written (HUMAN) or which LLM was used to generate it ("gpt4o", "mistral" or "llama").
• "language": the language code of the text: German ("de"), English ("en"), Spanish ("es") or French ("fr").
• "target": a binary label coding whether the text was written by humans ("0") or by AI ("1").
• "ratio": the proportion of the text that was created with AI: "0.5" for AI-generated texts, and "null" for human texts.
The corpus is made up of 13292 comparable and parallel texts in four languages: German, English, Spanish and French. The total token count is 3795449 tokens. This resource is aimed at training and evaluating models to detect medical texts created by means of generative artificial intelligence. |
Themes |
|
Tags | |
Creation date | 2025-05-14T00:00:00 |
Last updated | 2025-09-09T07:15:08 |
Refresh rate | |
Languages |
|
Geographic coverage |
|
Geographic coverage (International) | Europe |
Time coverage |
|
Effective resource | |
Related resources | |
Normative |
|
Institute | |
Publisher | Publicador - Digital.CSIC |
Observations |
Recommended citation: Styll, Patrick; Campillos-Llanos, Leonardo; 2025; Medical Artificial Intelligence text Detection in Multilingual settings (MedAID-ML) [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/17276 |
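
The JSON layout given in the Description field (a list of hashes with "text", "data_source", "filename", "source", "language", "target" and "ratio" fields) could be read along the following lines. This is only a minimal sketch under stated assumptions, not the authors' tooling: it assumes each file is a top-level JSON array of objects, that "target" may be stored either as a string or as an integer, and the function names are hypothetical. The record does not give the actual file names, so the path is left to the caller.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load one split (TRAIN or TEST), assumed to be a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON list of records")
    return records


def is_ai_generated(record):
    """Return True when the binary target marks the text as AI-generated.

    The description quotes the labels as "0" and "1", so both string and
    integer encodings are accepted here (an assumption about the files).
    """
    return int(record["target"]) == 1


def summarize(records):
    """Count texts per (language, source) pair and per target class."""
    by_language_source = Counter((r["language"], r["source"]) for r in records)
    by_target = Counter(
        "ai" if is_ai_generated(r) else "human" for r in records
    )
    return by_language_source, by_target
```

Calling `summarize(load_records(path))` on a split returns two counters; comparing the "ai" and "human" counts is one way to check the stated 50% balance on a given file.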