CLARA-MeD corpus

Data and Resources

Interoperability

RDF/XML (DCAT-AP), application/rdf+xml

Additional Info

Field	Value
Identifier	http://hdl.handle.net/10261/269887
Author	Leonardo Campillos-Llanos; Ana Rosa Terroba Reinares; Sofía Zakhir Puig; Ana Valverde Mateos; Adrián Capllonch Carrión
Project	PID2020-116001RA-C33
Name	CLARA-MeD corpus
Description	A collection of 24 298 pairs of professional and simplified texts (>96 million tokens): 1) drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens). This is a benchmark for medical text simplification. The files were last downloaded in February 2022.
Themes	Science and technology; Healthcare
Tags	Comparable corpus; Parallel sentences; Medical text simplification; Biomedical natural language processing
Creation date	2022-05-19T00:00:00
Last updated	2022-05-19T00:00:00
Refresh rate
Languages	Spanish; English
Geographic coverage	Spain
Geographic coverage (International)	Europe and United States of America
Time coverage	From 2022-05-15 to 2022-05-15
Effective resource
Related resources	https://github.com/lcampillos/CLARA-MeD/corpus
Normative
Institute	Instituto de Lengua, Literatura y Antropología (ILLA), CSIC
Publisher	Digital.CSIC
Observations	Recommended citation for this dataset: Campillos-Llanos, Leonardo; Terroba Reinares, Ana Rosa; Zakhir Puig, Sofía; Valverde Mateos, Ana; Capllonch Carrión, Adrián; 2022; CLARA-MeD corpus [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/14644