CLARA-MeD corpus

Additional Info

Field Value
Identifier http://hdl.handle.net/10261/269887
Author
Project
Name CLARA-MeD corpus
Description

A collection of 24 298 pairs of professional and simplified texts (>96 million tokens): 1) Drug leaflets and summaries of product characteristics (10 211 pairs of texts, >82M words); 2) Cancer-related information summaries (201 pairs of texts, >3M tokens); and 3) Clinical trial announcements (5748 pairs of texts, 451 690 tokens). The dataset also contains a parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens). The corpus is intended as a benchmark for medical text simplification. The files were last downloaded in February 2022.

Themes
  • Science and technology
  • Healthcare
Tags
Creation date 2022-05-19T00:00:00
Last updated 2022-05-19T00:00:00
Refresh rate
Languages
  • Spanish
  • English
Geographic coverage Spain
Geographic coverage (International) Europe and United States of America
Time coverage
  • From 2022-05-15 02:00 to 2022-05-15 02:00
Effective resource
Related resources
Normative
    Institute
    Publisher Publicador - Digital.CSIC
    Observations

    Recommended citation for this dataset: Campillos-Llanos, Leonardo; Terroba Reinares, Ana Rosa; Zakhir Puig, Sofía; Valverde Mateos, Ana; Capllonch Carrión, Adrián; 2022; CLARA-MeD corpus [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/14644