
Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp)

Additional Info

Field Value
Identifier https://hdl.handle.net/10261/373675
Author
Project
Name Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp)
Description

[Description of methods used for collection/generation of data] The corpus statistics and methods are explained in the following article, currently under review: Federico Ortega-Riba, Leonardo Campillos-Llanos, Doaa Samy (2025) "Lexical Simplification in Spanish Texts For Patients: The Complex Word Identification Task".

[Methods for processing the data] Complex words (CW) were annotated manually, following the criteria defined in the annotation guideline described in the companion article.

We are very grateful to the following colleagues, who carried out a second revision of a subset of the texts so that inter-annotator agreement could be computed: Ana R. Terroba-Reinares (Fundación Rioja Salud) [ORCID: 0000-0003-1582-6481]; Ana Valverde-Mateos (Unidad de Terminología Médica, Real Academia Nacional de Medicina de España) [ORCID: 0000-0003-1610-0770].

The corpus is made up of 225 texts in Spanish annotated with complex words (CW). It contains three text types: consent forms (75 texts), clinical trial announcements (75 texts) and patient information documents, also referred to as patient information leaflets (75 texts). This resource is intended for training and evaluating models and for running experiments on complex word identification in Spanish medical texts.

Themes
  • Science and technology
  • Healthcare
Tags
Creation date 2024-12-04T00:00:00
Last updated 2025-11-05T12:45:28
Refresh rate
Languages English
Geographic coverage Spain
Geographic coverage (International)
Time coverage
  • From 2024-01-01 to 2024-07-31
Effective resource
Related resources
Normative
    Institute
    Publisher Publicador - Digital.CSIC
    Observations

    Recommended citation: Ortega Riba, Federico; Campillos-Llanos, Leonardo; 2024; Corpus for Complex Word Identification in Medical Spanish Texts (CWI-Med-Sp) [Dataset]; DIGITAL.CSIC; https://doi.org/10.20350/digitalCSIC/16706