SCALPEL3: a scalable open-source library for healthcare claims databases

View/Open
1910.07045.pdf (1.383Mb)
Date
2019
Publisher city
Paris
Collection title
Cahier de recherche CEREMADE, Université Paris Dauphine-PSL
Link to item file
https://arxiv.org/abs/1910.07045
Dewey
Signal processing
Subject
Healthcare claims data; ETL; Large observational database; Concept extraction; Scalability; Reproducibility; Interactive data manipulation
URI
https://basepub.dauphine.fr/handle/123456789/20688
Collections
  • CEREMADE : Publications
Author
Bacry, Emmanuel
60 CEntre de REcherches en MAthématiques de la DEcision [CEREMADE]
Gaiffas, Stéphane
542130 Laboratoire de Probabilités, Statistiques et Modélisations [LPSM (UMR_8001)]
Leroy, Fanny
162303 Caisse Nationale d'Assurance Maladie
Morel, Maryan
89626 Centre de Mathématiques Appliquées - Ecole Polytechnique [CMAP]
Nguyen, Dinh Phong
89626 Centre de Mathématiques Appliquées - Ecole Polytechnique [CMAP]
Sebiat, Youcef
89626 Centre de Mathématiques Appliquées - Ecole Polytechnique [CMAP]
Sun, Dian
89626 Centre de Mathématiques Appliquées - Ecole Polytechnique [CMAP]
Type
Working paper
Item number of pages
14
Abstract (EN)
This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies through abstractions for concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has been used successfully on the SNDS database (see Tuppin et al. (2017)), a huge healthcare claims database that handles the reimbursement of almost all French citizens.

SCALPEL3 focuses on scalability, easy interactive analysis, and helpers for data flow analysis to accelerate studies performed on LODs. It consists of three open-source libraries based on Apache Spark. SCALPEL-Flattening denormalizes the LOD (only SNDS for now) by joining tables sequentially into one big table. SCALPEL-Extraction provides fast concept extraction from a big table such as the one produced by SCALPEL-Flattening. Finally, SCALPEL-Analysis supports interactive cohort manipulation, monitoring of cohort flow statistics, and building datasets for use with machine learning libraries. The first two provide a Scala API, while the last one provides a Python API that can be used in an interactive environment. Our code is available on GitHub.

SCALPEL3 successfully extracted complex concepts for studies such as Morel et al. (2017) or studies with 14.5 million patients observed over three years (corresponding to more than 15 billion healthcare events and roughly 15 terabytes of data) in less than 49 minutes on a small 15-node HDFS cluster. SCALPEL3 provides sharp interactive control of data processing through legible code, which helps to build fully reproducible studies and improves the maintainability and auditability of studies performed on LODs.
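The three stages named in the abstract (flattening by sequential joins, concept extraction, and cohort flow monitoring) can be illustrated with a minimal sketch. This is plain Python on toy data, not the SCALPEL3 API and not Spark: every table, column, code value, and function name below is a hypothetical stand-in chosen for illustration.

```python
# Illustrative sketch only: plain Python standing in for a Spark pipeline.
# All table names, column names, codes, and helpers are hypothetical.

def left_join(left, right, key):
    """Left-join two lists of row dicts on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Unmatched rows are kept, simply without the right-hand columns.
        for match in index.get(row[key], [{}]):
            joined.append({**row, **match})
    return joined

def flatten(central, others):
    """Denormalize: join the central table with each (table, key) in sequence."""
    flat = central
    for table, key in others:
        flat = left_join(flat, table, key)
    return flat

def extract_events(flat, column, codes, category):
    """Concept extraction: keep rows whose `column` value is in `codes`."""
    return [
        {"patient_id": r["patient_id"], "category": category,
         "value": r[column], "date": r["date"]}
        for r in flat
        if r.get(column) in codes
    ]

def cohort(events):
    """A cohort is represented here as the set of patients with an event."""
    return {e["patient_id"] for e in events}

def cohort_flow(steps):
    """Intersect cohorts step by step and record the size after each step."""
    current, sizes = None, []
    for name, members in steps:
        current = members if current is None else current & members
        sizes.append((name, len(current)))
    return sizes

# Toy normalized tables (hypothetical schema).
claims = [
    {"claim_id": 1, "patient_id": "A", "date": "2015-01-10"},
    {"claim_id": 2, "patient_id": "B", "date": "2015-02-03"},
    {"claim_id": 3, "patient_id": "A", "date": "2015-03-21"},
]
drugs = [{"claim_id": 1, "drug_code": "D1"}, {"claim_id": 2, "drug_code": "D2"}]
diagnoses = [{"claim_id": 2, "diagnosis_code": "X1"},
             {"claim_id": 3, "diagnosis_code": "X1"}]

flat = flatten(claims, [(drugs, "claim_id"), (diagnoses, "claim_id")])
all_patients = {r["patient_id"] for r in flat}
exposed = cohort(extract_events(flat, "drug_code", {"D1"}, "exposure"))
diagnosed = cohort(extract_events(flat, "diagnosis_code", {"X1"}, "outcome"))
flow = cohort_flow([("all", all_patients), ("exposed", exposed),
                    ("diagnosed", diagnosed)])
print(flow)
```

Representing cohorts as sets keeps each filtering step a simple intersection, so the number of patients remaining after every step can be reported alongside the final dataset.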


 Content on this site is licensed under a Creative Commons 2.0 France (CC BY-NC-ND 2.0) license.