Parametric schema inference for massive JSON datasets

View/Open
Parametric_schema.pdf (234.0Kb)
Date
2019
Dewey
Programming, software, data organization
Subject
JSON; Schema inference; Map-reduce; Spark; Big data collections
Journal issue
The VLDB Journal
Volume
28
Number
4
Publication date
08-2019
Article pages
497-521
Publisher
Springer
DOI
http://dx.doi.org/10.1007/s00778-018-0532-7
URI
https://basepub.dauphine.fr/handle/123456789/19935
Collections
  • LAMSADE : Publications
Author
Baazizi, Mohamed-Amine
2544 Laboratoire de Recherche en Informatique [LRI]
Colazzo, Dario
989 Laboratoire d'analyse et modélisation de systèmes pour l'aide à la décision [LAMSADE]
Ghelli, Giorgio
87913 Dipartimento di Informatica [Pisa]
Sartiani, Carlo
262138 Dipartimento di Matematica Informatica ed Economia [DiMIE]
Type
Article accepted for publication or published
Abstract (EN)
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a reliable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
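
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: a map step that types each JSON value, and an associative fusion step (a reduce) whose behaviour is controlled by a parameter. The type representation, the function names (`infer`, `fuse`, `infer_schema`, `show`) and the two parameter values `"kind"` and `"label"` are assumptions made for illustration, not the authors' definitions. With `"kind"`, all record types are fused into one (more concise, fields may become optional); with `"label"`, only records with identical field names are fused (more precise).

```python
from functools import reduce

# A schema is a union: a frozenset of types. A type is a tuple:
#   ("null",) ("bool",) ("num",) ("str",)
#   ("arr", union_of_element_types)
#   ("rec", ((field_name, union, is_optional), ...))  # sorted by name

def infer(value, equiv):
    """Map step: type a single JSON value (hypothetical representation)."""
    if value is None:
        return frozenset([("null",)])
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return frozenset([("bool",)])
    if isinstance(value, (int, float)):
        return frozenset([("num",)])
    if isinstance(value, str):
        return frozenset([("str",)])
    if isinstance(value, list):
        elems = reduce(lambda a, b: fuse(a, b, equiv),
                       (infer(v, equiv) for v in value), frozenset())
        return frozenset([("arr", elems)])
    fields = tuple((k, infer(v, equiv), False) for k, v in sorted(value.items()))
    return frozenset([("rec", fields)])

def key(t, equiv):
    """Types with the same key are fused; this is where the parameter acts."""
    if t[0] == "rec" and equiv == "label":
        return ("rec", tuple(name for name, _, _ in t[1]))
    return t[0]

def merge(t1, t2, equiv):
    """Merge two types that share the same key."""
    if t1[0] == "arr":
        return ("arr", fuse(t1[1], t2[1], equiv))
    if t1[0] == "rec":
        f1 = {n: (u, o) for n, u, o in t1[1]}
        f2 = {n: (u, o) for n, u, o in t2[1]}
        fields = []
        for n in sorted(f1.keys() | f2.keys()):
            if n in f1 and n in f2:
                u = fuse(f1[n][0], f2[n][0], equiv)
                fields.append((n, u, f1[n][1] or f2[n][1]))
            else:  # present on one side only: the field becomes optional
                u, _ = f1.get(n) or f2.get(n)
                fields.append((n, u, True))
        return ("rec", tuple(fields))
    return t1  # identical atomic types

def fuse(u1, u2, equiv):
    """Reduce step: combine two unions, merging equivalent members."""
    groups = {}
    for t in list(u1) + list(u2):
        k = key(t, equiv)
        groups[k] = merge(groups[k], t, equiv) if k in groups else t
    return frozenset(groups.values())

def infer_schema(docs, equiv):
    return reduce(lambda a, b: fuse(a, b, equiv),
                  (infer(d, equiv) for d in docs), frozenset())

def show(u):
    """Render a union as a readable string ('?' marks optional fields)."""
    def one(t):
        if t[0] == "arr":
            return "[" + show(t[1]) + "]"
        if t[0] == "rec":
            return "{" + ", ".join(
                f"{n}{'?' if o else ''}: {show(fu)}" for n, fu, o in t[1]) + "}"
        return t[0]
    return " + ".join(sorted(one(t) for t in u))

docs = [{"a": 1, "b": "x"}, {"a": None}, {"a": 2, "b": "y"}]
print(show(infer_schema(docs, "kind")))   # {a: null + num, b?: str}
print(show(infer_schema(docs, "label")))  # {a: null} + {a: num, b: str}
```

Because the fusion step in this sketch combines two partial schemas into one without depending on input order, the same pair of functions could in principle be handed to a distributed map and reduce (for instance over a Spark RDD) rather than to `functools.reduce`; that wiring is left out here to keep the example self-contained.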

 Content on this site is licensed under a Creative Commons 2.0 France (CC BY-NC-ND 2.0) license.