DigiCosme Spring School 2015 on Data Management

May 18th to 21st at ENSTA ParisTech, Palaiseau

May 22nd at École Polytechnique

1. Schedule, Courses and Lecturers

Schedule

Monday 18: Massively Parallel Data Management

  • 13:00-13:30: welcome and coffee
  • 14:00-16:00: Lecture 1 by Guido Moerkotte
  • 16:00-16:30: break
  • 16:30-18:00: Lecture 1 by Guido Moerkotte

Tuesday 19: Workflows, Provenance, Reproducibility

  • 09:00-10:30: Lecture 2 by Sarah Cohen-Boulakia
  • 10:30-11:00: break
  • 11:00-12:00: Lecture 2 by Sarah Cohen-Boulakia
  • 12:30-13:30: lunch
  • 13:30-16:00: Lecture 3 by Dennis Shasha
  • 16:00-16:30: break
  • 16:30-18:00: Gong Show (student presentations)

Wednesday 20: Massively Parallel Data Management

  • 09:00-10:30: Lecture 4 by Volker Markl/Sebastian Schelter
  • 10:30-11:00: break
  • 11:00-12:30: Lecture 4 by Volker Markl/Sebastian Schelter
  • 12:30-14:00: lunch
  • 14:00-16:00: Lab session on Flink, organized by V. Markl & Sebastian Schelter
  • 16:00-16:30: break
  • 16:30-18:00: Gong Show (student presentations)

Thursday 21: Database Kernels for Data Exploration

  • 09:00-10:30: Lecture 5 by Stratos Idreos
  • 10:30-11:00: break
  • 11:00-12:30: Lecture 5 by Stratos Idreos
  • 12:30-14:00: lunch
  • Social event
  • 16:00-16:30: break
  • Social event

Friday 22: Social Data

  • 09:00-10:30: Lecture 6 by Bogdan Cautis
  • 10:30-11:00: break
  • 11:00-12:00: Lecture 6 by Bogdan Cautis
  • 12:00-13:30: lunch
  • 13:30-16:00: Lecture 7 by Michalis Vazirgiannis

Lecturers


Query Optimization

Guido Moerkotte studied computer science in Dortmund, Karlsruhe, and Amherst. After acquiring his doctoral degree he became a professor at the RWTH Aachen and then at the University of Mannheim.
His main research interest is query optimization.

Title : Plan Generation
Abstract: Every query optimizer contains a piece of code that is responsible for the enumeration of plan alternatives. Among these, the cheapest one is then selected for execution.
For different plan generation problems, we present several solutions and discuss their correctness and efficiency.
Date and Time : Monday, May 18th, 2015 – 2 pm to 6 pm
Material: Dynamic Programming Based Plan Generation


Workflows, Provenance, Reproducibility

Sarah Cohen-Boulakia is an Associate Professor in Computer Science in the Bioinformatics group at the University Paris Sud. Her major research interest is data integration in the Life Sciences, with a focus on Provenance in Scientific Workflows and Biological data Ranking. Since graduate school, she has been closely collaborating with biologists, physicians, and bioinformaticians in National, European and International projects. She has close collaborations with major International groups interested in Biological data integration including the Database group and Center for Bioinformatics at the University of Pennsylvania (USA), the Knowledge Management in Bioinformatics group at the Humboldt University (Berlin) and the INRIA Zenith group in Montpellier (France).

Title : Scientific Data Integration Workflows: Challenges and Opportunities
Abstract : Since the advent of high-throughput biology (e.g., the Human Genome Project), integrating the huge volumes of diverse biological data sets has been considered one of the most important tasks for advancement in the biological sciences. Recent years have seen a shift in the understanding of what a “data integration system” for big biological data sets should do, revitalizing database research in this direction.

In this talk, we briefly review the past and current state of data integration for the Life Sciences and discuss recent trends in more depth. In particular, we focus on scientific workflow systems, which provide an environment to guide the scientific discovery process from the design of the bioinformatics experiment to its full execution. While workflows describe the steps of experiments, workflow executions concretely depict the huge number of interlinked tools executed and the large volumes of data sets consumed and produced by each execution of a step. Tasks of paramount importance for end-users include searching for similar workflows in workflow repositories, designing and refactoring workflows to make them easier to reuse, and comparing data obtained by different workflow executions at various levels of granularity. Providing solutions to such inherently complex graph-based problems is particularly challenging. In this context, we describe available approaches and underline research opportunities for the database community.
Date and Time : Tuesday, May 19th, 2015 – 9 am to 12 noon
Material: Scientific Data Integration Workflow


Dennis Shasha is a professor of computer science at the Courant Institute of New York University and an Associate Director of NYU Wireless. He works with biologists on pattern discovery for network inference; with computational chemists on algorithms for protein design; with physicists and financial people on algorithms for time series; on clocked computation for DNA computing; and on computational reproducibility. Other areas of interest include database tuning as well as tree and graph matching.

Because he likes to type, he has written six books of puzzles about a mathematical detective named Dr. Ecco, a biography about great computer scientists, and a book about the future of computing. He has also written five technical books about database tuning, biological pattern recognition, time series, DNA computing, resampling statistics, and causal inference in molecular networks. He has co-authored over seventy journal papers, seventy conference papers, and twenty patents. He has written the puzzle column for various publications including Scientific American, Dr. Dobb’s Journal, and the Communications of the ACM. He is a fellow of the ACM and an INRIA International Chair.

Title : Computational Reproducibility: Why Needed, First Tools, and Open Problems
Abstract : Ever since Francis Bacon, a hallmark of the scientific method has been that experiments should be described in enough detail that they can be repeated and perhaps generalized. When Newton said that he could see farther because he stood on the shoulders of giants, he depended on the truth of his predecessors’ observations and the correctness of their calculations. For science having computational components (the vast majority these days), this implies the possibility of repeating results on nominally equal configurations and then generalizing the results by replaying them on new data sets, and seeing how they vary with different parameters. Unfortunately, the state of the art falls far short of this goal. Most computational experiments are specified only informally in papers, where experimental results are briefly described in figure captions; the code that produced the results is seldom available; and configuration parameters change results in unforeseen ways.

This has serious implications. There have been several instances in the recent past of mistakes discovered in papers appearing in the most prominent scientific journals, some of which influenced government policy. Supporting reproducibility gives other benefits to groups whose members come and go, whose machines change and which would like to benefit from the results of others even when performed on different hardware and operating systems.

After defining the goal of reproducibility and its benefits, this talk discusses tools and open research problems.
Date and Time: Tuesday, May 19th, 2015 – 1.30 pm to 4 pm
Material: Computational Reproducibility: Why Needed, First Tools, Open Problems


Massively Parallel Data Management

Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin). Volker also holds a position as an adjunct full professor at the University of Toronto and is director of the research group “Intelligent Analysis of Mass Data” at DFKI, the German Research Center for Artificial Intelligence. Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff member & Project Leader at the IBM Almaden Research Center in San Jose, California, USA. His research interests include new hardware architectures for information management, scalable processing and optimization of declarative data analysis programs, and scalable data science, including graph and text mining, and scalable machine learning.

Sebastian Schelter is a PhD student at the Database Systems and Information Management Group (DIMA) of TU Berlin with Prof. Volker Markl. His research aims at improving the technology for performing large scale data analysis on parallel processing platforms. In terms of use cases, his focus is on enabling Collaborative Filtering with billions of interactions and Graph Mining on graphs with billions of vertices and edges. Furthermore, he is part of the development team of Apache Flink (formerly Stratosphere), a database-inspired parallel processing stack for large scale analytics. The system is part of a research project led by the DIMA group at TU Berlin. Sebastian Schelter is also engaged in Open Source as a member of the Apache Software Foundation, where he is a committer and PMC member in the Mahout, Giraph and Flink projects.

Title: Big Data Analytics – Challenges, Opportunities and an Introduction in Apache Flink
Abstract: Data management research, systems, and technologies have drastically improved the availability of data analysis capabilities, particularly for non-experts, due in part to low-entry barriers and reduced ownership costs (e.g., for data management infrastructures and applications). Major reasons for the widespread success of database systems and today’s multi-billion dollar data management market include data independence, separating physical representation and storage from the actual information, and declarative languages, separating the program specification from its intended execution environment.

In contrast, today’s big data solutions do not offer data independence and declarative specification. As a result, big data technologies are mostly employed in newly-established companies with IT-savvy employees or in large well-established companies with big IT departments. We argue that current big data solutions will continue to fall short of widespread adoption, due to usability problems, despite the fact that in-situ data analytics technologies achieve a good degree of schema independence. In particular, we consider the lack of a declarative specification to be a major road-block, contributing to the scarcity of available data scientists and limiting the application of big data to the IT-savvy industries. Specifically, data scientists currently have to spend a lot of time on tuning their data analysis programs for specific data characteristics and a specific execution environment.

We believe that the research community needs to bring the powerful concepts of declarative specification to current data analysis systems, in order to achieve broad adoption of big data technology and effectively deliver the promise that novel big data technologies offer. We will also discuss recent research contributions to the field of big data analytics by our group at TU Berlin and will introduce the Berlin Big Data Center. We will also give an introduction to Apache Flink, which includes the architecture, basic programming model, examples, and advanced concepts, such as iterations and automatic optimization.
Date and Time: Wednesday, May 20th, 2015 – 9 am to 12.30 pm + lab session – 2.00 pm to 4 pm
Material:
Big Data Management And Apache Flink, by V. Markl
Massively Parallel Data Management, by S. Schelter


Database Kernels for Data Exploration

Stratos Idreos is an assistant professor of Computer Science at the Harvard School of Engineering and Applied Sciences. Stratos works in the area of data management with emphasis on designing systems for big data exploration. Stratos obtained his Ph.D. from the University of Amsterdam in the Netherlands. Before joining Harvard he spent 3 years as a tenure-track Scientific Staff Member with the Dutch National Research Center for Mathematics and Computer Science and held research internship and visiting scholar positions with Microsoft Research, Redmond USA, with EPFL, Switzerland and with IBM Almaden USA. For his doctoral work on Database Cracking, Stratos won the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award, which recognises the best thesis internationally in the field of data management. In addition, he won the 2011 ERCIM Cor Baayen award as « most promising European young researcher in computer science and applied mathematics » from the European Research Consortium for Informatics and Mathematics. In 2010 he was awarded the IBM zEnterprise System Recognition Award by IBM Research, while in 2011 he also won the Challenges and Visions best paper award in the 2011 International Conference on Very Large Databases.

Title: Database Kernels for Data Exploration
Abstract: How far away are we from a future where a data management system sits in the critical path of everything we do? Already today we need to go through a data system in order to do several basic tasks, e.g., to pay at the grocery store, to book a flight, to find out where our friends are and even to get coffee. Businesses and sciences are increasingly recognizing the value of storing and analyzing vast amounts of data. Other than the expected path towards an exploding number of data-driven businesses and scientific scenarios in the next few years, in this talk we also envision a future where data becomes readily available and its power can be harnessed by everyone. What both scenarios have in common is a need for new kinds of data systems which are tailored for data exploration, which are easy to use, and which can quickly absorb and adjust to new data and access patterns on-the-fly. We will discuss this vision as well as recent and ongoing advances towards data systems which are tailored for data exploration, specifically adaptive indexing, adaptive loading, adaptive layouts and gesture based data systems.
Date and Time: Thursday, May 21st, 2015 – 9.00 am to 12.30 pm
Material :
Database Kernels For Data Exploration (short Version)
Database Kernels For Data Exploration (long Version With Steps)


Social Data

Bogdan Cautis has been a Professor of Computer Science at Paris-Sud University since October 2013. He received his Habilitation (HdR) from Pierre et Marie Curie University in March 2012 and his Ph.D. in September 2007 from Paris-Sud University, working in the Gemo research team of INRIA Futurs, advised by INRIA DR Serge Abiteboul and Tova Milo from Tel Aviv University. He also holds an MSc degree from École Polytechnique and engineering degrees in Computer Science from Politehnica University of Bucharest, Romania and from École Polytechnique, France. His research interests lie in the broad area of Web data management and information retrieval, focusing recently on problems related to social media and user-centric applications.

Title : Social Data Management
Abstract :
This talk will give an overview of interesting, challenging and practical problems involving social networks and the Web, focusing on answering information needs in search and recommendation scenarios. In social media, the information and the users querying it are no longer decoupled, but attached by countless visible (explicit) and invisible (implicit) ties. How to address the overarching goals of data management (while integrating concepts such as social entities / relationships and crucial ingredients pertaining to social relevance / bias, social affinity, information propagation, friend-sourcing, social profiles, user feedback, communities) is a broad area of innovation for researchers and practitioners. As a brief illustration, when searching for micro-blogging posts bearing certain hash-tags or named entities, the existing social links should be used to bias the results, following the intuition that one’s interests are often correlated to those of socially close or similar users. This could be seen as a form of personalization or customization, which may lead to different answers depending on who is asking the question (the seeker).
Date and Time: Friday, May 22nd, 2015 – 9.00 am to 12.00
Material : TBA


Michalis Vazirgiannis is a Professor in LIX, École Polytechnique. He has worked as a researcher in several places: in the Knowledge & DB Lab (group, N.T.U. Athens), in GMD-IPSI (currently Fraunhofer – IPSI), Germany, in Fern-Universitaet Hagen, in project VERSO (later GEMO) in INRIA/Paris, in IBM India Research Laboratory and in Max Planck Institut für Informatik (Saarbruecken, Germany) in the group of G. Weikum. M. Vazirgiannis held a Marie Curie Intra-European fellowship (2006-2007) in the area of « P2P Web Search », hosted by INRIA FUTURS. His current research interests are in the area of big data mining, aiming at harnessing the potential of machine learning algorithms for large scale data sets including text and graphs. More specifically, his current work is on graph degeneracy for large scale graph mining, graph based text retrieval, learning models from time series data and text mining for the web (e.g., advertising, news streams).

Title: Graph Degeneracy and its implications on graph mining
Date and Time: Friday 22nd, 1.30pm to 4pm
Material : TBA


2. Practical information

Address and directions

From Paris Airports

  • From Orly Airport (about 30 min):
    • take the ORLYVAL to Antony,
    • then the RER line B, direction Saint-Rémy-les-Chevreuse.
  • From Roissy Charles de Gaulle Airport (about 1 h):
    • take the RER line B direction Saint-Rémy-les-Chevreuse.

For the rest of the journey, see access by public transportation:

To ENSTA by public transportation

ENSTA, 828 Boulevard des Maréchaux, 91762 Palaiseau Cedex
GPS : 48.711042, 2.219278

  • 1st solution
    • Take the RER line B direction Saint-Rémy-les-Chevreuse. Stop at Massy-Palaiseau.
    • Take the bus, line 91-06, direction Saint-Quentin. Stop at « ENSTA, les Joncherettes ».
  • 2nd solution (for tough walkers)
    • Take the RER line B direction Saint-Rémy-les-Chevreuse. Stop at Lozère station.
    • 10-minute walk to ENSTA, starting with steep stairs.

To École Polytechnique: Friday, May 22

ECOLE POLYTECHNIQUE Route de Saclay – 91128 PALAISEAU
GPS : +48° 42′ 51.00″, +2° 12′ 09.00″
The last day of the School will take place at École Polytechnique. The building is located in the same area as ENSTA. Directions are the same, but it is one bus stop farther:

  • Take the bus, line 91-06, direction Saint-Quentin. Stop at « Lozère » or « Laboratoire ».


Lab session

To access and work with the GitHub project, your computer will need to be equipped with:

  • Java 7 (preferably Oracle JDK 7)
  • Maven
  • Git
  • a Java IDE (preferably IntelliJ IDEA community edition)
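
Before the lab session, a quick way to see whether the command-line tools from the list above are installed is a small check like the following. This is only a minimal sketch: the command names `java`, `mvn` and `git` are assumptions about how the tools appear on your PATH, the helper `find_missing_tools` is a hypothetical name introduced here, and the script checks neither the Java version nor the IDE.

```python
import shutil

# Assumed command names for the tools listed above; adjust if your
# installation exposes them under different names.
REQUIRED_TOOLS = {"Java": "java", "Maven": "mvn", "Git": "git"}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the display names of tools whose command is not found on PATH."""
    return [name for name, command in tools.items() if which(command) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("java, mvn and git were all found on PATH.")
```

If a tool is reported as not found, it may still be installed but not on your PATH. The Java version itself can be checked separately with `java -version`.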

Organizers

Ioana Manolescu, INRIA, France
Pierre Senellart, Télécom ParisTech
Isabelle Glas, Labex DigiCosme, France
Contact: contact@digicosme.fr