Action line : DataSense
Project Members : Danai Symeonidou (Post-Doc), Katerina Tzompanaki (Post-Doc), Thomas Rebele (PhD Student)
Coordinator : Fabian Suchanek (LTCI, Télécom ParisTech)
Subject : Information Extraction, knowledge linking, knowledge mining
Institutions :

DigiCosme Funding : 2014/2017

Scientific production :

  • In the context of Thomas Rebele’s PhD thesis (Extending the YAGO knowledge base) and Katerina Tzompanaki’s postdoctoral position:
    • Thomas Rebele, Thomas Pellissier Tanon, Fabian M. Suchanek: “Bash Datalog: Answering Datalog Queries with Unix Shell Commands”, International Semantic Web Conference (ISWC), 2018
    • Thomas Rebele, Katerina Tzompanaki, Fabian M. Suchanek: “Adding Missing Words to Regular Expressions”, Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2018
    • Thomas Rebele, Arash Nekoei, Fabian M. Suchanek: “Using YAGO for the Humanities”, Workshop on Humanities in the Semantic Web (WHISE), 2017
    • Thomas Rebele, Katerina Tzompanaki, Fabian M. Suchanek: “Visualizing the addition of missing words to regular expressions”, International Semantic Web Conference (ISWC) demo track, 2017
    • Thomas Rebele, Fabian M. Suchanek, Johannes Hoffart, Joanna Asia Biega, Erdal Kuzey, Gerhard Weikum: “YAGO: a multilingual knowledge base from Wikipedia, Wordnet, and Geonames”, International Semantic Web Conference (ISWC) short paper track, 2016
    • Hiep Le, Thomas Rebele, Fabian M. Suchanek: “Open Digital Forms”, Theory and Practice of Digital Libraries (TPDL/ECDL) demo track, 2016
  • In the context of Danai Symeonidou’s postdoctoral position:
    • Danai Symeonidou, Luis Galárraga, Nathalie Pernelle, Fatiha Saïs, Fabian M. Suchanek: “VICKEY: Mining Conditional Keys on Knowledge Bases”, International Semantic Web Conference (ISWC), 2017
    • Ziad Ismail, Danai Symeonidou, Fabian M. Suchanek: “DIVINA: Discovering Vulnerabilities of Internet Accounts”, World Wide Web Conference (WWW) demo track, 2015
  • With project partners:
    • Fabian M. Suchanek, Colette Menard, Meghyn Bienvenu, Cyril Chapellier: “What if machines could be creative?”, International Semantic Web Conference (ISWC) demo track, 2016
    • Fabian M. Suchanek, Colette Menard, Meghyn Bienvenu, Cyril Chapellier: “Can you imagine… a language for combinatorial creativity?”, International Semantic Web Conference (ISWC), 2016
    • Fabian M. Suchanek, Gerhard Weikum: “Knowledge Representation in Entity-Centric Knowledge Bases”, RUSSIR Summer School invited paper, 2016
    • Gerhard Weikum, Johannes Hoffart, Fabian M. Suchanek: “Ten Years of Knowledge Harvesting: Lessons and Challenges”, Data Engineering Bulletin, 2016
    • Johannes Hoffart, Nicoleta Preda, Fabian M. Suchanek, Gerhard Weikum: “Knowledge Bases for Web Content Analytics”, World Wide Web (WWW) tutorial, 2015
    • David Montoya, Thomas Pellissier Tanon, Serge Abiteboul, Fabian M. Suchanek: “Thymeflow, a personal knowledge base with spatio-temporal data”, International Conference on Information and Knowledge Management (CIKM) demo track, 2016
    • Serge Abiteboul, Luna Dong, Oren Etzioni, Divesh Srivastava, Gerhard Weikum, Julia Stoyanovich, Fabian M. Suchanek: “The elephant in the room: getting value from Big Data”, Web and Databases (WebDB) short paper track, 2015
  • Other works:
    • Jérôme Dockès, Demian Wassermann, Russell Poldrack, Fabian M. Suchanek, Bertrand Thirion, Gaël Varoquaux: “Text to brain: predicting the spatial distribution of neuroimaging observations from text reports”, International Conference On Medical Image Computing and Computer Assisted Intervention (MICCAI), 2018
    • Arnaud Soulet, Arnaud Giacometti, Béatrice Markhoff, Fabian M. Suchanek: “Representativeness of Knowledge Bases with the Generalized Benford’s Law”, International Semantic Web Conference (ISWC), 2018
    • Thomas Pellissier Tanon, Marcos Dias de Assunção, Eddy Caron, Fabian M. Suchanek: “Demoing Platypus – A Multilingual Question Answering Platform for Wikidata”, Extended Semantic Web Conference (ESWC) demo track, 2018
    • David Montoya, Thomas Pellissier Tanon, Serge Abiteboul, Fabian M. Suchanek: “A Knowledge Base for Personal Information Management”, Linked Data on the Web workshop (LDOW), 2018
    • Jonathan Lajus, Fabian M. Suchanek: “Are All People Married? Determining Obligatory Attributes in Knowledge Bases”, Web Conference (WWW), 2018
    • Jérôme Dockès, Olivier Grisel, Joan Massich, Fabian M. Suchanek, Bertrand Thirion, Gaël Varoquaux: “Relating Brain Structures To Open-Ended Descriptions Of Cognition”, Conference on Cognitive Computational Neuroscience (CCN) short paper track, 2017
    • Fabian M. Suchanek: “Extraction d’informations”, Les Big Data à découvert, 2017
    • Luis Galárraga, Simon Razniewski, Antoine Amarilli, Fabian M. Suchanek: “Predicting Completeness in Knowledge Bases”, International Conference on Web Search and Data Mining (WSDM), 2017
    • Simon Razniewski, Fabian M. Suchanek, Werner Nutt: “But What Do We Actually Know?”, Automated Knowledge Base Construction (AKBC), 2016
    • Luis Galárraga, Christina Teflioudi, Katja Hose, Fabian M. Suchanek: “Fast Rule Mining in Ontological Knowledge Bases with AMIE+”, VLDB Journal (VLDBJ), 2015
    • Aliaksandr Talaika, Joanna Asia Biega, Antoine Amarilli, Fabian M. Suchanek: “IBEX: Harvesting Entities from the Web Using Unique Identifiers”, Web and Databases (WebDB), 2015

The Web provides a seemingly endless resource of information. The number of Web pages has recently surpassed one trillion, and its size is estimated to be several million terabytes. If a book contains 5 megabytes of data, we would have to print 200 billion books to store the Web – and build 6000 Libraries of Congress for them. And yet size is not everything. The crucial question is how useful that data is. And while the Web can already answer many everyday questions, it fails quickly in the face of more advanced information needs.

Complex information needs may appear in many different domains: Researchers in the life sciences will want to find known proteins that inhibit atherosclerosis. Patent officers will want to find scientific papers that inspired a patent application. Economists will want to study the effects of reforms (or their absence) on the economy of a country (or the morale of its inhabitants). These are information requirements for which the Web might contain answers. Yet, these answers are often hard, if not impossible, to find. Thus, we find ourselves caught in a paradox: The amount of information on the Web is steadily growing, yet its usefulness is not.

The reason for this paradox is that the information on the Web mostly takes the form of natural language. Computers, however, are blind to the semantic dimension of language. If computers are to help us find information, then that information has to be in a form that a computer can “understand”. That form is usually an ontology. An ontology, in its most general sense, is a structured collection of real-world knowledge with attached semantic rules. An ontology contains knowledge in such a form that a computer can answer queries on it, check it for consistency, and reason on it. Consequently, ontologies find applications in numerous domains, such as machine translation, word sense disambiguation, document classification, question answering, query expansion, and information integration.
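
As a rough illustration of what querying, consistency checking, and reasoning over an ontology can look like, here is a minimal Python sketch. It represents knowledge as subject-predicate-object triples and applies a single transitivity rule. The facts, the relation name `locatedIn`, and the helper functions `query`, `infer_transitive`, and `is_consistent` are illustrative assumptions made for this example; they are not taken from YAGO or from the Seda project.

```python
# Toy ontology: a set of (subject, predicate, object) triples.
# All entity and relation names are illustrative assumptions.
facts = {
    ("Paris", "locatedIn", "France"),
    ("France", "locatedIn", "Europe"),
    ("Paris", "type", "City"),
}


def query(facts, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern, sorted (None is a wildcard)."""
    return sorted(
        (s, p, o)
        for (s, p, o) in facts
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    )


def infer_transitive(facts, predicate):
    """Apply the rule p(x, y) and p(y, z) => p(x, z) until nothing new appears."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for (x, p1, y) in list(derived):
            for (y2, p2, z) in list(derived):
                if (
                    p1 == p2 == predicate
                    and y == y2
                    and (x, predicate, z) not in derived
                ):
                    derived.add((x, predicate, z))
                    changed = True
    return derived


def is_consistent(facts, predicate):
    """Check one simple constraint: nothing stands in the relation to itself."""
    return not any(s == o for (s, p, o) in facts if p == predicate)


closed = infer_transitive(facts, "locatedIn")
print(query(closed, subject="Paris", predicate="locatedIn"))
print(is_consistent(closed, "locatedIn"))
```

In this sketch, the fact that Paris is located in Europe is never stated explicitly; it is derived by the rule. Real ontologies rely on far richer rule languages and dedicated reasoners.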

Most notably, ontologies provide a way to attack the problem of “Big Data”: Data mining, visual analytics, and data-driven decision making systems can all benefit from semantic knowledge. Ontologies can provide, e.g., the categories for clustering, the dimensions for visualization, and the common sense for making decisions. Given the wealth of unstructured data on the Web on one hand, and the usefulness of ontologies on the other hand, recent years have seen considerable efforts to extract ontological knowledge from Web data.

With the Seda project, we propose to push this search even further. Our goal is to make even more information semantically accessible to machines. This goal can be pursued along numerous dimensions: Information Extraction, Crowd Sourcing, Data Mining, Text Annotation, Ontology Mapping, and Manual Ontology Design all come to mind. The Seda project has to concentrate on certain axes. The key insight that defines the thrust of Seda is the exploitation of existing structured data: Nowadays, structured information is already available in the form of the existing ontologies. Thus, the search for new structured knowledge can now be fuelled by the structured knowledge that is already available. This is the twist that guides the research in Seda. We will exploit this philosophy along 3 dimensions:

  • 1. By Information Extraction: The classical tool of information extraction can now be helped by the information that is already in structured form in existing ontologies.
  • 2. By knowledge linking: Information in the form of text, Web Services, or Web forms can now be semantically interpreted by linking it with an existing ontology.
  • 3. By knowledge mining: The information in an existing ontology can be used for reasoning, thus deriving new pieces of information from within the data.

In addition, Seda will conduct basic research on the foundations of ontological knowledge representation.

Fabian M. Suchanek is a full professor at Télécom ParisTech in Paris.

He obtained his PhD at the Max-Planck Institute for Informatics under the supervision of G. Weikum. In his thesis, Fabian developed, among other things, the YAGO ontology, one of the largest public ontologies, which earned him an honorable mention for the SIGMOD dissertation award.

Fabian Suchanek was a postdoc at Microsoft Research in Silicon Valley (reporting to R. Agrawal) and at INRIA Saclay/France (reporting to S. Abiteboul). He continued as the leader of the Otto Hahn Research Group “Ontologies” at the Max-Planck Institute for Informatics in Germany.

Fabian taught classes on the Semantic Web, Information Extraction and Knowledge Representation in France, in Germany, and in Senegal. With his students, he works on information extraction, rule mining, ontology matching, and other topics related to large knowledge bases. He has published around 70 scientific articles, among others at ISWC, VLDB, SIGMOD, WWW, CIKM, ICDE, and SIGIR, and his work has been cited more than 8000 times. He has won 3 best paper or runner-up awards and the ten-year test of time award of the WWW conference.