Data Analytics, Advanced Study Programme, Internet
Prof. Dr. Christian Bizer
Mannheim (online if necessary)
1st session: 14.03.2022 - 24.04.2022
2nd session: 13.03.2023 - 30.04.2023
The course Web Data Integration covers advanced techniques for integrating data from multiple sources in the context of the World Wide Web as well as within enterprise settings.
The course covers the following topics and enables students to reach the following learning goals:
- Introduction to Web Data Integration: The course starts with a general overview of the topic of data integration, covering common data integration scenarios, the different types of heterogeneity that need to be bridged in data integration, the general data integration process, as well as the principal architectures of data integration systems. Students will understand the challenges of data integration and will be able to put these challenges into context.
- Types of Structured Data on the Web: Besides free text, the Web contains a lot of structured data. Examples of websites that provide structured data include data catalogs, which offer data dumps in various formats for download, as well as Web APIs, which allow specific, restricted queries to be posed against Web data sources. A lot of structured data is also found in HTML documents: besides HTML tables, which may contain structured data, more and more websites use markup languages such as Microdata and RDFa to annotate structured data like addresses, product descriptions, or reviews in their pages. In addition, various data providers in government, publishing, libraries, and research use Linked Data technologies to publish structured data on the Web and set data links between data sources in order to ease integration. Students will gain an understanding of the different types of structured data that are available on the Web and will know which techniques are commonly used to publish data for a specific application or topical domain.
- Data Exchange Formats: Whenever systems communicate over the Web, they transfer data using data exchange formats. Commonly used data exchange formats include XML, CSV, JSON, and RDF. This topic covers the basic structure and syntax of these formats as well as their benefits and drawbacks. Students will gain the ability to load and query data represented in these formats from within the applications that they develop.
- Schema Mapping and Data Translation: Different data sources use different schemata to represent data. The goal of schema mapping is to align heterogeneous schemata. This alignment is achieved either by mapping a set of source schemata to a given target schema, for example when integrating new data sources into an existing system, or by creating a new target schema that is capable of representing all data from a given set of data sources. Such mappings consist of correspondences between elements of the source and target schemata. The correspondences are afterwards used to translate data represented in a source schema into the target schema. Students will learn to distinguish between different schema mapping scenarios. The course teaches them how to define correspondences between schemata and gives an overview of schema matching techniques for automating the process of finding such correspondences.
- Identity Resolution: The same real-world entity, e.g. a person, a product, or a geographic location, is often described by multiple data sources. The goal of identity resolution is to determine all records in all data sources that describe the same real-world entity. Students will learn how to apply domain-specific similarity measures in order to find records in different data sources that describe the same entity. They will also learn to apply blocking techniques in order to deal with the quadratic complexity of the record matching task.
- Data Quality and Data Fusion: The last step in the data integration process, after schema-level and instance-level correspondences have been found, is to combine data from different sources in order to generate an integrated data set. This step is called data fusion and tries to achieve two goals in parallel: 1. The created data set should be as complete as possible, meaning that all attributes of the target schema should be filled with values for all entities. 2. The quality of the data values in the created data set should fulfill the user's needs, which in most cases means that the data should reflect the real world as closely as possible. Students will develop an understanding of the different dimensions of data quality as well as their relevance in different application contexts; they will learn how to assess data quality; and they will learn how to resolve conflicts between data from different sources by applying data fusion methods.
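To illustrate the Microdata annotations mentioned among the types of structured data on the Web, the following minimal sketch extracts annotated properties from an HTML snippet using only the Python standard library. The snippet and the product values are hypothetical, and real Microdata extraction must handle nesting and multiple items; this only shows the basic idea of `itemprop` annotations.

```python
from html.parser import HTMLParser

# Hypothetical snippet: a product description annotated with schema.org Microdata.
HTML = """
<div itemscope itemtype="https://schema.org/Product">
  <span itemprop="name">Acme Anvil</span>
  <span itemprop="price">49.99</span>
</div>
"""

class MicrodataParser(HTMLParser):
    """Collects (itemprop, text) pairs from Microdata-annotated HTML."""
    def __init__(self):
        super().__init__()
        self.current_prop = None
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            # Remember which property the next text node belongs to.
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        if self.current_prop and data.strip():
            self.properties[self.current_prop] = data.strip()
            self.current_prop = None

parser = MicrodataParser()
parser.feed(HTML)
print(parser.properties)  # {'name': 'Acme Anvil', 'price': '49.99'}
```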
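The data exchange formats topic can be made concrete with a small sketch that loads the same hypothetical record from JSON, CSV, and XML using the Python standard library (RDF has no stdlib parser and is omitted here). One practical difference between the formats is visible immediately: JSON preserves number types, while CSV and XML deliver plain strings.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three exchange formats.
json_data = json.loads('{"name": "Mannheim", "population": 309000}')

csv_rows = list(csv.DictReader(io.StringIO("name,population\nMannheim,309000\n")))

xml_root = ET.fromstring(
    "<city><name>Mannheim</name><population>309000</population></city>"
)

print(json_data["population"])          # 309000 (a number: JSON is typed)
print(csv_rows[0]["population"])        # '309000' (a string: CSV is untyped)
print(xml_root.findtext("population"))  # '309000' (a string: XML text content)
```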
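The schema mapping and data translation step described above can be sketched as follows: a set of correspondences between source and target attributes, applied to translate a record into the target schema. The attribute names and the record are hypothetical; real correspondences may also involve value transformations, not just renaming.

```python
# Hypothetical correspondences between a source schema and a target schema,
# expressed as a simple attribute-to-attribute mapping.
correspondences = {"prod_name": "title", "cost_eur": "price"}

def translate(record, mapping):
    """Translate a record from the source schema into the target schema.

    Attributes without a correspondence (e.g. internal IDs) are dropped.
    """
    return {target: record[source]
            for source, target in mapping.items()
            if source in record}

source_record = {"prod_name": "Acme Anvil", "cost_eur": 49.99, "internal_id": 7}
print(translate(source_record, correspondences))
# {'title': 'Acme Anvil', 'price': 49.99}
```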
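For identity resolution, the following minimal sketch combines a token-based Jaccard similarity measure with a simple blocking scheme (records are only compared within the block sharing their first name token), so that not every pair of records has to be compared. The product records and the 0.7 threshold are hypothetical choices for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Token-based Jaccard similarity of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

records = [
    {"id": 1, "name": "Apple iPhone 13"},
    {"id": 2, "name": "apple iphone 13 smartphone"},
    {"id": 3, "name": "Samsung Galaxy S21"},
]

# Blocking: group records by a blocking key (here: the first name token) and
# compare only within blocks, avoiding the quadratic number of comparisons.
blocks = {}
for r in records:
    blocks.setdefault(r["name"].lower().split()[0], []).append(r)

matches = [(a["id"], b["id"])
           for block in blocks.values()
           for a, b in combinations(block, 2)
           if jaccard(a["name"], b["name"]) >= 0.7]
print(matches)  # [(1, 2)]
```

Records 1 and 2 share the blocking key "apple" and have a similarity of 3/4 = 0.75, so they are declared a match; record 3 is never compared against them.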
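The conflict resolution step of data fusion can be sketched with two common strategies: majority voting and trust-weighted voting. The three sources, their trust scores, and the conflicting values are hypothetical; real fusion methods cover many more strategies (e.g. most recent value, longest value).

```python
from collections import Counter

# Conflicting values for one attribute of one entity, reported by three
# hypothetical sources together with a trust score per source.
observations = [
    ("source_a", "Berlin", 0.9),
    ("source_b", "Berlin", 0.6),
    ("source_c", "Bonn", 0.7),
]

def fuse_by_voting(obs):
    """Majority voting: pick the most frequently reported value."""
    return Counter(value for _, value, _ in obs).most_common(1)[0][0]

def fuse_by_trust(obs):
    """Weighted voting: sum the trust scores per value, pick the highest."""
    scores = {}
    for _, value, trust in obs:
        scores[value] = scores.get(value, 0.0) + trust
    return max(scores, key=scores.get)

print(fuse_by_voting(observations))  # 'Berlin' (2 of 3 sources agree)
print(fuse_by_trust(observations))   # 'Berlin' (0.9 + 0.6 > 0.7)
```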
Students will be able to select and apply appropriate techniques for integrating and cleansing enterprise as well as Web data. Participants will acquire knowledge of the data integration process as well as the techniques that are used in each phase of the process.
Students learn to apply data integration techniques in business scenarios.
Students learn to work as a team in order to succeed in a data integration project (case study).
Students work through the content independently on the basis of study letters (Studienbriefe).
Written examination (60 minutes) + project work
Register now for an individual or full certificate in one of our certificate programmes. To ensure that you receive your access credentials in time, registration must be completed no later than two weeks before the start of the (first) module.
Note: In the interest of the participants, the number of places is limited, so we recommend early, binding registration.
If you have any questions, please feel free to contact us at any time.