Workshop on Distributed and Stream Data Processing and ML
Friday, March 22nd, 08:30 – 12:30
Smart Data Forum, Salzufer 6, entrance on Otto-Dibelius-Strasse, 10587 Berlin
08:30 Meet and coffee
09:00 Introduction by Volker Markl
09:30 Talk by Albert Bifet "Machine Learning for Data Streams" and Discussion
10:30 Talk by Amr El Abbadi "The Cloud, the Edge and Blockchains: Unifying Themes and Challenges" and Discussion
11:30 Talk by Seif Haridi and Paris Carbone "From Stream Processing to Continuous and Deep Analytics" and Discussion
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co-Director of the Berlin Machine Learning Center (BzMl). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on "Implicit Parallelism Through Deep Language Embedding."
Abstract: Big Data and the Internet of Things (IoT) have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, and I will introduce some popular open source tools for data stream mining.
Bio: Albert Bifet is Professor at Telecom ParisTech, Head of the Data, Intelligence and Graphs (DIG) Group, and Honorary Research Associate at the WEKA Machine Learning Group at University of Waikato. Previously he worked at Huawei Noah's Ark Lab in Hong Kong, Yahoo Labs in Barcelona, University of Waikato and UPC BarcelonaTech. He is the co-author of a book on Machine Learning from Data Streams. He is one of the leaders of the MOA and Apache SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams. He has served as Co-Chair of the Industrial track of IEEE MDM 2016 and ECML PKDD 2015, and as Co-Chair of BigMine (2012-2018) and the ACM SAC Data Streams Track (2012-2019).
Abstract: Significant paradigm shifts are occurring in the way data is accessed and updated. Data is “very big” and distributed across the globe. Access patterns are widely dispersed and large scale analysis requires real-time responses. Many of the fundamental challenges have been studied and explored by both the distributed systems and the database communities for decades. However, the current changing and scalable setting often requires a rethinking of basic assumptions and premises. The rise of the cloud computing paradigm with its global reach has resulted in novel approaches to integrate traditional concepts in novel guises to solve fault-tolerance and scalability challenges. This is especially the case when users require real-time global access. Exploiting edge cloud resources becomes critical for improved performance, which requires a reevaluation of many paradigms, even for a traditional problem like caching. The need for transparency and accessibility has led to innovative ways for managing large scale replicated logs and ledgers, giving rise to blockchains and their many applications. In this talk we will explore some of these new trends while emphasizing the novel challenges they raise from both distributed systems and database points of view. We will propose a unifying framework for traditional consensus and commitment protocols, and discuss novel protocols that exploit edge computing resources to enhance performance. We will highlight the advantages and discuss the limitations of blockchains. Our overall goal is to explore approaches that unite and exploit many of the significant efforts made in distributed systems and databases to address the novel and pressing needs of today’s global computing infrastructure.
Bio: Amr El Abbadi is a Professor of Computer Science at the University of California, Santa Barbara. He received his B. Eng. from Alexandria University, Egypt, and his Ph.D. from Cornell University. Prof. El Abbadi is an ACM Fellow, AAAS Fellow, and IEEE Fellow. He was Chair of the Computer Science Department at UCSB from 2007 to 2011. He has served as a journal editor for several database journals, including The VLDB Journal, IEEE Transactions on Computers and The Computer Journal. He has been Program Chair for multiple database and distributed systems conferences. He currently serves on the executive committee of the IEEE Technical Committee on Data Engineering (TCDE) and was a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. In 2013, his student Sudipto Das received the SIGMOD Jim Gray Doctoral Dissertation Award. Prof. El Abbadi is also a co-recipient of the Test of Time Award at EDBT/ICDT 2015. He has published over 300 articles in databases and distributed systems and has supervised over 35 PhD students.
Abstract: Contemporary end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensors and graphs. For each of these types of workloads there exist several frontends (e.g., SQL, Beam, TensorFlow) exposed in different programming languages, as well as different runtimes (e.g., Spark, Flink, TensorFlow) that optimise for a respective frontend and possibly a hardware architecture (e.g., GPUs). The resulting pipelines suffer in terms of complexity and performance due to excessive type conversions, materialization of intermediate results and lack of cross-frontend computation sharing capabilities.
In this talk we present the Continuous Deep Analytics (CDA) project, the core principles behind it and our past work that influenced its conception. CDA aims to provide a unified approach to declare and execute analytical tasks across frontend boundaries, and to enable their seamless integration with continuous services, streams and data-driven applications at scale. The system achieves that through Arc, an intermediate language that captures batch and stream analytics, as well as a sophisticated distributed runtime that combines and augments existing ideas from stream processing, in-memory databases and cluster computing.
Bio: Paris Carbone is a senior researcher at the Swedish Institute of Computer Science (part of RISE). He holds a PhD in distributed computing from KTH and is one of the core committers for Apache Flink with key contributions to its state management. Paris is currently leading the Distributed Computing & Data Science research group at SICS, whose interests span several domains of computer science from distributed algorithms and data management to declarative programming support for data analytics and ML.
Bio: Seif Haridi is the Chief Scientific Advisor of RISE SICS. He is Chair-Professor of Computer Systems specialized in parallel and distributed computing at KTH Royal Institute of Technology, Stockholm, Sweden. He led a European research program on Cloud Computing and Big Data by EIT-Digital from 2010 to 2013, and is a co-founder of a number of start-ups in the area of distributed and cloud computing including HiveStreaming and LogicalClocks. His recent research includes contributions to the design of Apache Flink for stream processing, and HOPS, a complete platform for data analytics.
To register for the workshop, please fill in the form below.
We collect your full name and your email address to keep an overview of how many people will attend the workshop, to carry out the workshop, and for TU Berlin internal administrative processes. We store the data in our internal storage systems (TU Berlin systems) and delete it once all processes related to the workshop are complete. Data will not be passed to third parties. You can revoke your consent at any time by email to firstname.lastname@example.org.