SmartDataLake - Sustainable Data Lakes for Extreme-Scale Analytics
Data lakes are raw data ecosystems, where large amounts of diverse data are retained and coexist. They facilitate self-service analytics for flexible, fast, ad hoc decision making. SmartDataLake enables extreme-scale analytics over sustainable big data lakes. It provides an adaptive, scalable and elastic data lake management system that offers: (a) data virtualization for abstracting and optimizing access and queries over heterogeneous data, (b) data synopses for approximate query answering and analytics to enable interactive response times, and (c) automated placement of data in different storage tiers based on data characteristics and access patterns to reduce costs. The data lake’s contents are modelled and organised as a heterogeneous information network, containing multiple types of entities and relations. Efficient and scalable algorithms are provided for: (a) similarity search and exploration for discovering relevant information, (b) entity resolution and ranking for identifying and selecting important and representative entities across sources, (c) link prediction and clustering for unveiling hidden associations and patterns among entities, and (d) change detection and incremental update of analysis results to enable faster analysis of new data. Finally, interactive and scalable visual analytics are provided to include and empower the data scientist in the knowledge extraction loop. This includes functionalities for: (a) visually exploring and tuning the space of features, models and parameters, and (b) enabling large-scale visualizations of spatial, temporal and network data. The results of the project are evaluated in real-world use cases from the business intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making. SmartDataLake will foster innovation and enable European SMEs to capitalize on the value of their own data lakes.
- AG Keim (Data Analysis and Visualization)
(2020): explAIner: A Visual Analytics Framework for Interactive and Explainable Machine Learning. IEEE Transactions on Visualization and Computer Graphics. Institute of Electrical and Electronics Engineers (IEEE). 2020, 26(1), pp. 1064-1074. ISSN 1077-2626. eISSN 1941-0506. Available under: doi: 10.1109/TVCG.2019.2934629
We propose a framework for interactive and explainable machine learning that enables users to (1) understand machine learning models; (2) diagnose model limitations using different explainable AI methods; as well as (3) refine and optimize the models. Our framework combines an iterative XAI pipeline with eight global monitoring and steering mechanisms, including quality monitoring, provenance tracking, model comparison, and trust building. To operationalize the framework, we present explAIner, a visual analytics system for interactive and explainable machine learning that instantiates all phases of the suggested pipeline within the commonly used TensorBoard environment. We performed a user study with nine participants across different expertise levels to examine their perception of our workflow and to collect suggestions to fill the gap between our system and framework. The evaluation confirms that our tightly integrated system leads to an informed machine learning process while disclosing opportunities for further extensions.
(2020): v-plots: Designing Hybrid Charts for the Comparative Analysis of Data Distributions. Computer Graphics Forum. Wiley. 2020, 39(3), pp. 565-577. ISSN 0167-7055. eISSN 1467-8659. Available under: doi: 10.1111/cgf.14002
Comparing data distributions is a core focus in descriptive statistics, and part of most data analysis processes across disciplines. In particular, comparing distributions entails numerous tasks, ranging from identifying global distribution properties, comparing aggregated statistics (e.g., mean values), to the local inspection of single cases. While various specialized visualizations have been proposed (e.g., box plots, histograms, or violin plots), they are not usually designed to support more than a few tasks, unless they are combined. In this paper, we present the v-plot designer, a technique for authoring custom hybrid charts, combining mirrored bar charts, difference encodings, and violin-style plots. v-plots are customizable and enable the simultaneous comparison of data distributions on global, local, and aggregation levels. Our system design is grounded in an expert survey that compares and evaluates 20 common visualization techniques to derive guidelines for the task-driven selection of appropriate visualizations. This knowledge externalization step allowed us to develop a guiding wizard that can tailor v-plots to individual tasks and particular distribution properties. Finally, we confirm the usefulness of our system design and the user-guiding process by measuring the fitness for purpose and applicability in a second study with four domain and statistic experts.
(2020): Uncertainty-Aware Principal Component Analysis. IEEE Transactions on Visualization and Computer Graphics. Institute of Electrical and Electronics Engineers (IEEE). 2020, 26(1), pp. 822-831. ISSN 1077-2626. eISSN 1941-0506. Available under: doi: 10.1109/TVCG.2019.2934812
We present a technique to perform dimensionality reduction on data that is subject to uncertainty. Our method is a generalization of traditional principal component analysis (PCA) to multivariate probability distributions. In comparison to non-linear methods, linear dimensionality reduction techniques have the advantage that the characteristics of such probability distributions remain intact after projection. We derive a representation of the PCA sample covariance matrix that respects potential uncertainty in each of the inputs, building the mathematical foundation of our new method: uncertainty-aware PCA. In addition to the accuracy and performance gained by our approach over sampling-based strategies, our formulation allows us to perform sensitivity analysis with regard to the uncertainty in the data. For this, we propose factor traces as a novel visualization that enables a better understanding of the influence of uncertainty on the chosen principal components. We provide multiple examples of our technique using real-world datasets. As a special case, we show how to propagate multivariate normal distributions through PCA in closed form. Furthermore, we discuss extensions and limitations of our approach.
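The idea of a covariance matrix that respects per-input uncertainty can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each input is given as a mean vector and a covariance matrix, and it treats the inputs as an equally weighted mixture, so the combined covariance follows from the law of total covariance. All function names are hypothetical, and power iteration stands in for a full eigendecomposition to keep the example dependency-free.

```python
def outer(a, b):
    """Outer product of two vectors as a list of rows."""
    return [[x * y for y in b] for x in a]


def mean_vector(vectors):
    """Component-wise average of a list of equally long vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def uncertainty_aware_covariance(means, covariances):
    """Covariance of an equally weighted mixture of distributions.

    Law of total covariance: average of (Sigma_i + mu_i mu_i^T)
    minus the outer product of the overall mean with itself.
    """
    n, d = len(means), len(means[0])
    center = mean_vector(means)
    total = [[0.0] * d for _ in range(d)]
    for mu, sigma in zip(means, covariances):
        mm = outer(mu, mu)
        for r in range(d):
            for c in range(d):
                total[r][c] += (sigma[r][c] + mm[r][c]) / n
    cc = outer(center, center)
    return [[total[r][c] - cc[r][c] for c in range(d)] for r in range(d)]


def leading_component(matrix, iterations=200):
    """Approximate the dominant eigenvector by power iteration.

    Assumes a non-zero symmetric matrix with a unique largest eigenvalue.
    """
    d = len(matrix)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(matrix[r][c] * v[c] for c in range(d)) for r in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def project_normal(mu, sigma, axis, center):
    """Project N(mu, sigma) onto a unit axis.

    A linear map of a normal distribution is again normal, so the
    result is fully described by a 1-D mean and variance.
    """
    d = len(mu)
    mean_1d = sum(axis[k] * (mu[k] - center[k]) for k in range(d))
    var_1d = sum(axis[r] * sigma[r][c] * axis[c]
                 for r in range(d) for c in range(d))
    return mean_1d, var_1d


# Two inputs whose means differ along x but whose uncertainty is large along y.
means = [[-1.0, 0.0], [1.0, 0.0]]
covariances = [[[0.1, 0.0], [0.0, 4.0]], [[0.1, 0.0], [0.0, 4.0]]]

cov = uncertainty_aware_covariance(means, covariances)
axis = leading_component(cov)
projected = [project_normal(m, s, axis, mean_vector(means))
             for m, s in zip(means, covariances)]
```

In this toy input the means alone vary only along x, yet the large uncertainty along y dominates the combined covariance, so the leading component turns toward y. This is the kind of effect that a sensitivity analysis with regard to the uncertainty is meant to expose.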
Duration: 01.01.2019 – 31.12.2021