Data series are one of the most common data types, and are present in virtually every scientific and social domain. Data series analytics (e.g., clustering, classification, frequent pattern mining, and outlier detection) represents an important challenge for very large collections. Previous work has demonstrated that indexing techniques enable scalable analytics by providing fast similarity search. Nevertheless, these techniques only work when the indexed data series and the queries share a fixed, predetermined length, which is a major shortcoming. In this work, we remove this constraint, and present the first index that can efficiently support queries of varying length. The proposed index works for both normalized and non-normalized data series. We address two major challenges. First, we show how we can effectively use the information that already resides within traditional indexes in order to answer queries of varying length without increasing the size of the index. Second, we provide an efficient method for dealing with normalized data series by grouping neighboring subsequences under a common representation, which leads to a small index footprint and fast query answering times. The empirical evaluation of the proposed technique demonstrates the effectiveness and efficiency of our solution. In addition to the poster, which will present the theoretical background of our approach, we will also present a prototype system that implements it. This system allows efficient exploration of big data series collections. Users can pose queries using their mouse (or touch screen), or select queries from a predefined set. The system can execute queries of varying lengths on large multi-gigabyte datasets in seconds, using a commodity laptop.
During our presentation/demonstration, we may be able to show some relevant data and results from our collaboration with EDF (the group of Dr. Georges Hebrail).
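
To make the problem setting concrete, the following is a minimal sketch of variable-length similarity search as a brute-force scan, i.e., the baseline that an index is meant to accelerate. It is not the proposed index: the function names `znorm` and `best_match`, the use of Euclidean distance, and the per-subsequence z-normalization are assumptions made for illustration only.

```python
import math


def znorm(xs):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in xs]


def best_match(series, query, normalize=True):
    """Scan every subsequence of len(query) and return (offset, distance)
    of the closest one under Euclidean distance.

    The query length is not fixed in advance: any length between 1 and
    len(series) is accepted. With normalize=True, the query and each
    candidate subsequence are z-normalized independently before comparison.
    """
    m = len(query)
    if m == 0 or m > len(series):
        raise ValueError("query length must be between 1 and len(series)")
    q = znorm(query) if normalize else list(query)
    best_offset, best_dist = -1, math.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        w = znorm(window) if normalize else window
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w)))
        if dist < best_dist:  # strict comparison keeps the earliest best match
            best_offset, best_dist = start, dist
    return best_offset, best_dist


series = [0, 1, 2, 3, 10, 20, 30, 3, 2, 1]
print(best_match(series, [1, 2, 3]))                    # normalized, length 3
print(best_match(series, [3, 2], normalize=False))      # raw values, length 2
```

Each call costs time proportional to the series length times the query length, and in the normalized case every candidate subsequence must be re-normalized on the fly. This is why a scan of this kind does not scale to multi-gigabyte collections, and why normalized data need special treatment.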