As we reported last week, CSE professors and students played an important role at ACM SIGMOD 2017, considered one of the two premier venues for data management researchers worldwide. Now faculty affiliated with CSE's Database Lab are preparing for the other such venue: the 43rd International Conference on Very Large Data Bases (VLDB). The five-day conference is slated to begin on August 28 at the Technical University of Munich in Bavaria, Germany, and CSE will be represented by four regular research papers accepted for presentation at VLDB 2017.
Ahead of the conference, the CSE Database Seminar Series scheduled its final seminar of the Spring 2017 quarter on Friday, June 9. Rather than featuring a single topic, CSE Prof. Victor Vianu invited three Ph.D. students to preview the papers they'll be presenting at VLDB 2017. All three are first authors on their respective papers accepted to the conference. According to Vianu, the talks were "dry runs" ahead of the formal presentations at VLDB in Munich. All three students are advised by Prof. Yannis Papakonstantinou, and both Jianguo Wang and Jing Li are co-advised by Prof. Steven Swanson.
MILC: Inverted List Compression in Memory
Presenter: Jianguo Wang
Abstract: Inverted list compression is a topic that has been studied for 50 years due to its fundamental importance in numerous applications, including information retrieval and databases. Typically, an inverted list compression algorithm is evaluated on its space overhead and query processing time. Earlier list compression designs mainly focused on minimizing the space overhead to reduce expensive disk I/O time in disk-oriented systems. But the recent trend is towards reducing query processing time because the underlying systems tend to be memory-resident. Although there are many highly optimized compression approaches in main memory, there is still a considerable performance gap between query processing over compressed lists and uncompressed lists, which motivates this work. In this work, we set out to bridge this performance gap for the first time by proposing a new compression scheme, namely, MILC (memory inverted list compression). MILC relies on a series of techniques including fixed-bit encoding, dynamic partitioning, in-block compression, cache-aware optimization, and SIMD acceleration. We conduct experiments on two real-life datasets to demonstrate the high performance and low space overhead of MILC. We compare MILC with 12 recent compression algorithms and experimentally show that MILC improves query performance by up to 13.2X and reduces the space overhead by up to 3.7X. In particular, compared with uncompressed lists, MILC achieves similar query performance but with a 3.7X to 4.7X lower space overhead. Compared with widely used compression algorithms, e.g., PforDelta, MILC is 5.65X to 5.75X faster and uses less space. (By Jianguo Wang, Chunbin Lin, Ruining He, Moojin Chae, Yannis Papakonstantinou and Steven Swanson.)
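To give a flavor of one of the techniques the abstract names, here is a minimal, hypothetical sketch of fixed-bit encoding (not the authors' code): each block of a sorted inverted list stores offsets from its first element using one fixed bit width, which is what lets a compressed list still support O(1) random access to any element.

```python
# Illustrative sketch of fixed-bit encoding for a sorted inverted list.
# Every offset in a block is packed with the same bit width, so the i-th
# element can be decoded without touching its neighbors.

def encode_block(block):
    """Encode a sorted block as (base, bit_width, packed_offsets)."""
    base = block[0]
    offsets = [x - base for x in block]
    bit_width = max(1, max(offsets).bit_length())
    packed = 0
    for i, off in enumerate(offsets):
        packed |= off << (i * bit_width)
    return base, bit_width, packed

def decode_at(encoded, i):
    """Random access: decode only the i-th element of the block."""
    base, bit_width, packed = encoded
    mask = (1 << bit_width) - 1
    return base + ((packed >> (i * bit_width)) & mask)

docids = [1000, 1003, 1020, 1024, 1090]
enc = encode_block(docids)
assert [decode_at(enc, i) for i in range(len(docids))] == docids
```

Dynamic partitioning, cache-aware layout, and SIMD decoding in the actual MILC design build on top of this basic idea; the sketch shows only the fixed-bit core.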
HippogriffDB: Balancing I/O and GPU Bandwidth in Big Data Analytics
Presenter: Jing Li
Abstract: As data sets grow and conventional processor performance scaling slows, data analytics move towards heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases fail to deal with big data applications efficiently: their execution model suffers from scalability limitations on GPUs, whose memory capacity is limited, and existing systems fail to consider the discrepancy between fast GPUs and slow storage, which can counteract the benefit of GPU accelerators.
In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for improved I/O bandwidth. To improve data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism that moves data directly from the SSD to the GPU. HippogriffDB adopts a query-over-block execution model that provides scalability using a stream-based approach, and the model improves kernel efficiency with operator fusion and a double-buffering mechanism. We have implemented HippogriffDB using an NVMe SSD that talks directly to a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency. HippogriffDB outperforms existing GPU-based databases (YDB) and in-memory data analytics (MonetDB) by 1-2 orders of magnitude. (By Jing Li, Hung-Wei Tseng, Chunbin Lin, Yannis Papakonstantinou and Steven Swanson.)
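The query-over-block model with double buffering can be sketched in miniature. The following is a hypothetical, CPU-only analogue (the real system streams blocks from an NVMe SSD into GPU kernels): a bounded queue of depth two lets a loader thread fetch block k+1 while the worker processes block k, overlapping "I/O" with "compute".

```python
# CPU-only sketch of query-over-block execution with double buffering.
# A queue of maxsize 2 models the two in-flight buffers: the loader
# stalls only when both buffers are full.
import threading
import queue

def run_query(blocks, process):
    buf = queue.Queue(maxsize=2)          # the double buffer
    results = []

    def loader():
        for b in blocks:
            buf.put(b)                    # blocks when both buffers are full
        buf.put(None)                     # end-of-stream marker

    t = threading.Thread(target=loader)
    t.start()
    while (b := buf.get()) is not None:
        results.append(process(b))        # stand-in for a fused GPU kernel
    t.join()
    return results

# Toy "query": compute a per-block partial sum, then aggregate.
partials = run_query([[1, 2, 3], [4, 5], [6]], sum)
assert sum(partials) == 21
```

The per-block partial results mirror how a streaming OLAP system keeps GPU memory bounded regardless of table size; the names `run_query` and `loader` are invented for this sketch.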
Fast In-Memory SQL Analytics on Typed Graphs
Presenter: Chunbin Lin
Abstract: We study a class of graph analytics SQL queries, which we call relationship queries. These queries, which involve aggregation, join, semijoin, intersection, and selection, form a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios in which efficient relationship queries are needed; however, row stores, column stores, and graph databases are unacceptably slow in such scenarios.
In this paper, we propose GQ-Fast, an indexed database that roughly corresponds to an efficient encoding of annotated adjacency lists, combining salient features of column-based organization, indexing, and compression. GQ-Fast uses a bottom-up, fully pipelined query execution model, which (a) enables aggressive compression (e.g., compressed bitmaps and Huffman coding) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, it selectively allows dense forms of compression, including heavy-weight compression schemes that do not support random access.
We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2–4 orders of magnitude and MonetDB, Vertica, and Neo4j by 1–3 orders of magnitude when all of them run in RAM. Our experiments dissect GQ-Fast's advantage among (i) the use of compiled code, (ii) the bottom-up pipelined execution strategy, and (iii) the use of dense structures. Additional analysis and experiments show the space savings GQ-Fast achieves through the appropriate use of compression methods. We also show that the runtime penalty incurred by the dense compression methods decreases as the number of CPU cores increases. (By Chunbin Lin, Benjamin Mandel, Yannis Papakonstantinou and Matthias Springer.)
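The bottom-up, fully pipelined execution the abstract describes can be illustrated with a toy relationship query. The sketch below is hypothetical (it is not GQ-Fast, which compiles such plans to C++): a two-hop query with a semijoin and aggregation runs as nested loops over adjacency lists, so the aggregate accumulates in place and no intermediate table of row IDs is ever materialized.

```python
# Illustrative sketch of pipelined execution over adjacency lists.
# The selection predicate stands in for a compressed bitmap; the nested
# loops stand in for the generated, fully pipelined query code.

graph = {                        # adjacency lists: node -> neighbors
    'a': ['b', 'c'],
    'b': ['c', 'd'],
    'c': ['d'],
    'd': [],
}
selected = {'c', 'd'}            # selection, e.g. backed by a bitmap

def two_hop_count(start):
    """Count 2-hop endpoints of `start` that satisfy the selection."""
    count = 0
    for mid in graph[start]:         # hop 1
        for end in graph[mid]:       # hop 2 -- pipelined, no temp table
            if end in selected:      # semijoin against the selection
                count += 1           # aggregation accumulates in place
    return count

assert two_hop_count('a') == 3       # a->b->c, a->b->d, a->c->d
```

Because nothing is materialized between operators, the adjacency lists themselves can stay in dense, even non-randomly-accessible, compressed forms and be decoded on the fly, which is the property the paper exploits.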
In addition to the papers previewed above, a fourth paper was accepted into the Research Track at VLDB 2017: Towards Linear Algebra over Normalized Data, by CSE professor Arun Kumar and his co-authors from the University of Wisconsin-Madison (where Kumar recently completed his Ph.D.). His Wisconsin co-authors included Ph.D. student Lingjiao Chen and professors Jeffrey Naughton and Jignesh Patel.
It is anticipated that UC San Diego and CSE will be represented elsewhere on the program in Munich, whether in connection with tutorials or with the nearly 10 workshops scheduled for the first and last days of VLDB. One of those co-located workshops also features a UC San Diego voice on its program committee: San Diego Supercomputer Center research scientist Chaitan Baru sits on the program committee of the Conference on Performance Evaluation and Benchmarking (TPCTC). To a large extent, though, CSE's involvement at VLDB will fall short of the department's visibility at SIGMOD, because VLDB is primarily an international conference at an international venue.