Data analytics researchers like to focus on new engines and algorithms. But most data analytics organizations have a different and more fundamental problem: their analysts operate in a desperately information-poor environment. We are missing so much rich contextual metadata in our projects: what data we have, why we have it, how and by whom it gets used, and how all these aspects evolve over time. One problem is that we haven’t been capturing and recording this information. A second is that we have yet to apply data science to the behavior of data scientists. It is time to get serious about capturing, recording, and analyzing the work people do with data and computation and the contextual human knowledge they bring to those tasks. This data context problem raises interesting systems challenges while also suggesting opportunities for new applications and algorithms to significantly improve the efficiency of data analysts.
Halevy, Korn, Noy, et al. Goods: Organizing Google’s Datasets. In SIGMOD 2016.
Hellerstein, Sreekanti, Gonzalez, et al. 2016. Establishing Common Ground with Data Context. In CIDR’17.
Vu Le and Sumit Gulwani. 2014. FlashExtract: A Framework for Data Extraction by Examples. In PLDI’14.
Marco D. Adelfio and Hanan Samet. 2013. Schema Extraction for Tabular Data on the Web. In VLDB’13.