Help is at Hand. The data management lessons we did not learn with spreadsheet has now returned with a vengeance all thanks to the current explosion of analytics!
Spreadsheets proliferate like poisonous vines, slowly strangling organisations by depriving them of a single consistent set of information. In fact, it has grown immeasurably worse in recent years with an explosion of data sources of questionable quality, the emergence of Hadoop and data lakes, a plethora of overlapping and competing analytical tools, and a pivot to cloud-based, self-service offerings, often with limited governance. The parallels between both the drivers and challenges of the spreadsheet and analytics environments offer important lessons for today’s IT.
The Joy of Spreadsheets
Since their invention in 1979, spreadsheets have become by far the most popular data exploration tool for business people of all analytical skill levels. Their wide availability, ease of use, and support for simply playing with data have made them indispensable to users but unloved by IT departments trying to build data warehouses or promote carefully selected BI tools. Following years of trying to “kill the spreadsheets,” most data warehouse and BI professionals now simply shake their heads sadly at the mention of the subject.
Few recognize even today that spreadsheets themselves are not the problem. The real issue is their promotion from individual use (where they excel – pun intended – at promoting innovation) to departmental and, most dangerously, corporate use. Once unleashed to broad, shared use, the limitations in data quality management and governance become immediately apparent, leading to a rich history of spreadsheet horror tales.
In addition to human error and poor quality-management practices within spreadsheets, businesses seldom conduct due diligence when promoting spreadsheets to broader use. Furthermore, the major spreadsheet vendors have done little to create an underpinning foundation of data and process management within or around the base function to enable auditing, track source lineage, and manage its wider usage.
A Pain in the Analytics
Leaders in analytics and machine learning have focused strongly on data exploration and insight discovery, where data scientists take a leading role. However, more recently, data concerns have come to the fore as data scientists have realised the hard work and time involved in data acquisition and preparation. In this sense, the exploratory work of a data scientist exactly parallels that of a spreadsheet user. The added challenge for data scientists is the variety of sources, data volumes, and variable data quality they encounter.
The process of putting analytics into production is conceptually similar to that for spreadsheets, requiring a shift of focus from creativity and innovation to engineering, and – ideally – a shift in organisational responsibility from business to IT. However, the process for analytics is technically more complex because production analytical models must link directly to operational data – both traditional transactions and sensor or clickstream events in near real time. Furthermore, model deployment and ongoing model management demands more technical knowledge and system access than simply sharing spreadsheets.
Spread the Learning
Putting spreadsheets into production often amounts to no more than sharing the spreadsheet and its data with colleagues who run it – and often modify it – locally. Because of the limited function within spreadsheets to manage “productionization,” IT and business together must build bespoke tools, processes, and organisational workarounds. These start from honouring the value of individual innovation and using collaborative thinking and tools to evaluate the usability and correctness of the spreadsheet in all circumstances before handing it over to IT to address production issues such as sourcing and performance. As a largely organisational approach, it is easily extended to analytics.
From the architectural and technological viewpoints, the lesson to be learned from spreadsheets is that software vendors must take appropriate and increased responsibility for data management. This involves embedding stronger data governance function in their analytics offerings and enabling more flexible and powerful data and process linkages between the analytical and operational environments. The result is a significant realignment of the traditional data warehouse and lake architectures to a production analytics platform.
In contrast to the spreadsheet situation, vendors across all aspects of data management – database management systems, data integration and preparation tools, and metadata, catalogue, and governance tools – are delivering the functionality required. Although still at an early stage, leaders in analytics are turning their attention from data science (with its focus on algorithms) to data engineering that creates an overall data and process management environment where algorithms are more simply and efficiently developed, run, deployed, and managed.