Having been involved in the Data Warehouse/Business Intelligence/Analytics/Big Data industry since the late 90s (that shows my age!!), I have lost count of the thousands of hours of work, debate, discussion and frustration that have gone into the topic of Data Quality, and I can say right now that I am heartily sick of it.
Moaning about poor data quality indicates that we misunderstand the process we are analysing and lack the cognitive skills to be a good data analyst or data scientist. Let me show you why and (perhaps) help you …
Let’s start with an example.
I often fill out web forms in order to get access to sponsored white papers or material that I want to read. Typically my input goes like this:
First Name: xxx
Last Name: xxx
I can just imagine the marketing/CRM analyst looking at the database of potential customers and saying, “ok, so we need to delete the rubbish data that some people put in”.
Or the IT Project Manager who asks the developer, "Now, what validation can you put on these fields? We need to check postcodes, states and email addresses."
But they are wrong … I have actually provided them with very valuable information.
First off, if the developer has been clever he will have collected my IP address and can use that to trace the country of origin of my request. If he is lucky he may be able to get the browser sign-in details (via OAuth). So now he knows there is a user with a specific browser sign-in, located in Australia, who wants to be anonymous. This is not rubbish data. It absolutely helps you to know how to target me – people choose to stay anonymous for a reason, and you need to respect that reason and get a little cleverer at finding me and offering me relevant content.
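To make the idea concrete, here is a minimal sketch of treating placeholder entries as a segment rather than deleting them. The field names, the placeholder patterns and the sample record are all illustrative assumptions, not a standard:

```python
import re

# Strings people commonly type to stay anonymous on web forms.
# This pattern list is an illustrative assumption, not exhaustive.
PLACEHOLDER = re.compile(r"^(x+|n/?a|test|asdf+|anon(ymous)?)$", re.IGNORECASE)

def classify_lead(record):
    """Tag a form submission instead of discarding it.

    `record` is a dict like {"first_name": ..., "last_name": ..., "ip_country": ...}.
    Adds a 'segment' key and returns the record.
    """
    first = (record.get("first_name") or "").strip()
    last = (record.get("last_name") or "").strip()
    if PLACEHOLDER.match(first) and PLACEHOLDER.match(last):
        # Valuable signal: this person wants content, not contact.
        record["segment"] = "anonymous"
    else:
        record["segment"] = "identified"
    return record

lead = {"first_name": "xxx", "last_name": "xxx", "ip_country": "AU"}
print(classify_lead(lead)["segment"])  # anonymous
```

The point of the design is that nothing is thrown away: the "rubbish" rows become a segment you can serve differently.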
Let’s take another example. Consider a set of data you just imported from some source which has embedded carriage returns that stuff up your ETL scripts. Here is a clear indication of an input process issue – somewhere in the IT systems that capture this information there is a "flaw" that allows a carriage return (CR) into a data stream that you want to use for analysis. How did this occur, and what other issues might we uncover? It could be that a data entry operator is tabbing across fields and entering a CR by mistake, or it could be that the extraction program has failed in some way.
In any case this data is potential gold, as it points to an improvement opportunity somewhere in the data input process, the program that captures the data or the way it is processed.
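A simple way to act on this is to quarantine the affected rows for investigation instead of silently cleansing them. The sketch below, using Python's standard csv module, assumes an inline CSV sample with one embedded carriage return in a quoted field:

```python
import csv
import io

# Illustrative sample: row 2 has an embedded carriage return in a quoted field.
raw = 'id,comment\n1,"fine row"\n2,"line one\rline two"\n3,"ok"\n'

clean_rows, flagged_rows = [], []
# newline='' lets the csv module handle line endings inside quoted fields.
for row in csv.DictReader(io.StringIO(raw, newline="")):
    if any("\r" in (value or "") for value in row.values()):
        flagged_rows.append(row)  # keep for process investigation, don't delete
    else:
        clean_rows.append(row)

print(len(clean_rows), len(flagged_rows))  # 2 1
```

The flagged rows are the "potential gold": counting where and when they appear points you back to the capture or extraction step that needs fixing.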
Of course I understand you have a job to do – you are under pressure from your Project Manager to get the customer database imported into the new data warehouse by tomorrow so you really do not need all these “data quality” problems!
So why am I writing this today (and not 5 years ago)? Well, I think we have an opportunity to throw so-called "data quality problems" at the emerging wall of Deep Learning and use this to improve the way that enterprises work.
Today we can feed a Deep Learning algorithm with all our so-called poor quality data, and the algorithm will learn what data is valid for analysis, what is invalid, what is an outlier, and so on. This begins to eliminate the ETL (extract, transform, load) effort that used to occupy so many hours on analytics projects.
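A full Deep Learning model is beyond a blog snippet, but a much simpler stand-in shows the underlying idea: let the data itself define what is "normal" rather than hand-coding validation rules. This sketch uses a robust median/MAD screen (not deep learning – named plainly as a substitute), and the sample amounts are invented for illustration:

```python
import statistics

def flag_outliers(values, k=3.0):
    """Flag values more than k robust deviations from the median.

    A deliberately simple stand-in for a learned validity model:
    the data defines 'normal', not a hand-written ETL rule.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    scale = 1.4826 * mad or 1.0  # MAD approximates std dev; guard against zero
    return [abs(v - med) / scale > k for v in values]

amounts = [102.0, 98.5, 101.2, 99.9, 100.3, 5000.0]
print(flag_outliers(amounts))  # only the last value is flagged
```

The same principle scales up: a model trained on the raw feed learns the shape of valid records, and anything it flags is a candidate for investigation rather than deletion.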
In the future I believe we can develop algorithms smart enough to learn from the source data, and from the systems and process steps that produced it, and then advise us on which steps in the process are problematic and how to correct or adapt to what the data is telling us.
… and then Data Quality will be a problem of the past.