Common Issues with Data Quality
- Atad Data
- Jan 19, 2024
- 6 min read
It’s easy to assume that once you’re gathering data from sources and analysing it, unlocking the value of the data will be straightforward. Obviously, nothing’s that simple.
First, let’s presume that your data team is perfect, and all of the work they’ve done to collate and transform the data is faultless… Okay, that would be a very dangerous assumption, but in this case it’s for simplicity rather than hubris - issues that can arise from ETL (Extract, Transform, Load) and faulty analysis will be dealt with in a separate post.
Now, let’s think about where our source data is coming from. Unless we’re referring to hand-written, paper data (yes, that does still exist), it will come from a system of some sort. Typical ‘systems’ include:
- CRM (Customer Relationship Management) - managing dealings with customers
- EPOS (Electronic Point of Sale) - eg retail/hospitality sales transactions
- ERP (Enterprise Resource Planning) - operational systems, eg Finance/HR
- Websites - traffic and engagement
- Marketing platforms, eg Google/Meta Ads - performance of marketing spend
There are some fairly basic ways these systems can go wrong:
- Missing data
  - Data was never recorded
  - Data has been lost in transit
- Incorrect data
  - Values are recorded incorrectly
  - Data has been assigned incorrectly
- Duplicate data
  - Data has been recorded twice
  - Lack of standardisation across separate systems means the same information ‘looks’ different
- Data between systems doesn’t match
  - Systems use different identifiers/naming conventions

The simplest example of missing data is where human input is required. You have an outbound sales team with targets to book appointments with B2B (business-to-business) clients. Some of the prospects are tied in to another supplier for a contracted period. You need your sales team to gather as much information as possible about the prospective clients’ needs, but you also need them to record when their current deal is due to end. Periodically, you run a report to list all of the prospects whose existing contracts end in the next few months. If your sales team haven’t recorded the renewal dates for half of the prospects, those prospects will never show up in your report (note, even better than a report would be a trigger for action, eg x weeks/months before renewal, a call to the prospect is created in the CRM system).
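To make that concrete, here’s a minimal sketch (pandas assumed, field names hypothetical) of a ‘contracts ending soon’ report - any prospect whose renewal date was never recorded simply drops out of the filter and is never chased:

```python
import pandas as pd

# Hypothetical CRM extract: half the prospects have no renewal date recorded
prospects = pd.DataFrame({
    "prospect": ["Acme Ltd", "Bright Co", "Cedar plc", "Dune Ltd"],
    "contract_end": pd.to_datetime(["2024-03-01", None, "2024-02-15", None]),
})

today = pd.Timestamp("2024-01-19")
horizon = today + pd.Timedelta(days=90)

# Prospects whose current deal ends in the next 90 days
ending_soon = prospects[prospects["contract_end"].between(today, horizon)]
print(ending_soon)                                    # Acme Ltd and Cedar plc only

# The missing (NaT) rows are silently excluded - these prospects will never be chased
missing = prospects[prospects["contract_end"].isna()]
print(f"{len(missing)} prospects have no renewal date recorded")
```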
Where no human input is required, it could be assumed that there will be no data loss. However, there are multiple ways it can still occur. Behind every system is code. Within the code, values are often held in temporary memory, and complex logic decides what happens to those values. If there is an error, and it isn’t handled correctly, these temporary values may never be written to a permanent data store. Modern systems communicate using APIs, and an error or communication issue can cause data to be lost - again, the code needs to handle this correctly. In this case, it could be that a telephone operator has made a call in one system, but the action and its outcome aren’t recorded in the collated data. If the operator has targets relating to these actions, the missing data could affect their performance metrics. From a business perspective, missing data means the picture you see is incomplete.
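As a hedged illustration (the API client, endpoint and queue file are all made up), the difference between losing and keeping a record often comes down to what the code does when the write fails:

```python
import json
import logging

def record_call_outcome(payload: dict, api_client, retry_queue_path: str = "unsent_calls.jsonl"):
    """Send a call outcome to the reporting API; never silently drop it on failure."""
    try:
        api_client.post("/call-outcomes", json=payload)   # hypothetical client and endpoint
    except Exception as exc:
        # Without this branch, the payload would only ever have existed in memory
        logging.warning("API write failed (%s); queuing locally for retry", exc)
        with open(retry_queue_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(payload) + "\n")
```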
For data entered by humans, the starting point for correct data is in the system design - making data entry as simple and efficient as possible using dropdown lists, toggles etc. However, some attempts to enforce data completion, such as mandatory fields, can result in anomalies in the data. Where a field is made mandatory but is not applicable to all scenarios, it can end up being populated with a variety of ‘not applicable’ values (spaces, ‘...’, ‘N/A’, ‘NA’, ‘not app’ etc), or worse, the first item in a dropdown list. The latter can falsely skew analysis of the data. Beyond this, it is essential to ensure that data is recorded accurately by the system itself.
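A common clean-up step, sketched below with pandas and a made-up ‘cancellation_reason’ field, is to map the zoo of ‘not applicable’ entries back to a genuine null so they can’t masquerade as real answers:

```python
import pandas as pd

# Hypothetical export of a mandatory field that users filled with placeholder junk
raw = pd.Series(["Price", "N/A", "na", "...", "   ", "not app", "Moved house", "NA"])

PLACEHOLDERS = {"n/a", "na", "not app", "...", ""}

stripped = raw.str.strip()
is_placeholder = stripped.str.lower().isin(PLACEHOLDERS)
cleaned = stripped.mask(is_placeholder)           # genuine values kept, placeholders become NaN

print(cleaned.value_counts(dropna=False))         # real reasons vs how many were 'not applicable'
```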
A great example of this is the recording of times of completed actions. The history of the standardisation of dates and times is pretty interesting, but from a data fundamentals perspective, you want to know when an action was completed in 1/ a consistent timezone and 2/ the local timezone. Let’s say you’re a UK company that deals with customers in Australia, but you also have a team who assess the customers’ requests in the US. In this example, we’ll presume it’s winter, and the UK is using GMT (Greenwich Mean Time), which equates to UTC (Coordinated Universal Time - the abbreviation is a compromise between the English and French word orders). The customer is in Western Australia, and the resolution team is in the Eastern United States.
| Action | Local time | UK time (UTC/GMT) |
| --- | --- | --- |
| Action 1 (Enquiry) | 22:00 (GMT+8) | 14:00 |
| Action 2 (Resolution) | 11:00 (GMT-5) | 16:00 |
Using UTC, we would say that the time taken to resolve this was two hours (16:00 minus 14:00). However, if we had only recorded the local times, the elapsed time would be -11 hours, which is clearly wrong… Just storing UTC may also be insufficient - if we wanted to look at trends in enquiries or resolutions, using UTC would create some strange-looking results: it could look like our peak times were in the middle of the night! This means we either have to store multiple timestamps, or store the timezone of the action alongside it. (This international processing arrangement probably wouldn’t make much business sense though, as many enquiries would occur while the US team were sleeping…)
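A minimal sketch of the ‘store the instant plus the zone’ approach, using Python’s zoneinfo (the date is illustrative): timezone-aware datetimes compare on the underlying UTC instant, so the elapsed time comes out correctly, while the local wall-clock time stays available for peak-hours analysis.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Record each action with its local timezone attached (illustrative date)
enquiry    = datetime(2024, 1, 15, 22, 0, tzinfo=ZoneInfo("Australia/Perth"))    # 22:00 local
resolution = datetime(2024, 1, 15, 11, 0, tzinfo=ZoneInfo("America/New_York"))   # 11:00 local

print(resolution - enquiry)               # 2:00:00 - aware datetimes subtract on the UTC instant

print(enquiry.astimezone(timezone.utc))   # 2024-01-15 14:00:00+00:00 - consistent timeline
print(enquiry.timetz())                   # 22:00:00+08:00 - local time for trend analysis
```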
Another common cause of inaccurate data is precision. In the previous example, we used timestamps. If we want to analyse actions and the time elapsed between them, or if we want to order actions chronologically, it makes sense to record the time of each action down to the millisecond. If data were only stored to the nearest minute, for example, this could seriously compromise audit trails and analyses. Precision also has a crucial role to play in financial metrics.
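A quick illustration (timestamps invented) of why coarse precision can undermine an audit trail - truncate the same two events to the minute and the ordering evidence disappears:

```python
from datetime import datetime

# Two actions 800 ms apart - at full precision the order is unambiguous
approve = datetime(2024, 1, 19, 10, 15, 2, 100_000)   # 10:15:02.100
payout  = datetime(2024, 1, 19, 10, 15, 2, 900_000)   # 10:15:02.900
print(payout > approve)                                # True

# Stored only to the nearest minute, we can no longer prove which came first
approve_min = approve.replace(second=0, microsecond=0)
payout_min  = payout.replace(second=0, microsecond=0)
print(payout_min > approve_min)                        # False - they now look simultaneous
```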
As always, this is a very simplified example. A bank pays interest on savings. This bank is more worried about simplicity than serving customers, so it calculates interest once a year, based on the average daily balance in the period. The table below shows the impact if the interest rate or the balance were stored without the correct number of decimal places. The third row is the most egregious - the interest rate hasn’t even been rounded; the final decimal place has simply been cut off. In our basic example, we’re using errors that amount to pence and pounds, but scaled up to billions of pounds in combined balances, this would amount to millions of pounds.

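The same idea can be sketched with Python’s decimal module - the balance and rate below are illustrative figures, not the ones from the table, but they show how truncating rather than rounding quietly leaks money on every account:

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_DOWN

balance = Decimal("12345.67")    # average daily balance (illustrative)
rate    = Decimal("0.0475")      # 4.75% annual interest (illustrative)

exact = balance * rate                                                   # 586.419325
print(exact.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))           # 586.42 - rounded correctly
print(exact.quantize(Decimal("0.01"), rounding=ROUND_DOWN))              # 586.41 - decimal cut off

# Chopping a digit off the rate itself (0.0475 -> 0.047) is worse again
bad_rate = Decimal("0.047")
print((balance * bad_rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 580.25 - about £6 short
```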
These mistakes would be unlikely in the database itself, as it will usually have been designed by someone familiar with these scenarios, and the stored values are easily visible to multiple users. Within the calculations and logic of lines of code, however, basic errors can easily suppress decimal places, which means the loss of precision isn’t seen. From a systems perspective, coding standards and rigorous, realistic testing are the key, but beyond that, data teams have a crucial role in monitoring data for anomalies, and in ensuring that data quality issues aren’t created in the onward journey of the data. This will be discussed in future articles.
ATAD also have experience with implementing and maintaining the standards of ISO 27001 - the best-known standard for information security management systems (ISMS). The three core tenets of the standard are Confidentiality, Integrity and Availability. A failure in any of these - whether through insufficient controls on access to data or insecure systems - can lead to data being corrupted, compromised or even deleted. Evidently, these are pretty catastrophic cases, but without adequate controls in place, they can prove an existential threat to any organisation.

There are numerous ways duplicates can occur in your data. As with all data issues, the most likely source is human entry - by customers or employees. Creating a ‘single customer view’ to combine similar-looking customer data into one unique record per customer is a whole other topic, but on a basic level, it’s easy to see how ‘1 High Street’ and ‘1 High St’ could create different records without logic to deal with it. From a system perspective, flaws in system logic can cause data to be written twice. Postcode lookups on websites can be annoying if your address doesn’t appear, but they provide consistently formatted address data, and go a long way to resolving this issue. Treating one customer as multiple customers and double counting in your analyses is not ideal, but clearly duplicating revenue or costs could have a more significant impact on your reporting!
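Not a full ‘single customer view’ by any means, but a deliberately crude sketch of the kind of normalisation that stops ‘1 High St’ and ‘1 High Street’ becoming two customers:

```python
import re

ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}   # illustrative, far from exhaustive

def normalise_address(address: str) -> str:
    """Lower-case, strip punctuation and expand common abbreviations before matching."""
    words = re.sub(r"[^\w\s]", "", address).lower().split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

records = ["1 High Street", "1 High St.", "1 high st"]
print({normalise_address(a) for a in records})    # {'1 high street'} - one customer, not three
```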