Appendix: Data Quality Metrics

Establishing data quality metrics helps support higher quality data. There is no such thing as perfect data quality; quality is always relative to the business needs for the data. Below is an overview of common data quality metrics along with sample rules. Requiring regular reports on data quality metrics can be an effective way to enforce quality.

Common Data Quality Metrics

For each metric below, a description is followed by an example based on a buildings dataset.

Completeness

Often defined as % of data values that are complete, e.g. % of buildings with a parcel number. Mandatory elements should be 100% complete. You can also define completeness for a dataset (versus a single field).

We have a universe of 300 buildings and 250 entries. Our dataset has a completeness rate of 83%.
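Both kinds of completeness can be computed directly. A minimal sketch in Python, assuming records are dicts and that a hypothetical "parcel" field marks missing values as None or an empty string:

```python
def completeness(records, field):
    """Share of records where `field` has a non-missing value."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def dataset_completeness(record_count, universe_size):
    """Share of the real-world universe captured in the dataset."""
    return record_count / universe_size

buildings = [
    {"parcel": "001A"},
    {"parcel": None},
    {"parcel": "002B"},
    {"parcel": ""},
]
print(f"{completeness(buildings, 'parcel'):.0%}")  # field-level: 50%
print(f"{dataset_completeness(250, 300):.0%}")     # dataset-level: 83%
```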

Validity

Each data field has a syntax, range or set of rules it should conform to. It can be the contents of a pick list (constrained list of possible values) or valid numerical ranges (greater than 0) or format like UTM. During Step 2, you should have listed what is a valid input and then ensure that input methods and checks reinforce those rules.

The date of building occupancy should be recorded as YYYYMMDD. (Note - for usability purposes, you wouldn't want to require end users to input in this format. A drop down calendar would be much easier.)
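A validity rule like the YYYYMMDD format above can be checked automatically at input time. A minimal sketch using Python's standard library date parser (the function name is illustrative):

```python
from datetime import datetime

def is_valid_occupancy_date(value):
    """True if the value parses as a real YYYYMMDD calendar date."""
    try:
        datetime.strptime(value, "%Y%m%d")
        return True
    except (ValueError, TypeError):
        return False

print(is_valid_occupancy_date("20240229"))  # True (2024 is a leap year)
print(is_valid_occupancy_date("20240231"))  # False (February 31 does not exist)
```

Note that parsing as a date catches more than a pattern match would: "20240231" has the right shape but is not a real date.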

Timeliness

How timely is the data? What is the time difference between when the event or activity took place and when you obtain the data? You can measure it as the time difference or as the frequency of collection. An example rule: if we collect data on paper forms, it must be entered within a week. Timeliness should meet the business needs of the dataset.

The core building data is collected annually. A subset of fields must be updated quarterly.
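The "entered within a week" rule above can be measured as a simple lag between two dates. A minimal sketch, assuming each record carries a collection date and an entry date:

```python
from datetime import date

MAX_LAG_DAYS = 7  # rule: paper forms must be entered within a week

def is_timely(collected_on, entered_on, max_lag_days=MAX_LAG_DAYS):
    """True if the record was entered within the allowed lag after collection."""
    return (entered_on - collected_on).days <= max_lag_days

print(is_timely(date(2024, 1, 1), date(2024, 1, 5)))   # True
print(is_timely(date(2024, 1, 1), date(2024, 1, 15)))  # False
```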

Uniqueness

If an object or event is unique in the real world, it should be unique in your dataset. You could define a metric such as items in the dataset divided by the count in the real world. If the metric is 100%, you are on target; if not, you either have too many records (duplicates) or you are missing buildings. Another example: having no more than one enrollment record per student per school year. This allows multiple records for the same student in the same dataset, but with different values in related fields, e.g. year.

For our 300 buildings example, we should have no more than 300 buildings in our dataset.
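Both the ratio metric and a duplicate check can be sketched briefly; the "building_id" key field here is a hypothetical identifier:

```python
from collections import Counter

def uniqueness_ratio(records, real_world_count):
    """Dataset count over real-world count; 1.0 means on target."""
    return len(records) / real_world_count

def duplicates(records, key):
    """Key values that appear more than once in the dataset."""
    counts = Counter(r[key] for r in records)
    return [value for value, n in counts.items() if n > 1]

buildings = [{"building_id": i} for i in (1, 2, 3, 3)]
print(uniqueness_ratio(buildings, 300))
print(duplicates(buildings, "building_id"))  # [3]
```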

Consistency

Does your dataset share fields with other datasets (e.g. neighborhoods, districts, gender)? If so, use the same definitions for shared fields (e.g. client information, address, race/ethnicity, gender). Use standard geographic boundaries such as planning district, supervisor district, or census tract as well. Having standards and implementing them consistently makes it easier to combine datasets. Standards may come from City, State, or Federal governments as well as professional organizations. Consistency also applies to format: for example, dates should have a consistent format across all datasets. It can also apply to derived data, for example how you calculate age from date of birth. Finally, you can constrain how one data element relates to another: for example, an address with a zip code of 94102 must mean the state is "CA".

The building use field must match the definition used in other datasets (e.g. commercial, public, private).
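A cross-field constraint like the zip-code example above can be checked with a lookup table. A minimal sketch with a tiny illustrative table (a real check would use a complete ZIP-to-state reference):

```python
# Hypothetical lookup; in practice, load a full ZIP-to-state reference table.
ZIP_TO_STATE = {"94102": "CA", "10001": "NY"}

def consistent_zip_state(record):
    """True if the record's state matches the state implied by its ZIP code."""
    expected = ZIP_TO_STATE.get(record.get("zip"))
    return expected is None or record.get("state") == expected

print(consistent_zip_state({"zip": "94102", "state": "CA"}))  # True
print(consistent_zip_state({"zip": "94102", "state": "NV"}))  # False
```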

Accuracy

Accuracy is the degree to which your dataset represents reality. This one is trickier to check because you have to look outside your dataset. Ways to check accuracy include comparing to similar datasets as well as doing spot checks or audits. Use the accuracy rates in sample inspections to estimate the accuracy of the dataset and/or fields. You should also establish procedures for incorporating accuracy checks into your change management processes.

Annually, staff takes a sample of buildings and compares the database to site visits.
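The annual spot check above amounts to drawing a random sample and recording whether each audited record matched reality. A minimal sketch, where each audit result is a True/False match:

```python
import random

def audit_sample(dataset, sample_size, seed=None):
    """Randomly pick records to verify against site visits."""
    rng = random.Random(seed)
    return rng.sample(dataset, sample_size)

def accuracy_rate(sample_results):
    """Share of audited records that matched reality."""
    return sum(sample_results) / len(sample_results)

# e.g. 18 of 20 audited buildings matched the site visit
print(f"{accuracy_rate([True] * 18 + [False] * 2):.0%}")  # 90%
```

Fixing the random seed makes the sample reproducible, which helps when the audit list must be shared with field staff in advance.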

Example Quality Rules

Below are some sample data quality requirements for each metric.

Completeness: Mandatory elements should be 100% complete.

Validity: Fields must meet validation requirements.

Timeliness: Collect A, B, C fields once a year, no later than D date, and complete entry within 30 business days. Update X, Y, Z fields daily, as they rely on data that is updated daily.

Uniqueness: Entries must be unique, with no duplicates of the key identifier.

Consistency: X field should use the field definition used by Y department.

Accuracy: Visit a sample set of buildings (no less than Z but no more than Y) annually. Generate the list randomly. Compare the sample to dataset entries to create an estimate of accuracy.

For each rule, identify how you will measure and track it.
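One way to measure and track such rules is to compute each metric on a schedule and log the results. A minimal sketch combining a few of the checks above into a dated report; the "parcel" and "building_id" field names are hypothetical:

```python
from datetime import date

def quality_report(records, universe_size):
    """Compute headline quality metrics for a buildings dataset."""
    parcel_filled = sum(1 for r in records if r.get("parcel"))
    ids = [r.get("building_id") for r in records]
    return {
        "as_of": date.today().isoformat(),
        "completeness": len(records) / universe_size,
        "parcel_completeness": parcel_filled / len(records),
        "uniqueness": len(set(ids)) / len(ids),
    }

records = [
    {"building_id": 1, "parcel": "001A"},
    {"building_id": 2, "parcel": None},
    {"building_id": 2, "parcel": "002B"},
]
print(quality_report(records, universe_size=300))
```

Storing each report (for example, appending it to a log table) lets you trend the metrics over time rather than only checking them once.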

Data Quality Plan Checklist

Use the list below to develop your data quality requirements and plan.

Integrity: Plan for encouraging valid input and ongoing profiling of data.

Completeness: Plan for monitoring and improving completeness of data.

Uniqueness: Plan for ensuring uniqueness of data.

Consistency & Interoperability: Plan to use standards for design, field definitions, etc. that make it easier to combine with other datasets; plan for ongoing assessment.

Accuracy: Plan for ongoing assessment of accuracy and of the need for improved accuracy.
