Step 2: Define the Dataset
Last updated
Last updated
In Step 1, you collected needs and requirements. You defined the purpose, identified your stakeholders, and assessed their needs and requirements. In Step 2, you will define the data you will obtain. Follow the steps below to define your data model. The steps will be iterative:
2.1 Collect data standards and requirements
2.2 Define your tables
2.3 Define your fields
2.4 Define business rules
Assemble existing data standards and requirements, such as:
Naming conventions and standards
Existing field definitions and standards (stay tuned for future resources on this)
Data documentation requirements
Reference data lists and sources
Modeling tools or guidance and/or expectations
You'll reference and incorporate these as you build your model.
Your data may be a single table. If you have more than one table, you may want to create a diagram. This is known as data modeling or creating an entity relationship diagram. There are many helpful templates and references for doing this. The basic process is to:
List and define the objects, people, places, events or concepts in your business process.
Describe how they relate to each other.
You can use powerpoint or diagramming software (there are many) to model your data.
For each table, define the fields in the table. As appropriate, capture the following for each field:
Item
Description
Field Name
What is a brief human readable name for the field? This should be unique and stated in the singular. Don't use abbreviations or acronyms.
Short Name
What is the short, programmatic name for the database. Use existing naming conventions if available.
Type
What is the format of the data? Numeric, text, data and time, currency, boolean, coordinates, other.
Definition
Define the field. State it as a descriptive phrase or sentence. Be precise, unambiguous and concise. Make it unique and state it in the singular. State what it is, not what it is not. Can it stand alone? Define units of measure (e.g. miles or meters or feet).
Required
Yes or No. Is this field required or optional?
Valid inputs
For categories, what is the list of allowable values? For date and numbers, what is the allowable range?
Source
Defines how the data is entered, for example:
Standard
Yes/No. Is there an existing standard definition you can use? Be sure to review organizational as well as industry standards. We will be providing more guidance on this issue. Note that some fields such as mailing address may be multi-fields (Street1, Street2, City, State, Zip+4 etc).
Guide for use
Provide advice on how the field can be used.
Adapted from:
"A guide to data development", Australian Institute of Health and Welfare, Caberra, 2007.
"Guide to Using Data Collection Form and Data Specifications", Adult Cardiac Surgery Database, v2.61 Training Manual, Society of Thoracic Surgeons.
What constraints do common sense, policy and users put on the data? What input triggers an action or event? For example:
House numbers on a certain street must be within a certain range
Real estate sale prices shouldn't be negative
And no one alive today was born before 1900
When a 'status' field changes to "closed", automatically populate the 'dateclosed' field with the current date and time
Restate business rules in the language of your database (e.g., Date of Birth >= 01/01/1900). Enforce business rules at the point of collection or during maintenance. This will help you identify:
Mistakes such as incorrect data input or
Systemic issues, such as out of date reference data or address updates not being available in a timely manner
Use the following questions to help identify business rules:
What are examples of nonsensical entries? What rules would exclude nonsense?
What commonsense rules follow from the time course of events associated with each record?
If two records describe earlier and later versions of one entity, are the two records consistent?
For each money transaction, there is a corresponding transaction somewhere in the world. Should that transaction be visible somewhere else in the database?
What laws or policies govern the subject of the data? What rules do laws and policies impose?
The following business rules are from the design of a Campaign Finance database.
A report from B that states B contributed to A should have a match report from A that states that A received a contribution from B
Starting balance on report should match ending balance on previous report
If no previous report, starting cash balance is $0
Date of transaction must be within proper date range of the report.
Total of contributions from individual to candidate must be less than or equal to legal limit for each time period
Amended status must be "N" for first report submitted for any given year and cycle
Amended status must be "Y" for all but first report submitted for any given year and cycle
Expense descriptions are not null and must have content other than "expense", "reimbursement", etc
Employer and Employer Address are not null for larger donations (as defined by law)
All amount fields must be positive