Data Pipeline

Set it and forget it...

What is an automated data pipeline?

An automated data pipeline is a system that moves data from a source to a destination without manual intervention. Pipelines can be scheduled to run daily or even hourly and are a great way to ensure the data you share with the public is accurate and up to date.

Data pipelines can also make changes to data as it moves. This is useful if you need to remove personally identifiable information ("PII"), add columns, or perform calculations.
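
As a deliberately simplified illustration of those steps, the sketch below reads a source file, drops a hypothetical PII column, adds a calculated column, and writes the result. The file and column names are placeholders; in practice each step is handled by the tools described in the next section.

```python
import pandas as pd

# Minimal extract-transform-load sketch. The file paths and column names
# ("applicant_ssn", "fee", "tax") are hypothetical placeholders, not a
# real DataSF dataset.

def run_pipeline(source_path: str, destination_path: str) -> None:
    # Extract: read the raw export from the source system
    df = pd.read_csv(source_path)

    # Transform: drop PII and add a calculated column
    df = df.drop(columns=["applicant_ssn"])
    df["total_cost"] = df["fee"] + df["tax"]

    # Load: write the cleaned file to the destination
    df.to_csv(destination_path, index=False)

if __name__ == "__main__":
    run_pipeline("raw_permits.csv", "permits_public.csv")
```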

There are many ways to build a pipeline, which we will go over in the next section.

Data Pipeline Tools

DataSF has an ever-growing toolkit for moving data from one place to another. Below we outline the tools we use to extract, transform, and load data onto the Open Data Portal. The next section describes how these tools can be combined to create a data pipeline.

Extract

Tools we use to get data out of your database or spreadsheet

  • SFTP: If you or a database administrator in your department already has a method for querying or extracting data from your database, DataSF can set up a folder on an SFTP server where you can share the dataset with us (a minimal upload script is sketched after this list).

  • Safe FME: An ETL tool primarily used for geospatial data. It has pre-built "readers" which can extract data from databases.

  • Socrata Gateway: Socrata, our open data platform vendor, has pre-built connectors which can pull data from various sources. It requires you to download an "agent" which has credentials to access your database (or parts of your database). You are not able to do transformations with this tool, so the data has to be 'clean' already.

  • Azure Data Factory: Microsoft Azure has many pre-built integrations. It can pull data from popular databases (SQL Server, Oracle), REST APIs, SFTP servers, and more. A very flexible tool that also allows staging data in Azure Storage.
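
To make the SFTP option above concrete, here is a minimal sketch of the kind of script a department might schedule to push an extract to the shared folder. It assumes the Python paramiko library, and the host, credentials, and paths are placeholders rather than real DataSF infrastructure.

```python
import paramiko

# Sketch of the department side of the SFTP option: upload an extract to
# the shared folder DataSF provides. Host, credentials, and paths below
# are placeholders.

HOST = "sftp.example.org"
USERNAME = "dept_user"
PASSWORD = "change-me"        # in practice, use a key or a secrets manager
REMOTE_DIR = "/shared/open-data"

def upload_extract(local_path: str, remote_name: str) -> None:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(HOST, username=USERNAME, password=PASSWORD)
    try:
        sftp = client.open_sftp()
        sftp.put(local_path, f"{REMOTE_DIR}/{remote_name}")
        sftp.close()
    finally:
        client.close()

if __name__ == "__main__":
    upload_extract("permits_export.csv", "permits_export.csv")
```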

Transform

Tools we use to make modifications to the data before putting it on the open data portal

  • Safe FME: Best for geospatial transformations, such as adding Supervisor Districts or finding the closest intersection to a point.

  • Snowflake/dbt: Data can be loaded into Snowflake and transformed with the SQL-based dbt tool. This is a great option for joining datasets and performing business logic (sums, filters, normalization, pivots, etc.).

  • SoQL: "Socrata Query Language" is a SQL-like language included in our Enterprise Data Platform. Using this language you can perform simple operations such as joining datasets or filtering columns (see the query sketch after this list).

  • Azure Data Flows: A tool within the Microsoft Azure suite of products that lets you perform common SQL-style transformations on data stored in Azure Storage or SQL Server.
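
As a small illustration of SoQL, the sketch below runs a column-selecting, row-filtering query against a public dataset's SODA endpoint using Python's requests library. The dataset ID ("xxxx-xxxx") and column names are placeholders.

```python
import requests

# Sketch of a SoQL query against the Socrata SODA API: select two columns
# and filter rows. The dataset ID and column names ("neighborhood",
# "status") are hypothetical placeholders.

BASE_URL = "https://data.sfgov.org/resource/xxxx-xxxx.json"

params = {
    "$select": "neighborhood, status",
    "$where": "status = 'issued'",
    "$limit": 100,
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
for row in response.json():
    print(row)
```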

Load

Tools we use to upload data onto the open data portal

  • Safe FME: An ETL tool primarily used for geospatial data. It has pre-built "writers" which can load data into our open data portal.

  • Socrata Gateway: Socrata, our open data platform vendor, has pre-built connectors which can load data into the platform after extracting it. It requires you to download an "agent" which has credentials to access your database (or parts of your database).

  • REST API/URL: Socrata can pull data onto its platform using another system's REST API or similar URL/URI formats, such as a Microsoft Shared Access Signature (SAS) URL; a companion sketch of pushing rows to the portal over HTTP appears after this list.
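
The bullet above focuses on Socrata pulling data in, but rows can also be pushed to a dataset's own SODA endpoint over HTTP, which is how the "REST API" load step in the next section can work. Below is a hedged sketch of that push; the dataset ID, credentials, and row fields are placeholders, and in practice this step is usually handled by FME, the Gateway, or our orchestration tooling rather than a hand-written script.

```python
import requests

# Sketch of pushing rows into a Socrata dataset through its SODA resource
# endpoint (an upsert). The dataset ID ("xxxx-xxxx"), credentials, and row
# fields are placeholders; publishing requires an account with write
# access to the dataset.

DATASET_URL = "https://data.sfgov.org/resource/xxxx-xxxx.json"
APP_TOKEN = "your-app-token"      # placeholder
USERNAME = "publisher@sfgov.org"  # placeholder
PASSWORD = "change-me"            # placeholder; use a secrets manager in practice

rows = [
    {"permit_id": "2024-0001", "neighborhood": "Mission", "status": "issued"},
    {"permit_id": "2024-0002", "neighborhood": "Sunset", "status": "filed"},
]

response = requests.post(
    DATASET_URL,
    json=rows,
    headers={"X-App-Token": APP_TOKEN},
    auth=(USERNAME, PASSWORD),
    timeout=60,
)
response.raise_for_status()
print(response.json())  # Socrata returns a summary of rows created/updated
```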

Common Examples

Data pipeline setups based on common scenarios

  • Data in a spreadsheet:

    • Spreadsheet -> FME -> Open Data

      • Our first recommendation is to get this data into a more robust database. If that is not possible, we can use FME to pull data out of a SharePoint Excel spreadsheet and load that data onto the Open Data Portal. With this setup, you need to be very careful that nobody changes the columns or types in incorrect values, as this could break the pipeline (a sketch of this kind of column check appears after this list).

  • Data in Database, but can't grant access:

    • SFTP -> Gateway or FME -> Open Data

      • If you cannot give us access to your database, we recommend that you set up a job which uploads data to an SFTP folder (e.g. load last week's data every Monday). From there, we can extract it with a Socrata Gateway (if no transformations are necessary) or FME (if geospatial transformations are required).

  • Data in Database, needs complex transformations:

    • SFTP -> Azure Data Factory -> Snowflake -> Socrata

      • If the data needs complex transformations, instead of pulling the data directly from the SFTP folder to Open Data, we first use Azure Data Factory to move the data to a staging area (Azure Storage), where it can be pulled into Snowflake. Next, Snowflake is used to perform any necessary transformations (using dbt for orchestration). When all the transformations are complete, we move the new Snowflake tables into Open Data using either a Socrata Gateway or the REST API.
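
Because the spreadsheet scenario is the most fragile, here is a minimal sketch of the kind of column check a pipeline can run before anything is published. The file, sheet, and column names are hypothetical, and in practice FME performs the extraction rather than a hand-written script; the point is only to show why a renamed or deleted column should stop the pipeline early.

```python
import pandas as pd

# Sketch of the "data in a spreadsheet" scenario: read the sheet and check
# that the expected columns are present before anything is published.
# File name, sheet name, and column list are hypothetical placeholders.

EXPECTED_COLUMNS = ["permit_id", "neighborhood", "status", "issued_date"]

def extract_spreadsheet(path: str) -> pd.DataFrame:
    df = pd.read_excel(path, sheet_name="Permits")

    # Guard against the failure mode described above: someone renaming or
    # deleting a column should fail loudly instead of silently breaking
    # the downstream pipeline.
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Spreadsheet is missing expected columns: {missing}")

    return df[EXPECTED_COLUMNS]

if __name__ == "__main__":
    frame = extract_spreadsheet("permits.xlsx")
    print(f"Extracted {len(frame)} rows")
```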
