
Data Pipeline

Set it and forget it...



What is an automated data pipeline?

An automated data pipeline is a system that moves data from a source to a destination without a person having to do it manually. Pipelines can be scheduled to run every day or even every hour, and they are a great tool to ensure the data you are sharing with the public is accurate and up to date.

Data pipelines can also make changes to the data as it is moved. This is useful if you need to remove personally identifiable information ("PII"), add columns, or make calculations, as sketched below.
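
As a rough illustration of that kind of in-flight transformation, the sketch below uses pandas to drop PII columns and add a calculated column. The file names and column names are made-up placeholders, not a real DataSF dataset.

```python
import pandas as pd

# Read the raw extract (the file and column names here are placeholders).
df = pd.read_csv("raw_service_requests.csv")

# Remove personally identifiable information before publishing.
df = df.drop(columns=["requester_name", "requester_phone"])

# Add a calculated column: how many days each request took to close.
df["opened_date"] = pd.to_datetime(df["opened_date"])
df["closed_date"] = pd.to_datetime(df["closed_date"])
df["days_to_close"] = (df["closed_date"] - df["opened_date"]).dt.days

# Write the cleaned file for the next step in the pipeline.
df.to_csv("clean_service_requests.csv", index=False)
```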

There are many ways to build a pipeline, which we will go over in the next section.

Data Pipeline Tools

DataSF has an ever-growing toolkit for moving data from one place to another. Below we outline the different tools we use to extract, transform, and load data onto the Open Data Portal. The next section will outline how to combine these tools to create a data pipeline.

Extract

The tools we use to get data out of your database or spreadsheet

  • SFTP: If you or a database administrator in your department already has a method for querying or extracting data from your database, DataSF can set up a folder on an SFTP server where you can share the dataset with us (see the sketch after this list).

  • Safe FME: An ETL tool primarily used for geospatial data. It has pre-built "readers" which can extract data from databases.

  • Socrata Gateway: Socrata, our open data platform vendor, has pre-built connectors which can pull data from various sources. It requires you to download an "agent" which has credentials to access your database (or parts of your database). You are not able to do transformations with this tool, so the data has to be 'clean' already.

  • Azure Data Factory: Microsoft Azure has many pre-built integrations. It can pull data from popular databases and warehouses (SQL Server, Oracle), from REST APIs, from SFTP servers, and more. It is a very flexible tool that also allows staging data in Azure Storage.
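
For the SFTP option above, a department-side job might look roughly like the sketch below, which uses the paramiko library to drop a nightly extract into the shared folder. The host, credentials, and paths are placeholders; the real details come from DataSF when the folder is set up.

```python
import paramiko

# Placeholder connection details; DataSF provides the real host and credentials.
HOST = "sftp.example.org"
USERNAME = "dept_user"
PASSWORD = "********"

# Open an SSH connection and start an SFTP session.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username=USERNAME, password=PASSWORD)
sftp = ssh.open_sftp()

# Upload the latest extract into the shared folder.
sftp.put("clean_service_requests.csv", "/uploads/clean_service_requests.csv")

sftp.close()
ssh.close()
```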

Transform

The tools we use to modify data before putting it on the open data portal

  • Safe FME: Best for geospatial transformations such as adding Supervisor Districts or finding the closest intersection to a point.

  • SoQL: Short for "Socrata Query Language", a SQL-like language included in our Enterprise Data Platform. Using this language you can perform simple operations such as joining datasets or filtering columns (see the query sketch after this list).
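
To give a flavor of SoQL, the sketch below filters and trims a published dataset through Socrata's SODA API using the requests library. The dataset ID and column names are placeholders; every dataset on data.sfgov.org exposes the same kind of endpoint.

```python
import requests

# The four-by-four dataset ID and column names below are placeholders.
url = "https://data.sfgov.org/resource/xxxx-xxxx.json"

# SoQL clauses are passed as query parameters prefixed with "$".
params = {
    "$select": "neighborhood, category, date_opened",
    "$where": "date_opened > '2024-01-01'",
    "$limit": 100,
}

response = requests.get(url, params=params)
response.raise_for_status()

for row in response.json():
    print(row)
```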

Load

Tools we use to upload data onto the open data portal

  • Safe FME: An ETL tool primarily used for geospatial data. It has pre-built "writers" which can load data into our open data portal.

  • Socrata Gateway / SocrataPy: Socrata, our open data platform vendor, has pre-built connectors which can load data into the platform after extracting it. It requires you to download an "agent" which has credentials to access your database (or parts of your database). A scripted sketch appears after this list.

  • REST API/URL: Socrata can pull data onto its platform using another system's REST API or similar URL/URI formats, such as a Microsoft Shared Access Signature (SAS).
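
As one scripted way to handle the load step, the sketch below uses the open-source sodapy client; it may or may not be the same tool this page calls "SocrataPy", so treat it strictly as an illustration. The dataset ID, credentials, and column names are placeholders.

```python
from sodapy import Socrata

# Placeholder credentials and dataset ID, for illustration only.
client = Socrata(
    "data.sfgov.org",
    "YOUR_APP_TOKEN",
    username="publisher@sfgov.org",
    password="********",
)

# Rows to publish; the keys must match the dataset's column names.
rows = [
    {"neighborhood": "Mission", "total_requests": 42},
    {"neighborhood": "Sunset", "total_requests": 17},
]

# upsert() adds or updates rows; replace() would overwrite the whole dataset.
result = client.upsert("xxxx-xxxx", rows)
print(result)

client.close()
```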

Common Examples

Example data pipeline setups based on common scenarios

  • Data in a spreadsheet:

    • Spreadsheet -> SFTP -> Azure Data Factory -> Snowflake -> Open Data

      • Our first recommendation is to get this data into a more robust database. If that is not possible, you can drop the file onto the SFTP server. From there we can pull it into our infrastructure using Azure Data Factory and publish it with the Socrata Gateway.

  • Data in Database, but can't grant access:

    • SFTP -> Azure Data Factory -> Snowflake -> Open Data

      • If you cannot give us access to your database, we recommend that you set up a job which uploads data onto an SFTP folder (e.g. load last week's data every Monday). From there, we can extract it with Azure Data Factory, clean or transform it with Snowflake, and publish it through the Socrata Gateway (using SocrataPy).

  • Data in Database, needs complex transformations:

    • SFTP -> Azure Data Factory -> Snowflake -> Socrata

      • If the data needs complex transformations, instead of pulling the data directly from the SFTP folder to Open Data, we first use Azure Data Factory to move the data to a staging area (Azure Storage) where it can be pulled into Snowflake. Next, Snowflake is used to perform any transformations that are necessary (utilizing dbt for orchestration). When all the transformations are complete, we move the new Snowflake tables into Open Data using either the Socrata Gateway or the REST API.

Snowflake/dbt: Data can be loaded into Snowflake and transformed with SQL-based tooling, with dbt handling orchestration. This is a great way to join datasets together and perform business logic (sum, filter, normalize, pivot, etc.).
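
In practice we orchestrate these transformations with dbt, but each dbt model ultimately runs SQL against Snowflake. The sketch below shows the same idea with the snowflake-connector-python package; the account, credentials, and table names are placeholders, not our real infrastructure.

```python
import snowflake.connector

# Placeholder connection parameters; real values live in infrastructure config.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="********",
    warehouse="TRANSFORM_WH",
    database="OPEN_DATA",
    schema="STAGING",
)

# A simple transformation: join two staged tables, aggregate, and write the
# result to a table that will be pushed to the open data portal.
transform_sql = """
CREATE OR REPLACE TABLE PUBLISH.requests_by_neighborhood AS
SELECT n.neighborhood_name,
       COUNT(*) AS total_requests
FROM STAGING.service_requests r
JOIN STAGING.neighborhoods n
  ON r.neighborhood_id = n.neighborhood_id
GROUP BY n.neighborhood_name
"""

cur = conn.cursor()
cur.execute(transform_sql)
cur.close()
conn.close()
```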
