
Getting Started: WiNDC 2.0 Data Stream

Introduction

The input data for the WiNDC 2.0 build stream are collected from various government websites. The data are parsed, cleaned, and harmonized; similar data are joined; all data are scrubbed to remove unnecessary information; and finally the data are transformed into a format that GAMS can read. The output of this process is the GDX file windc_base.gdx, which is included in the WiNDC build stream package, windc_build.zip (163 MB).

This document describes how to construct the GDX file windc_base.gdx from the original sources that are available from government agencies.

The files that facilitate data processing are written in Python 3.6. The code relies on several Python packages, so you first need to create a Python environment with all necessary packages installed. See the instructions for details.

Downloads

  • All files for parsing and processing the data are available on GitHub. Click the button "Clone or Download" to clone the repository to your local machine, or simply download a .zip file (80 KB) with all files.
    Take care not to disrupt the directory structure. All data parsing should be performed in the directory windc_datastream.
  • All raw data source files: datasources.zip (153.6 MB).
  • An SQL database with all data sources, readily parsed and joined: SQL (198.1 MB).

The WiNDC 2.0 Datastream

In this section we describe the data stream that creates the windc_base.gdx input file, so that users can construct this file themselves should they wish to do so. The modular design of the data stream enables a wide audience to engage deeply with the data and simplifies debugging. For more information, see Adam Christensen's presentation.

The raw data files available from the various government agencies come in different formats (CSV, Excel), shapes, and structures. These data structures must be unified and made ready for input into a GAMS model. Converting the raw data to the GAMS-readable GDX format requires three steps: data parsing, data scrubbing, and data transforming.

Data Parsing

In this step the data are extracted from the raw data files, cleaned, and harmonized; in some cases similar data are joined into one file. The following image illustrates this step:

[Figure: Data from government websites are parsed with Python; the output can be written to a CSV file or directed to an SQL database. The parsed files are then joined with Python, again with output to CSV or an SQL database.]

Raw data are collected from the following government websites:

  • BEA IO Supply Table (71 Industry)
  • BEA IO Supply Table (Detailed)
  • BEA IO Use Table (71 Industry)
  • BEA IO Use Table (Detailed)
  • 2012 Commodity Flow Survey
  • EIA Annual Crude Oil Prices
  • EIA Average Tested Heat Rates
  • EIA State Energy Data System
  • BEA Gross State Product (Annual)
  • USDA NASS
  • BEA Personal Consumer Expenditures
  • Annual Survey of State Government Finances
  • USA Trade

The raw data files from all these government agencies are available for download (datasources.zip, 153.6 MB). Download the compressed file datasources.zip to the main directory, windc_datastream-master, and unzip it to create the directory datasources. All parsing scripts pull data from this directory. The raw files are parsed with the scripts parse_XXX.py. You can output the parsed data as a formatted CSV file or direct it to an SQL database; directions are given below.

A data harmonization step is part of the parsing process. For example, all values labeled "US Dollars" or "USD" must carry the same label; we chose "us dollars". This harmonization and labeling step is controlled by the files in the directory core_maps. Please do not modify these files, as they are used in the build stream; unintended inconsistencies can produce spurious results or domain-checking errors. We provide these files to help you map the naming conventions to a more human-readable format.

The files in the directory core_maps/gams control how the raw data are converted to a consistent, GAMS-friendly format. Please do not modify these files, to avoid unintended conflicts.
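To give a feel for what such a mapping does, here is a minimal pandas sketch (illustrative only; the mapping shown is a made-up stand-in for the actual core_maps files):

import pandas as pd

# Hypothetical stand-in for a core_maps mapping; the real files define
# the canonical labels used throughout the build stream.
units_map = {"US Dollars": "us dollars", "USD": "us dollars"}

raw = pd.DataFrame({"value": [1.0, 2.0], "units": ["US Dollars", "USD"]})

# Replace every recognized spelling with the canonical label.
raw["units"] = raw["units"].replace(units_map)
print(raw["units"].unique())  # ['us dollars']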

If necessary, raw data that exist across several files are concatenated with the scripts join_XXX.py.
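As a rough sketch of what such a join step does (not the actual join_XXX.py code; the file pattern is hypothetical), pandas can concatenate several parsed files into one table:

import glob
import pandas as pd

# Hypothetical per-year CSVs produced by a parse step.
paths = sorted(glob.glob("datasources/example/data_*.csv"))
frames = [pd.read_csv(path) for path in paths]

# Stack the pieces into a single table; the real scripts also make sure
# the columns line up before writing the result to CSV or SQL.
joined = pd.concat(frames, ignore_index=True)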

Creating a CSV file

If you wish to transform the 71 Industry BEA Supply Input-Output tables from their native Excel format into a .csv file, use the following command:

python3 parse_bea_supply_io.py --csv-out

Follow this pattern to convert the other data source files to .csv format.
We provide the file all_parse.py for users who want to parse all data in one step. In this case, run the following command:

python3 all_parse.py --csv-out

Loading the data into an SQL database

We assume that you are using a PostgreSQL database as the storage container. PostgreSQL is an open-source, production-grade flavor of SQL; installation details can be found here. First you need to create an empty database. Then you can direct data to this database, where a new table is created for each dataset. The following command creates a table with the data from the 71 Industry BEA Supply Input-Output tables in your SQL database:

python3 parse_bea_supply_io.py

If you wish to parse all data in one step and load them to your SQL database, run the following command:

python3 all_parse.py

All details of the SQL connection engine can be found in the file sql_engines.py. If you wish to use a different flavor of SQL, you need to modify this file; note, however, that the flavor of SQL has to be compatible with the Pandas method .to_sql. For more information, see the Pandas documentation.
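For orientation, the engine-plus-write pattern used with Pandas looks roughly like the following sketch (the connection string and table name are placeholders; the authoritative settings live in sql_engines.py):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder PostgreSQL connection string; user, password and database
# name must match your local setup and sql_engines.py.
engine = create_engine("postgresql://user:password@localhost:5432/windc")

df = pd.DataFrame({"year": [2016], "value": [1.0]})

# .to_sql creates (or replaces) a table for the dataset; any SQL flavor
# you substitute must support this Pandas method.
df.to_sql("example_table", engine, if_exists="replace", index=False)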

Data Scrubbing

In this step the data are scrubbed to remove unnecessary information that should not exist in the final dataset. The scripts scrub_XXX.py perform this task. The following image illustrates this step:

[Figure: Data from government websites are parsed with Python (output to CSV or an SQL database) and the parsed files are joined; the data are then scrubbed with Python, removing information such as regions or years that should not be in the final dataset. The scrubbed data can be written to a CSV file or directed to an SQL database.]

The scrubbing operations may be started in two ways:

  • from the raw data files in the directory datasources,
  • from input data that are already parsed and compiled in an SQL database made available by WiNDC.

If you wish to start from the SQL database provided by WiNDC, please follow these steps:

  1. Install PostgreSQL by following these instructions. PostgreSQL is an open-source, production-grade flavor of SQL. If you wish to use a different flavor of SQL, you need to modify the file sql_engines.py. However, note that your chosen flavor of SQL has to be compatible with the Pandas method .to_sql. For more information, see the Pandas documentation.
  2. Download the file windc_sql.gz (198.1 MB) and move it to the directory windc_datastream-master.
  3. Execute the following command in the directory windc_datastream-master:
    gunzip -c windc_sql.gz | psql dbname
    Here dbname is the local name of the created database. Note that this name should be consistent with the name in the file sql_engines.py. The command above is a piped operation between the gunzip utility and psql and is appropriate for larger databases. Your local SQL database is now ready to be used.

If you wish to extract the scrubbed State Energy Data System (EIA-SEDS) dataset as a CSV file, starting from the SQL database, execute the following command:

python3 scrub_seds.py --csv-out

Alternatively, if you wish to create a CSV copy of the scrubbed data starting from the original files in the datasources directory, execute:

python3 scrub_seds.py --csv-out --no-sql

Note that simply executing python3 scrub_seds.py will not produce any output.
This pattern of command-line boolean flags is the same for all scrub_XXX.py files.
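The flags themselves follow the standard argparse pattern; here is a minimal sketch (illustrative only, not the actual scrub_XXX.py source) of how such boolean switches are typically wired:

import argparse

parser = argparse.ArgumentParser(description="Scrub a parsed dataset.")
parser.add_argument("--csv-out", action="store_true",
                    help="write the scrubbed data to a CSV file")
parser.add_argument("--no-sql", action="store_true",
                    help="read from the raw datasources instead of SQL")
args = parser.parse_args()

if args.csv_out:
    # Placeholder for the real scrubbing and output logic.
    print("writing scrubbed CSV; reading from SQL:", not args.no_sql)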

Note that you can generate all scrubbed CSV files in one step with the following command:

python3 all_scrub.py --csv-out [--no-sql]

Here the flag --no-sql is optional and depends on your choice of source for the input data.

Data Transforming

In this step the data are transformed to prepare them for the downstream WiNDC processes: zeros are removed and data are mapped to GAMS domains. The output is the GDX database windc_base.gdx, which serves as input for the WiNDC build stream.
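Conceptually, the per-dataset transformation looks something like the following sketch (column names are made up, and the GDX write is shown with the third-party gdxpds package as one option; the actual scripts may use a different GAMS API):

import pandas as pd

df = pd.DataFrame({"region": ["WI", "MN"], "year": [2016, 2016],
                   "value": [3.5, 0.0]})

# Drop zero records: GDX storage is sparse, so zeros carry no information.
df = df[df["value"] != 0.0]

# Map labels onto the GAMS domain sets defined in core_maps/gams
# (a trivial hypothetical mapping is shown here).
df["region"] = df["region"].str.lower()

# Write the result to GDX, e.g. with gdxpds:
# import gdxpds
# gdxpds.to_gdx({"example": df}, "windc_base.gdx")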

If you wish to create the file windc_base.gdx from scratch, starting from the SQL database, execute the following command:

python3 all_gamsify.py

To create the GDX file from the raw data files, run:

python3 all_gamsify.py --no-sql

Note that the file windc_base.gdx must be moved to the directory windc_build/build_files for the WiNDC build stream.

The image below illustrates the whole data stream process:

[Figure: The complete data stream. Data from government websites are parsed with Python (output to CSV or an SQL database) and joined; the data are then scrubbed, removing information such as regions or years that should not be in the final dataset; next the data are made GAMS-compatible by removing zeros and mapping them onto GAMS domains; finally the data are saved to a GDX file. This GDX file is the single point of entry for the WiNDC build stream.]