Getting Started: WiNDC Data System
Introduction
The WiNDC data system is a Python package that facilitates processing data collected from government websites in order to generate the file windc_base.gdx. This file serves as input for the WiNDC build stream. Users can choose which version of the data file they wish to generate. Currently, versions 2.0.1 and 2.1 are available; see the release notes for details on the versions. Each version works with the balancing system.
The WiNDC data system (windc_data) enables users to:
- Generate the data input file for the WiNDC build from source data,
- Compare data of different WiNDC versions, as updated data are incorporated from the primary government sources,
- Add supplemental data to the WiNDC input data file for later use in modeling exercises, and
- Rebuild their own input data file system to suit their needs.
Downloads
- All raw data source files: windc_raw_data.zip (298 MB)
Python Environment
The WiNDC data system (windc_data) is a Python package written for Python 3.7. You will first need to create the proper Python environment in which the package windc_data will be installed. See the instructions for details.
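If you use conda (as in the activation step below), a minimal sketch of creating such an environment could look like the following; the exact setup, including how windc_data is installed into the environment, is covered by the linked instructions:
% conda create --name windc python=3.7
% conda activate windc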
Generate the WiNDC input data file (windc_base.gdx)
The WiNDC input data file windc_base.gdx is included in the WiNDC build package. For details on the WiNDC build, see Getting Started: WiNDC Build Stream. Here we show how to construct the GDX file windc_base.gdx from the original sources that are available from government agencies.
1. Download the raw data source files windc_raw_data.zip (298 MB). Unzipping the file will create the directory windc_raw_data. This is your data directory.
2. Activate the windc environment:
conda activate windc
3. The next steps are performed in iPython. Alternatively, you could use another interactive system like Jupyter.
(windc) ... % ipython
4. Import the windc_data system to your session:
In [1]: import windc_data as wd
5. Instantiate the windc_data object by specifying the path to the GAMS system directory sysdir, the location of the data directory windc_raw_data, and the WiNDC version number:
In [2]: w1 = wd.WindcEnvironment(
        gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",
        data_dir="~/Desktop/windc_raw_data",
        version="windc_2_1",
        )
Alternatively, you could specify WiNDC version 2.0.1 with version="windc_2_0_1".
6. Generate the WiNDC input data file windc_base.gdx for your chosen version:
In [3]: w1.rebuild(gdxout=True)
windc_base.gdx will be built in your current working directory.
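For reference, the same steps can be collected into a short, non-interactive script. This is a minimal sketch of the commands above; the GAMS and data paths are examples and must be adjusted to your machine:
# build_windc_base.py -- sketch of the interactive steps above as a plain script
import windc_data as wd

# Instantiate the data object for WiNDC version 2.1 (use "windc_2_0_1" for version 2.0.1)
w1 = wd.WindcEnvironment(
    gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",  # path to your GAMS system directory
    data_dir="~/Desktop/windc_raw_data",                    # path to your data directory
    version="windc_2_1",
)

# Clean and transform the data and write windc_base.gdx to the current working directory
w1.rebuild(gdxout=True)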
Global Label System
The windc_data system begins by reading in data from the data directory for the version specified by the user (see step 5 above). The data is read into a stack of Pandas dataframes. Pandas facilitates storing multifaceted data and has powerful data cleaning, organizing, and transforming tools.
It is pivotal to apply a uniform labeling scheme to the entire dataframe stack if diverse data sources are used within one system. The WiNDC data system includes a notation that is applied across the entire dataframe stack. This notation is linked to an individual column (or columns) in a specific dataframe (or dataframes) by creating a notation link. The notation and notation link in windc_data define a global label system that is applied to all dataframes in the stack. If any element in the dataframe stack does not match the corresponding notation, it is flagged for potential renaming or bulk removal. The WiNDC data system includes a tool (the method test_notation()) that will scan the entire dataframe stack in search of rogue elements.
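For example, after the data has been read in, you can run this check directly on the data object created above. This is a sketch assuming the method is called without arguments; check the method's help for its exact signature:
In [4]: w1.test_notation()
Elements that do not match the global notation are reported so they can be renamed or removed in bulk.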
Access Data
The WiNDC data system relies on Pandas to enable easy access to data. The windc_data user will need to be familiar with Pandas in order to perform the data access operations. We took great care to make access to the raw data as easy as possible. We recommend working with an interactive Python system such as iPython or Jupyter, as access to the raw dataframes is enhanced by the tab completion feature. A few examples follow.
1. Instantiate a windc_data object as described above.
2. Once all the data has been read into the data object w1, you can start to interrogate the different datasets. Note that at this stage the data has not been manipulated in any way; it exists as read from the raw data files. Using the tab completion feature in iPython, you can explore the different methods that are available within the object w1:
In [3]: w1.
apply_gams_labels() clean data_dir find_column rebuild test_notation version
bulk_map_column column_dtypes diff gams_sysdir remap to_csv windc_notation
bulk_replace column_view drop_rows gdx_data remove_zeros transform
bulk_strip data export_gdx notation_link set_gather unique_columns
3. The data that has been read in exists in w1.data and is stored as a Python dictionary. You can explore w1.data with the usual Python dictionary methods. For example, to determine the names of all the tables that have been read in, find the keys of the dictionary w1.data:
In [3]: w1.data.keys()
Out[3]: dict_keys(['bea_gsp', 'bea_pce', 'bea_supply', 'bea_supply_det', 'bea_use', 'bea_use_det', 'census_sgf', 'cfs', 'eia_crude_price', 'eia_emissions', 'eia_heatrate', 'eia_seds', 'emission_rate', 'state_exim', 'usda_nass'])
In [4]: w1.data["bea_gsp"]
Note: In some cases, multiple raw data files were merged together to create the single dataframe referenced as w1.data[key]. The windc_data system allows users to access each of the underlying dataframes before any merge operations were performed. An example follows.
4. The BEA Use and Supply tables are stored in a single Excel file where each year has its own sheet. If users wish to debug the reading of BEA Use data from 2001, they could access just this data starting with the already imported data (from the data directory windc_raw_data):
In [5]: w2 = wd.windc_2_1.parse.bea_use._2001("/Users/adam/Desktop/windc_raw_data/windc_2_1/BEA/IO/")
In [6]: w2
Note: Users must follow the structure of windc_data to find the desired operation. In this example, we pull data from WiNDC 2.1 (version windc_2_1) that has been parsed with the parse submodule for bea_use._2001.
Note further: Users need to specify the correct directory path to their data directory, but not the actual filename.
We recommend that you familiarize yourself with the submodule space and the methods using tab completion. The process described above can be repeated in a similar way to access other individual dataframes.
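Because every entry in w1.data is an ordinary Pandas dataframe, the usual Pandas inspection tools apply directly. For example, a quick look at the raw BEA GSP table (a sketch using standard Pandas methods, not windc_data-specific functionality):
In [7]: gsp = w1.data["bea_gsp"]
In [8]: gsp.head()    # first few rows of the raw table
In [9]: gsp.dtypes    # column types as parsed from the raw file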
Export Data
Once the data is read into the object w1, as described above, you can export the data to CSV files using the .to_csv() method.
In [7]: w1.to_csv()
This method takes all the dataframes that are in the dictionary w1.data and writes them to disk using the standard Pandas method, also called .to_csv().
The WiNDC data system creates a directory called standardized_data in your current working directory and stores all CSV files there. CSV files can be helpful for debugging.
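If you want to inspect one of the exported files later, you can read it back with standard Pandas. A small sketch, assuming a file name based on the dictionary key (check the standardized_data directory for the file names actually written):
In [8]: import pandas as pd
In [9]: df = pd.read_csv("standardized_data/bea_gsp.csv")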
Compare Data
Data from government sources is updated on a semi-annual basis. These updates typically revise data for the previous five to eight years. As a consequence, we need to sandbox a full suite of data sources for every WiNDC version. The data sources are often updated asynchronously. Currently, the WiNDC version releases are not coordinated with any other data source release. This may change as WiNDC matures.
Users may wish to examine how the data has been updated over time. The WiNDC data system facilitates comparing data from different WiNDC versions. We illustrate this feature by comparing the BEA Use tables from WiNDC 2.0.1 (version windc_2_0_1) and WiNDC 2.1 (version windc_2_1).
First, data from government websites is read into a Python data object, like w1 in our example above. Then the data has to be cleaned and harmonized to remove any spurious data labels or other unnecessary data oddities before it can be compared to the data of another release. The WiNDC data system provides the method .rebuild() to facilitate performing all necessary steps in one motion.
The .rebuild() method cleans the data, transforms it as necessary, removes zero-valued elements, and applies GAMS labels to the data. Users can opt to omit the final GDX file by including the argument gdxout=False:
In [8]: w1.rebuild(gdxout=False)
This operation relies on the notation system that has been built into the windc_data system. All notation links and other cleaning and transforming operations have been functionalized within the w1 object. When the .rebuild() method is executed, users perform the same manipulations that were originally performed when building the WiNDC input data of that particular version.
The result is a clean and harmonized set of data for WiNDC version 2.1 inside the w1 data object.
In the next step, we create a new data object called w2 that will contain data from WiNDC version 2.0.1, and we use the method .rebuild() to process the data.
In [9]: w2 = wd.WindcEnvironment(
gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",
data_dir="~/Desktop/windc_raw_data",
version="windc_2_0_1",
)
In [10]: w2.rebuild(gdxout=False)
Note that you need to adjust the path to your GAMS system directory sysdir and the data directory windc_raw_data.
Users can compare source data from the two WiNDC versions by using the Pandas method merge. For example, if users want to compare data from w1.data["bea_use"] with data from w2.data["bea_use"], then data from w1.data["bea_use"] needs to be merged with w2.data["bea_use"]. The data is merged with an outer join and a few additional arguments.
The following command displays the column names of the dataframe w1.data["bea_use"]. We need the column names to perform the outer join.
In [11]: w1.data["bea_use"].columns
Out[11]:
Index(['IOCode', 'Row_Name', 'Commodities/Industries', 'Column_Name', 'value',
'year', 'units'],
dtype='object')
Both w1.data["bea_use"] and w2.data["bea_use"] should have the same column names at this point in the build. We want to join the two dataframes on all columns except for value. This is the column with the numerical data we are looking to compare. We specify the merge operation as follows:
In [12]: w1.data["bea_use"].merge(w2.data["bea_use"], on=['IOCode', 'Row_Name', 'Commodities/Industries',
        'Column_Name', 'year', 'units'], how="outer", indicator="True")
The resulting dataframe contains the new columns value_x, value_y, and True. The column value_x is identical to the column value in w1.data["bea_use"], and the column value_y is identical to the column value in w2.data["bea_use"]. The column True contains indicators that specify in which dataframe a particular entry can be found (left_only, right_only, or both). Now the data exist in an aligned format that facilitates easy comparisons. Other dataframes can be compared in a similar way.
More advanced merge operations are possible; see the Pandas documentation for further details.
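As a final illustration, the join columns can also be selected programmatically, and the merged result can be filtered down to the rows where the two versions disagree. This is a sketch of one possible workflow using standard Pandas, not a windc_data feature:
In [13]: join_cols = [c for c in w1.data["bea_use"].columns if c != "value"]
In [14]: merged = w1.data["bea_use"].merge(w2.data["bea_use"], on=join_cols, how="outer", indicator="True")
In [15]: merged[merged["value_x"] != merged["value_y"]]   # differing values, including entries present in only one version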