Getting Started: WiNDC Data System
Introduction
The WiNDC data system is a Python package that facilitates processing data collected from government websites in order to generate the file windc_base.gdx. This file serves as input for the WiNDC build stream. Users can choose which version of the data file they wish to generate. Currently, versions 2.0.1 and 2.1 are available; see the release notes for details on the versions. Each version works with the balancing system.
The WiNDC data system (windc_data) enables users to:
- Generate the data input file for the WiNDC build from source data,
- Compare data of different WiNDC versions, as updated data are incorporated from the primary government sources,
- Add supplemental data to the WiNDC input data file for later use in modeling exercises, and
- Rebuild their own input data file system to suit their needs.
Downloads
- All raw data source files: windc_raw_data.zip (298 MB)
Python Environment
The WiNDC data system (windc_data) is a Python package written for Python 3.7. You will first need to create the proper Python environment in which the package windc_data will be installed. See the instructions for details.
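If you use conda (as in the activation step below), a minimal sketch of creating such an environment could look like the following; the exact setup, including how windc_data is installed into the environment, is covered by the linked instructions:
% conda create --name windc python=3.7
% conda activate windc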
Generate the WiNDC input data file (windc_base.gdx)
The WiNDC input data file windc_base.gdx is included in the WiNDC build package. For details on the WiNDC build, see Getting Started: WiNDC Build Stream. Here we show how to construct the GDX file windc_base.gdx from the original sources that are available from government agencies.
1. Download the raw data source files windc_raw_data.zip (298 MB). Unzipping the file will create the directory windc_raw_data. This is your data directory.
2. Activate the windc environment:
conda activate windc
3. The next steps are performed in iPython. Alternatively, you could use another interactive system like Jupyter.
(windc) ... % ipython
4. Import the windc_data system to your session:
In [1]: import windc_data as wd
5. Instantiate the windc_data object by specifying the path to the GAMS system directory sysdir, the location of the data directory windc_raw_data, and the WiNDC version number:
In [2]: w1 = wd.WindcEnvironment(
        gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",
        data_dir="~/Desktop/windc_raw_data",
        version="windc_2_1",
        )
Alternatively, you could specify WiNDC version 2.0.1 with version="windc_2_0_1".
6. Generate the WiNDC input data file windc_base.gdx for your chosen version:
In [3]: w1.rebuild(gdxout=True)
windc_base.gdx will be built in your current working directory.
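For reference, the same steps can be collected into a short, non-interactive script. This is a minimal sketch of the commands above; the GAMS and data paths are examples and must be adjusted to your machine:
# build_windc_base.py -- sketch of the interactive steps above as a plain script
import windc_data as wd

# Instantiate the data object for WiNDC version 2.1 (use "windc_2_0_1" for version 2.0.1)
w1 = wd.WindcEnvironment(
    gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",  # path to your GAMS system directory
    data_dir="~/Desktop/windc_raw_data",                    # path to your data directory
    version="windc_2_1",
)

# Clean and transform the data and write windc_base.gdx to the current working directory
w1.rebuild(gdxout=True)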
Global Label System
The windc_data system begins by reading in data from the data directory for the version specified by the user (see step 5 above). The data is read into a stack of Pandas dataframes. Pandas facilitates storing multifaceted data and has powerful data cleaning, organizing, and transforming tools.
It is pivotal to apply a uniform labeling scheme to the entire dataframe stack if diverse data sources are used within one system. The WiNDC data system includes a notation that is applied across the entire dataframe stack. This notation is linked to an individual column (or columns) in a specific dataframe (or dataframes) by creating a notation link. The notation and notation link in windc_data define a global label system that is applied to all dataframes in the stack. If any element in the dataframe stack does not match the corresponding notation, it is flagged for potential renaming or bulk removal. The WiNDC data system includes a tool (the method test_notation()) that will scan the entire dataframe stack in search of rogue elements.
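For example, after the data has been read in, you can run this check directly on the data object created above. This is a sketch assuming the method is called without arguments; check the method's help for its exact signature:
In [4]: w1.test_notation()
Elements that do not match the global notation are reported so they can be renamed or removed in bulk.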
Access Data
The WiNDC data system relies on Pandas to enable easy access to data. The windc_data user will need to be familiar with Pandas in order to perform the data access operations. We took great care to make access to the raw data as easy as possible. We recommend working with an interactive Python system such as iPython or Jupyter, as access to the raw dataframes is enhanced by the tab completion feature. A few examples follow.
1. Instantiate a windc_data object as described above.
2. Once all the data has been read into the data object w1, you can start to interrogate the different datasets. Note that at this stage the data has not been manipulated in any way; it exists as read from the raw data files. Using the tab completion feature in iPython, you can explore the different methods that are available within the object w1:
In [3]: w1.
apply_gams_labels() clean data_dir find_column rebuild test_notation version
bulk_map_column column_dtypes diff gams_sysdir remap to_csv windc_notation
bulk_replace column_view drop_rows gdx_data remove_zeros transform
bulk_strip data export_gdx notation_link set_gather unique_columns
3. The data that has been read in exists in w1.data and is stored as a Python dictionary. You can explore w1.data with the usual Python dictionary methods. For example, to determine the names of all the tables that have been read in, find the keys of the dictionary w1.data:
In [3]: w1.data.keys()
Out[3]: dict_keys(['bea_gsp', 'bea_pce', 'bea_supply', 'bea_supply_det', 'bea_use', 'bea_use_det', 'census_sgf', 'cfs', 'eia_crude_price', 'eia_emissions', 'eia_heatrate', 'eia_seds', 'emission_rate', 'state_exim', 'usda_nass'])
In [4]: w1.data["bea_gsp"]
Note: In some cases, multiple raw data files were merged together to create the single dataframe referenced as w1.data[key]. The windc_data system allows users to access each of the underlying dataframes before any merge operations were performed. An example follows.
4. The BEA Use and Supply tables are stored in a single Excel file where each year has its own sheet. If users wish to debug the reading of BEA Use data from 2001, they could access just this data starting with the already imported data (from the data directory windc_raw_data):
In [5]: w2 = wd.windc_2_1.parse.bea_use._2001("/Users/adam/Desktop/windc_raw_data/windc_2_1/BEA/IO/")
In [6]: w2
Note: Users must follow the structure of windc_data to find the desired operation. In this example, we pull data from WiNDC 2.1 (version windc_2_1) that has been parsed with the parse submodule for bea_use._2001.
Note further: Users need to specify the correct directory path to their data directory, but not the actual filename.
We recommend that you familiarize yourself with the submodule space and the methods using tab completion. The process described above can be repeated in a similar way to access other individual dataframes.
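Because every entry in w1.data is an ordinary Pandas dataframe, the usual Pandas inspection tools apply directly. For example, a quick look at the raw BEA GSP table (a sketch using standard Pandas methods, not windc_data-specific functionality):
In [7]: gsp = w1.data["bea_gsp"]
In [8]: gsp.head()    # first few rows of the raw table
In [9]: gsp.dtypes    # column types as parsed from the raw file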
Export Data
Once the data is read into the object w1, as described above, you can export the data to CSV files using the .to_csv() method.
In [7]: w1.to_csv()
This method takes all the dataframes that are in the dictionary w1.data and writes them to disk using the standard Pandas method, also called .to_csv().
The WiNDC data system creates a directory called standardized_data in your current working directory and stores all CSV files there. CSV files can be helpful for debugging.
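If you want to inspect one of the exported files later, you can read it back with standard Pandas. A small sketch, assuming a file name based on the dictionary key (check the standardized_data directory for the file names actually written):
In [8]: import pandas as pd
In [9]: df = pd.read_csv("standardized_data/bea_gsp.csv")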
Compare Data
Data from government sources is updated on a semi-annual basis. These updates typically revise data for the previous five to eight years. As a consequence, we need to sandbox a full suite of data sources for every WiNDC version. The data sources are often updated asynchronously. Currently, the WiNDC version releases are not coordinated with any other data source release. This may change as WiNDC matures.
Users may wish to examine how the data has been updated over time. The WiNDC data system facilitates comparing data from different WiNDC versions. We illustrate this feature by comparing the BEA Use tables from WiNDC 2.0.1 (version windc_2_0_1) and WiNDC 2.1 (version windc_2_1).
First, data from government websites is read into a Python data object, like w1 in our example above. Then the data has to be cleaned and harmonized to remove any spurious data labels or other unnecessary data oddities before it can be compared to the data of another release. The WiNDC data system provides the method .rebuild() to facilitate performing all necessary steps in one motion.
The .rebuild() method cleans the data, transforms it as necessary, removes zero-valued elements, and applies GAMS labels to the data. Users can opt to omit the final GDX file by including the argument gdxout=False:
In [8]: w1.rebuild(gdxout=False)
This operation relies on the notation system that has been built into the windc_data system. All notation links and other cleaning and transforming operations have been functionalized within the w1 object. When the .rebuild() method is executed, users perform the same manipulations that were originally performed when building the WiNDC input data of that particular version.
The result is a clean and harmonized set of data for WiNDC version 2.1 inside the w1 data object.
In the next step, we create a new data object called w2 that will contain data from WiNDC version 2.0.1, and we use the method .rebuild() to process the data.
In [9]: w2 = wd.WindcEnvironment(
gams_sysdir="/Applications/GAMS30.3/Resources/sysdir",
data_dir="~/Desktop/windc_raw_data",
version="windc_2_0_1",
)
In [10]: w2.rebuild(gdxout=False)
Note that you need to adjust the path to your GAMS system directory sysdir and the data directory windc_raw_data.
Users can compare source data from the two WiNDC versions by using the Pandas method merge. For example, if users want to compare data from w1.data["bea_use"] with data from w2.data["bea_use"], then data from w1.data["bea_use"] needs to be merged with w2.data["bea_use"]. The data is merged with an outer join and a few additional arguments.
The following command displays the column names of the dataframe w1.data["bea_use"]. We need the column names to perform the outer join.
In [11]: w1.data["bea_use"].columns
Out[11]:
Index(['IOCode', 'Row_Name', 'Commodities/Industries', 'Column_Name', 'value',
'year', 'units'],
dtype='object')
Both w1.data["bea_use"] and w2.data["bea_use"] should have the same column names at this point in the build. We want to join the two dataframes on all columns except for value. This is the column with the numerical data we are looking to compare. We specify the merge operation as follows:
In [12]: w1.data["bea_use"].merge(w2.data["bea_use"], on=['IOCode', 'Row_Name', 'Commodities/Industries',
        'Column_Name', 'year', 'units'], how="outer", indicator="True")
The resulting dataframe contains the new columns value_x, value_y, and True. The column value_x is identical to the column value in w1.data["bea_use"], and the column value_y is identical to the column value in w2.data["bea_use"]. The column True contains indicators that specify in which dataframe a particular entry can be found (left_only, right_only, or both). Now the data exist in an aligned format that facilitates easy comparisons. Other dataframes can be compared in a similar way.
More advanced merge operations are possible; see the Pandas documentation for further details.
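As a final illustration, the join columns can also be selected programmatically, and the merged result can be filtered down to the rows where the two versions disagree. This is a sketch of one possible workflow using standard Pandas, not a windc_data feature:
In [13]: join_cols = [c for c in w1.data["bea_use"].columns if c != "value"]
In [14]: merged = w1.data["bea_use"].merge(w2.data["bea_use"], on=join_cols, how="outer", indicator="True")
In [15]: merged[merged["value_x"] != merged["value_y"]]   # differing values, including entries present in only one version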