Example Run
Introduction¶
This is the second of three jupyter notebook tutorials introducing the popexposure Python package. In these notebooks, we will use popexposure to estimate the number of people residing near California wildfire disasters for each year spanning 2016 to 2018.
This notebook walks through the core popexposure functionality, carrying out the following calculations:
- Estimate the total number of people residing within 10 km of one or more California wildfire disasters, for each year spanning 2016 - 2018.
- Estimate the total number of people residing within 10 km of each unique California wildfire disaster, for each year spanning 2016 - 2018.
- Estimate the total number of people residing within 10 km of one or more California wildfire disasters, for each year spanning 2016 - 2018, broken down by 2020 ZCTA.
- Estimate the total number of people residing within 10 km of each unique California wildfire disaster, for each year spanning 2016 - 2018, broken down by 2020 ZCTA.
- Calculate the population for each 2020 ZCTA (to serve as denominators for the above estimates).
Notebook 01 contains all the necessary set up to run this notebook, while notebook 03 plots final results.
Library imports, data directory set up¶
# Library imports
import pathlib
import glob
import pandas as pd
import popexposure as ex
# Setting up data directories
base_path = pathlib.Path.cwd()
data_dir = base_path / "demo_data"
# Hazard wildfire data filepaths
wildfire_paths = glob.glob(str(data_dir / "02_interim_data" / "*fire*"))
# Population data filepath
ghsl_path = data_dir / "01_raw_data" / "GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0_R5_C8" / "GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0_R5_C8.tif"
# ZCTA admin data filepath
zcta_path = glob.glob(str(data_dir / "02_interim_data" / "*zcta*"))[0]
# Constant starting year to use throughout
START_YEAR = 2016
We're now set up to run the five cases we're interested in.
1. Find the total number of people residing within 10km of one or more California wildfire disasters in 2016, 2017, and 2018.¶
First, we will find the total number of people residing within 10 km of one or more California wildfire disasters in 2016, 2017, and 2018.
To do this, we will make an estimator by instantiating the main class of popexposure, PopEstimator, and then use one of the two methods available to calculate the exposed population.
The class constructor for PopEstimator requires population data as an argument. This allows us to use the same population data for all the calculations we're doing with an instance of PopEstimator. If we wanted to use different population data, we could reset this attribute of our PopEstimator before calling either est_exposed_pop and est_total_pop. In our case, we'll be using the same population data for all calculations.
If we wanted estimates by administrative unit, like ZCTAs, we'd give the PopEstimator that data as well when we construct it. But right now, we don't want that, so all we need is population data.
After we make an estimator, we can call the method est_exposed_pop to actually get the estimate of the number of people exposed.
To indicate to est_exposed_pop that we want to find the number of people living near one or more disasters, rather than the number of people near each unique disaster, we'll use the hazard_specific parameter. When we set hazard_specific to False, this tells est_exposed_pop that we want to one estimate of everyone exposed to one or more disaster, rather than estimates of the number of people exposed to every unique disaster.
Because we're looping over three years, we'll initialize an empty list first, and then store the results in this list. We're also adding a year variable to the result as we go. In total, this takes around 5 seconds.
est = ex.PopEstimator(pop_data=ghsl_path)
num_exposed_list = []
for i in range(3):
num_exposed = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=False)
num_exposed['year'] = START_YEAR + i
num_exposed_list.append(num_exposed)
# Join those dataframes together.
num_exposed_df = pd.concat(num_exposed_list, axis=0)
num_exposed_df.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire.parquet")
num_exposed_df.head()
Our output has three columns: ID_hazard, exposed_10km, and year.
We added manually added the year column, but the other two are output from est_exposed_pop.
The result column exposed_10km is called exposed_10km because we named our buffer distance column buffer_dist_10km, so that suffix got carried through to our results.
Because we ran est_exposed_pop with hazard_specific = False, our ID_hazard column has changed. It now has the value 'merged_geoms' in every row, since the output contains the number representing the count of everyone exposed to one or more wildifre disasters in each year. est_exposed_pop dropped the wildfire IDs.
Just to highlight this again: in this first step, we wanted to count the number of people living within 10km of one or more California wildfire disasters. There are some people who lived within 10km of two or more wildfire disasters. We did not want to double-count those people. When computing this total, rather than the number of people affected by each unique hazard, est_exposed_pop takes the unary union of any buffered hazards that are overlapping, and finds the total of everyone residing within that area. This is why the column contains the value 'merged_geoms'.
2. Find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018.¶
To find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018, (rather than one or more disasters) we can use est_exposed_pop with many of the same inputs as in the previous step, but we do need to change the value of hazard_specific. In this case, we need to set hazard_specific to True, to get a count of the number of people within 10km of each unique wildfire disaster boundary, regardless of whether two or more exposed areas overlap.
num_exposed_list = []
for i in range(3):
num_exposed = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=True)
num_exposed['year'] = START_YEAR + i
num_exposed_list.append(num_exposed)
num_exposed_unique = pd.concat(num_exposed_list, axis=0)
num_exposed_unique.to_parquet(data_dir / "03_results" / "num_aff_by_unique_wildfire.parquet")
num_exposed_unique.head()
As in the previous step, our output has three columns: ID_hazard, exposed_10km, and year.
This time, the ID_hazard column is the same as in the data we passed to est_exposed_pop, and contains each unique hazard ID. This time, if someone lived near two or more fires, they are included in the count of people exposed for each fire. When hazard_specific is set to True, people may be double counted or triple or more if there were near two or more fires.
3. Find the total number of people residing within 10km of one or more California wildfire disaster in 2016, 2017, and 2018 by 2020 ZCTA.¶
To break down estimates from est_exposed_pop by ZCTA, we need to create a PopEstimator with additional administrative unit data. We'll pass the ZCTA data to the PopEstimator when we construct it. Then, when we call est_exposed_pop, it will automatically break down our estimates by ZCTA.
Because we want to look at exposure to one or more wildfires in this step, we'll run run est_exposed_pop with hazard_specific = False.
est = ex.PopEstimator(pop_data=ghsl_path, admin_data=zcta_path) # note the ZCTA data passed here
num_exposed_zcta_list = []
for i in range(3):
num_exposed_zcta = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=False)
num_exposed_zcta['year'] = START_YEAR + i
num_exposed_zcta_list.append(num_exposed_zcta)
num_affected_zcta_df = pd.concat(num_exposed_zcta_list, axis=0)
num_affected_zcta_df.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire_by_zcta.parquet")
num_affected_zcta_df.head()
This computation takes a little bit longer than when we weren't looking for ZCTA-specific estimates.
This time, because we passed ZCTA data, in the output there is one row for all the merged geoms in each year, but also one row for each ZCTA. The count in row 0 is the number of people living within 10 km of one or more California wildfires for that given ZCTA.
4. Find the total number of people residing within 10 km of each unique California wildfire disaster in 2016, 2017, and 2018 by 2020 ZCTA.¶
For our final case of counting exposed people, we wanted to find the number of people living near each unique hazard in each ZCTA. For this, we need to use est_exposed_pop with ZCTA data and with hazard_specific = True.
num_exposed_zcta_unique_list = []
for i in range(3):
num_exposed_zcta_unique = est.est_exposed_pop(hazard_data=wildfire_paths[i], hazard_specific=True)
num_exposed_zcta_unique['year'] = START_YEAR + i
num_exposed_zcta_unique_list.append(num_exposed_zcta_unique)
num_exposed_df_zcta_unique = pd.concat(num_exposed_zcta_unique_list, axis=0)
num_exposed_df_zcta_unique.to_parquet(data_dir / "03_results" / "num_people_affected_by_wildfire_zcta_unique.parquet")
num_exposed_df_zcta_unique.head()
Here, we have the same columns as in step 3 - ID_hazard, ID_admin_unit, exposed_10km, and year. However, this time there is a row for each hazard and each admin unit. If people were exposed to more than one fire, they're included in the counts for each of those fires.
5. Find the population of all 2020 ZCTAs.¶
Finally, let's use the other method available in this package to get some denominators for our dataset. The method est_total_pop can help us use the gridded population data we used to find the number of people residing in each administrative unit - in this case, ZCTAs. This is useful if we're using a gridded population dataset that we think is a big improvement over other population counts in our additional administrative units, or we just want to be consistent and use the same data source to find exposed people and the total population.
num_residing_by_zcta = est.est_total_pop()
num_residing_by_zcta.to_parquet(data_dir / "03_results" / "num_people_residing_by_zcta.parquet")
num_residing_by_zcta.head()
This time, in the output, we have a column for administrative unit and a column for the number of people living in that administrative unit. Proceed to 03_demo_explore_results.ipynb to explore the output of these functions!