Getting Started with Geospatial Analysis
Using geographic data and geospatial images to study climate change, natural disasters, or human activity.
Hands-on Tutorials on SageMaker Studio Lab
Geographic data includes geospatial data captured by satellite imagery and the Global Positioning System (GPS), as well as other data described explicitly in terms of geographic coordinates. Geospatial analysis involves collecting, reporting, plotting, and analyzing this data using statistical methods and machine learning. It is used to understand the impact of climate change, natural disasters, or human activity, usually at a specific location.
Challenges Working With Geospatial Data
Working with geospatial data presents three main challenges: size, format, and availability.
First, geospatial data can be huge, making it difficult and sometimes impossible to work with unless you have high-capacity compute. One workaround is to narrow your study area and time period; a small subset is relatively easy to retrieve and analyze.
Next is the complexity of the different formats. Geospatial data is captured in multiple formats that use compression algorithms to make data transmission and storage efficient. In addition, when you work with geographic data you will often combine multiple sources. These sources may capture data with different projections (more on that later), so you need to preprocess and re-project the data before you can use it for analysis.
Finally, as the satellites move around the earth, data is captured in batches over a period of time. You may not have sufficient data or sufficient quality data for the specific time and location you are trying to study. In such cases, you may need to adjust your time window or rely on data processing to clean and augment this data.
Prerequisites
This post covers the basics of getting started with geospatial data analysis on SageMaker Studio Lab. We first look at available geographic datasets and then at a geospatial dataset from the AWS open data registry. Using the geographic datasets, we explore California lakes and counties; we then focus on Lake Shasta in California, using Sentinel-2 geospatial data to calculate spectral indices.
If you want to follow along, you will need the following prerequisites, all of which are free:
- A SageMaker Studio Lab account
- A Free Tier AWS account
- A Free Trial Sentinel Hub account
Amazon SageMaker Studio Lab offers CPU and GPU environments so you can complete all steps described here using the free environment.
Setting Up Your Environment
Amazon SageMaker Studio Lab is a free online web application for learning and experimenting with data science and machine learning using Jupyter notebooks. With Amazon SageMaker Studio Lab you can save and resume your work, access the command line and clone git repositories.
One of the first things you will need is a custom environment within SageMaker Studio Lab. Creating an environment in Studio Lab is easy. Go to your cloned directory, right-click the environment.yml file, and create an environment. You can also choose to build the environment from environment.yml automatically while cloning the repository in Studio Lab. This will create a new Studio Lab kernel with all the packages needed. After the environment creation is complete, you can open the notebook and select the newly created kernel.
Optionally you can also uncomment the package installation section of the notebook to install these packages manually.
Downloading The Data
Geographic data is generally available in Shapefile, GeoPackage, or GeoJSON formats. Let’s start by downloading shapefiles that include the geographic vector data for California counties and water bodies.
- The CA Counties Dataset contains boundaries for California counties, and places from the US Census Bureau’s MAF/TIGER database.
- The California Water Bodies dataset is published by the California Department of Fish and Game, Marine Region.
After the files are downloaded, we need to unzip them into a local directory.
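The unzip step can be sketched with Python's standard zipfile module. This is a minimal illustration, not the notebook's own code: the archive name and its placeholder contents are hypothetical stand-ins for a downloaded shapefile bundle.

```python
import tempfile
import zipfile
from pathlib import Path


def unzip_archive(zip_path: Path, dest_dir: Path) -> list[str]:
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Demo with a stand-in archive; a shapefile download typically bundles
# .shp, .shx, .dbf and .prj files in the same way.
workdir = Path(tempfile.mkdtemp())
demo_zip = workdir / "ca_counties.zip"  # hypothetical file name
with zipfile.ZipFile(demo_zip, "w") as archive:
    for suffix in (".shp", ".shx", ".dbf", ".prj"):
        archive.writestr(f"ca_counties{suffix}", "placeholder")

extracted = unzip_archive(demo_zip, workdir / "ca_counties")
print(sorted(extracted))
```

Keeping each dataset in its own directory makes it easier to clean up storage later.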
EDA with Geographic Data
Once we have the data locally, we can read it and start exploratory data analysis. In our case, we downloaded shapefiles. A shapefile is an ESRI vector data storage format that stores the location, shape, and attributes of geographic features. The geopandas Python package makes it easy to read shapefiles into a GeoDataFrame. A geopandas.GeoDataFrame is a pandas.DataFrame with a column of geometric data. In addition to the standard pandas.DataFrame attributes, it has two more, crs and geometry, both of which are optional. The crs attribute specifies the coordinate reference system (CRS) of the geometric data. The geometry attribute holds the actual coordinates of the geographic features, stored with a geometry data type.
Let’s read the counties shapefile into a geopandas.GeoDataFrame.
Like a standard DataFrame, a geopandas.GeoDataFrame has a plot method that you can use to create geographic visualizations.
Similarly, we will read the California Lakes shapefile into a geopandas.GeoDataFrame and plot it.
Data Wrangling
We briefly mentioned projections and coordinate reference systems earlier. A coordinate reference system (CRS) defines how the two-dimensional coordinates in our geopandas.GeoDataFrame map to real locations on the three-dimensional earth. When you use geographic data from different sources, chances are that they use different projections.
We want to overlay the California lakes dataset on the counties dataset and visualize the lakes along with the California counties. Before we can do that, we need to ensure they use the same coordinate reference system (CRS). The crs attribute of a geopandas.GeoDataFrame object reports it. When you check counties.crs and lakes.crs, you notice that the California counties and lakes data have different CRS values: the counties dataset uses EPSG:3857, and the lakes dataset uses EPSG:4326. Before we can use these datasets together, we need to re-project the lakes to the same CRS as the counties.
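The CRS check and re-projection can be sketched as follows. The two frames are tiny stand-ins built in memory rather than the downloaded datasets, and the coordinates are approximate and chosen only for illustration; the only geopandas calls used are the crs attribute and to_crs.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Stand-in data: one square "county" in Web Mercator (EPSG:3857)
# and one "lake" point in WGS 84 longitude/latitude (EPSG:4326).
counties = gpd.GeoDataFrame(
    {"name": ["Demo County"]},
    geometry=[box(-13_650_000, 4_950_000, -13_600_000, 5_000_000)],
    crs="EPSG:3857",
)
lakes = gpd.GeoDataFrame(
    {"name": ["Demo Lake"]},
    geometry=[Point(-122.4, 40.7)],  # approximate longitude, latitude
    crs="EPSG:4326",
)

# The two frames start out in different coordinate reference systems.
print(counties.crs.to_epsg(), lakes.crs.to_epsg())

# Re-project the lakes into the counties' CRS; to_crs returns a new GeoDataFrame.
lakes_projected = lakes.to_crs(counties.crs)
print(lakes_projected.crs == counties.crs)

# Once both share a CRS, spatial relationships become meaningful.
print(counties.geometry.iloc[0].contains(lakes_projected.geometry.iloc[0]))
```

Passing counties.crs rather than a hard-coded EPSG code keeps the two frames aligned even if the counties source changes its projection.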
Once you have the geographic datasets in the same projection, you can overlay and plot them.
A common task in geographic data analysis is narrowing down the area of study. We can select a subset of the data to create a new geopandas.GeoDataFrame object for further analysis and visualization. For our example, let’s focus on Lake Shasta.
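Selecting a subset works the same way as filtering in pandas. In this sketch the lakes table is a small in-memory stand-in with approximate coordinates, and the column name "NAME" is an assumption; inspect the columns of the real dataset before filtering.

```python
import geopandas as gpd
from shapely.geometry import Point

# Stand-in lakes table; the column name "NAME" is an assumption, so check
# lakes.columns on the real dataset before filtering.
lakes = gpd.GeoDataFrame(
    {"NAME": ["Lake Shasta", "Lake Tahoe", "Clear Lake"]},
    geometry=[Point(-122.4, 40.7), Point(-120.0, 39.1), Point(-122.8, 39.0)],
    crs="EPSG:4326",
)

# Boolean indexing works as in pandas and returns a GeoDataFrame.
shasta = lakes[lakes["NAME"].str.contains("Shasta", case=False, na=False)]

print(type(shasta).__name__, len(shasta))
print(shasta.total_bounds)  # [minx, miny, maxx, maxy] of the selection
```

The total_bounds of the selection is a convenient starting point for a bounding box when searching for imagery of the same area.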
Once we have our area of interest selected, it becomes easy to visualize and study it better.
Working With Geospatial Images
Next, we move from geographic vector data to geospatial images. For geospatial images, we will use Sentinel-2. The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high-resolution optical imagery and continuity for the current SPOT and Landsat missions. The Sentinel-2 dataset is available publicly at the AWS open data registry.
The sentinelhub Python package makes it easy to search for and download data specific to our focus area. The following code snippet shows how to configure the connection. In the example, I am using an optional JSON file to store and retrieve my credentials.
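Keeping credentials in a JSON file might look like the sketch below, which covers only the loading and validation step with the standard library. The key names are hypothetical; use whatever names your own file defines, and assign the loaded values to the sentinelhub configuration afterwards.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical key names; use whatever names your own credentials file defines.
REQUIRED_KEYS = ("instance_id", "aws_access_key_id", "aws_secret_access_key")


def load_credentials(path: Path) -> dict[str, str]:
    """Read a JSON credentials file and fail early if a key is missing."""
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"credentials file is missing: {', '.join(missing)}")
    return credentials


# Demo with placeholder values written to a temporary file.
demo_path = Path(tempfile.mkdtemp()) / "credentials.json"
demo_path.write_text(json.dumps(dict.fromkeys(REQUIRED_KEYS, "placeholder")))

credentials = load_credentials(demo_path)
print(sorted(credentials))
```

Storing the file outside the cloned repository helps avoid committing secrets by accident.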
The Sentinel-2 dataset on AWS contains global data since January 2017, and new data is added periodically. Before we download anything, we need to specify the search area and the time window we want to study. In our case, we focus on the Lake Shasta region, which we specify as a bounding box, and pick an arbitrary time period.
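One way to express the search criteria is as a plain bounding box tuple and a pair of ISO dates, sketched here without any geospatial library. The centre coordinates are only an approximation of Lake Shasta's location, and the half-width and dates are arbitrary values chosen for illustration.

```python
from datetime import date

# Approximate centre of Lake Shasta (an assumption for illustration only).
CENTER_LON, CENTER_LAT = -122.4, 40.75


def bounding_box(lon: float, lat: float, half_width_deg: float) -> tuple[float, float, float, float]:
    """Return (min_lon, min_lat, max_lon, max_lat) around a centre point."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("longitude/latitude out of range")
    return (
        lon - half_width_deg,
        lat - half_width_deg,
        lon + half_width_deg,
        lat + half_width_deg,
    )


def time_window(start: date, end: date) -> tuple[str, str]:
    """Return ISO-formatted (start, end) dates, checking their order."""
    if end < start:
        raise ValueError("end date precedes start date")
    return (start.isoformat(), end.isoformat())


search_bbox = bounding_box(CENTER_LON, CENTER_LAT, half_width_deg=0.25)
search_window = time_window(date(2021, 6, 1), date(2021, 6, 30))
print(search_bbox)
print(search_window)
```

A small box and a short window keep the number of matching tiles, and the download size, manageable.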
Sentinel-2 data is captured and distributed as tiles, making it easy to transmit across the web. We iterate over the available tiles for our search criteria and select a specific tile. For best results, we pick a tile with the least cloud coverage.
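Picking the least cloudy tile amounts to taking a minimum over the search results. The records and field names below are hypothetical; real tile metadata will be structured differently, so adapt the key function accordingly.

```python
# Hypothetical search results; real tile metadata will use different field names.
tiles = [
    {"tile_id": "tile-a", "date": "2021-06-03", "cloud_cover_pct": 41.2},
    {"tile_id": "tile-b", "date": "2021-06-13", "cloud_cover_pct": 0.8},
    {"tile_id": "tile-c", "date": "2021-06-23", "cloud_cover_pct": 12.5},
]


def least_cloudy(candidates: list[dict]) -> dict:
    """Return the candidate with the lowest cloud cover percentage."""
    if not candidates:
        raise ValueError("search returned no tiles; widen the time window")
    return min(candidates, key=lambda tile: tile["cloud_cover_pct"])


best = least_cloudy(tiles)
print(best["tile_id"], best["date"])
```

Raising an error on an empty result makes the "not enough data for this time and place" case explicit instead of failing later during download.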
The Sentinel-2 satellites each carry a single multi-spectral instrument (MSI) with 13 spectral channels in the visible/near infrared (VNIR) and short wave infrared spectral range. For our example we will download eight specific bands for analysis. You can read more about these bands here.
Along with the spectral bands, Sentinel tiles also include a preview image. Let’s check that out first to ensure our area of interest is captured completely and clearly.
It’s also a good practice to spot-check a few additional bands to make sure we have everything. We plot Band 7 — Vegetation Red edge, Band 8 — NIR, and Band 8A — Narrow NIR.
Working with Raster Data
Geospatial imagery is essentially raster data. Sentinel-2 uses GeoTIFF, a gridded raster data format for satellite imagery and terrain models. A geospatial raster is similar to a digital image but is also accompanied by spatial information that connects the data to a particular location. This includes the raster’s extent and cell size, the number of rows and columns, and its coordinate reference system (CRS). The rasterio Python package can be used to read, inspect, and write geospatial raster data. Here we use rasterio to read these raster arrays and then create a true-color image.
A true-color image has a large file size, so verify your available storage before you create one.
Visualizing a TIFF image directly within Jupyter is not straightforward. You will need GIS software to open and view it. The true-color image of the Lake Shasta region below was rendered using QGIS.
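The compositing step itself is plain array work once the bands are in memory. The sketch below uses NumPy with synthetic arrays standing in for Band 4 (red), Band 3 (green), and Band 2 (blue); in practice the arrays would be read from the band files with rasterio. The percentile stretch is one common choice for display scaling, assumed here for illustration rather than taken from the notebook.

```python
import numpy as np


def stretch(band: np.ndarray, low: float = 2.0, high: float = 98.0) -> np.ndarray:
    """Rescale a band to 0..1 between its low and high percentiles."""
    lo, hi = np.percentile(band, [low, high])
    if hi == lo:  # flat band: avoid dividing by zero
        return np.zeros_like(band, dtype=np.float32)
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0).astype(np.float32)


def true_color(red: np.ndarray, green: np.ndarray, blue: np.ndarray) -> np.ndarray:
    """Stack stretched bands into a (rows, cols, 3) RGB array."""
    return np.dstack([stretch(red), stretch(green), stretch(blue)])


# Synthetic stand-ins for Band 4 (red), Band 3 (green) and Band 2 (blue).
rng = np.random.default_rng(seed=0)
b04, b03, b02 = (rng.integers(0, 4000, size=(64, 64)).astype(np.float32) for _ in range(3))

rgb = true_color(b04, b03, b02)
print(rgb.shape, rgb.dtype, float(rgb.min()), float(rgb.max()))
```

Stretching each band separately brightens dark scenes but can shift the color balance; a single shared scale factor is an alternative when faithful colors matter more than contrast.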
Calculating Spectral Indices
Spectral indices are combinations of the pixel values from two or more spectral bands in a multispectral image. Spectral indices are designed to highlight pixels showing the relative abundance or lack of a land cover type of interest in an image. Let’s look at three spectral indices.
Normalized Difference Vegetation Index (NDVI)
The Normalized Difference Vegetation Index (NDVI) is a graphical indicator used to analyze whether the area being observed contains live green vegetation. It is calculated as NDVI = (NIR - Red) / (NIR + Red) where NIR is the Near-infrared Band 8 and Red is Band 4.
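The NDVI formula translates directly into array arithmetic. This NumPy sketch uses a tiny synthetic grid in place of the real Band 8 and Band 4 rasters, and returns NaN where both bands are zero so that empty pixels do not trigger a division error.

```python
import numpy as np


def normalized_difference(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Compute (a - b) / (a + b), returning NaN where the denominator is zero."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    denominator = a + b
    result = np.full(denominator.shape, np.nan)
    np.divide(a - b, denominator, out=result, where=denominator != 0)
    return result


# Tiny synthetic reflectance grids standing in for Band 8 (NIR) and Band 4 (red).
nir = np.array([[0.50, 0.30], [0.05, 0.00]])
red = np.array([[0.10, 0.30], [0.15, 0.00]])

ndvi = normalized_difference(nir, red)
print(ndvi)
```

Casting to floating point before the division matters for real tiles, whose integer pixel values would otherwise be subject to integer arithmetic pitfalls.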
The earthpy Python package can be used for plotting spectral bands. We use it here to visualize the Normalized Difference Vegetation index around the Lake Shasta region.
You can see areas with vegetation in green, areas with more vegetation as darker shades of green. Water bodies have low to no vegetation and are shown in a contrasting orange.
Normalized Difference Water Index (NDWI)
The Normalized Difference Water Index (NDWI) uses near-infrared radiation and visible green light to enhance open water features while suppressing soil and terrestrial vegetation features. NDWI is useful for detecting water bodies. It is calculated as NDWI = (Green - NIR) / (Green + NIR) where Green is Band 3 and NIR is Band 8. Values greater than 0.5 usually correspond to water bodies.
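The same arithmetic, followed by the 0.5 threshold, yields a simple water mask. The reflectance values below are synthetic stand-ins for Band 3 and Band 8, chosen so that two pixels resemble water and two resemble vegetated land.

```python
import numpy as np


def ndwi(green: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Compute (green - nir) / (green + nir); NaN where both bands are zero."""
    green = green.astype(np.float64)
    nir = nir.astype(np.float64)
    denominator = green + nir
    index = np.full(denominator.shape, np.nan)
    np.divide(green - nir, denominator, out=index, where=denominator != 0)
    return index


# Synthetic reflectance values standing in for Band 3 (green) and Band 8 (NIR).
green = np.array([[0.09, 0.08], [0.10, 0.06]])
nir = np.array([[0.02, 0.40], [0.03, 0.30]])

index = ndwi(green, nir)
water_mask = index > 0.5  # threshold quoted in the text above
print(np.round(index, 2))
print(water_mask)
```

The threshold is a rule of thumb; shallow or turbid water may fall below it, so inspect the index values before relying on a fixed cutoff.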
Similar to how we plotted NDVI in the above section, you can use earthpy to plot NDWI for the observed region.
The visualization shows us the values plotted for the Lake Shasta region. You can see the lake area in blue. Land usually corresponds to much smaller values and is shown in shades between zero and 0.2.
Burn Area Index (BAI)
Burn Area Index (BAI) highlights burned land in the red to near-infrared spectrum by emphasizing the charcoal signal in post-fire images. The index is computed from the spectral distance from each pixel to a reference spectral point, where recently burned areas converge. Brighter pixels indicate burned areas.
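The "spectral distance to a reference point" can be made concrete with the commonly cited form of the index, BAI = 1 / ((0.1 - Red)^2 + (0.06 - NIR)^2). The sketch below assumes reflectance values scaled to the 0 to 1 range and treats the reference constants as assumptions rather than as the notebook's exact settings; the pixel values are synthetic.

```python
import numpy as np

# Reference reflectances of recently burned surfaces in the commonly cited
# BAI definition; treat them as assumptions and adjust for your own data.
RED_REFERENCE = 0.1
NIR_REFERENCE = 0.06


def burn_area_index(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """BAI = 1 / ((0.1 - red)^2 + (0.06 - nir)^2) for reflectance in 0..1."""
    red = red.astype(np.float64)
    nir = nir.astype(np.float64)
    distance_sq = (RED_REFERENCE - red) ** 2 + (NIR_REFERENCE - nir) ** 2
    # A pixel exactly at the reference point would divide by zero, so floor it.
    return 1.0 / np.maximum(distance_sq, 1e-12)


# Synthetic pixels: the first sits close to the burn reference,
# the second resembles healthy vegetation (low red, high NIR).
red = np.array([0.11, 0.05])
nir = np.array([0.08, 0.45])

bai = burn_area_index(red, nir)
print(bai)
```

Because the index is the inverse of a squared distance, pixels near the reference point get very large values, which is why burned areas appear as the brightest pixels.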
In Southern California, where I live, wildfires are commonly occurring natural disasters. Below we have an example of the BAI analysis of the Silverado Fire of 2020. The post-fire image clearly shows the burn scars from the fire.
Wrapping Up
This post covered setting up an environment for geospatial analysis in SageMaker Studio Lab, the basics of geographic data analysis, searching and downloading geospatial images, manipulating rasters, and calculating spectral indices. SageMaker Studio Lab is free, and we did not create any billable AWS resources as part of this exercise. However, the geographic and GIS data files that are downloaded and the images generated may take up a considerable amount of storage. Make sure to check your storage utilization and clean up files as needed.