Chapter 9 Geo-Spatial Analysis and Visualizations
Geo-spatial data is the easiest kind of data to make a methodological mistake with. We’ll be layering data points on top of a map to indicate location, and we need to make sure our visualizations actually show the right data in the right place.
It’s also common to see errors in the data itself. When I map crimes in Philadelphia, I often get a cluster of data points concentrated in one area of Florida. Why? Someone made a mistake when entering the latitude and longitude points, and repeated the mistake multiple times. The most common errors with latitude and longitude are reversing the two values and getting the sign wrong (positive instead of negative, or the other way around).
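One way to catch both kinds of error before mapping is to test each point against a rough bounding box for the area you expect. The sketch below uses only base R. The records are invented for illustration, and the box around Philadelphia is an approximate assumption, not an official boundary.

```r
# Hypothetical records: these coordinates are invented for illustration.
crimes <- data.frame(
  id  = 1:4,
  lat = c(39.95, 40.01, -75.16, 39.98),
  lng = c(-75.17, -75.13, 39.95, 75.15)
)

# Rough bounding box around Philadelphia (approximate, an assumption).
lat_range <- c(39.8, 40.2)
lng_range <- c(-75.3, -74.9)

# TRUE when a latitude/longitude pair falls inside the box.
in_box <- function(lat, lng) {
  lat >= lat_range[1] & lat <= lat_range[2] &
    lng >= lng_range[1] & lng <= lng_range[2]
}

crimes$ok <- in_box(crimes$lat, crimes$lng)
# Would the point land in the box if lat and lng were swapped?
crimes$swapped <- !crimes$ok & in_box(crimes$lng, crimes$lat)
# Would it land in the box if the longitude's sign were flipped?
crimes$bad_sign <- !crimes$ok & in_box(crimes$lat, -crimes$lng)

crimes
```

Rows flagged in the swapped or bad_sign columns are candidates for repair; rows that fail every check need a closer look by hand.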
9.1 Working With Geo-Spatial Data
Most data software has the ability to recognize cities, states, and countries by name. In R, you can simply type:
state.name
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
R will print them all out for you. It also has state abbreviations built in, as ‘state.abb’.
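Because the two built-in vectors are in the same order, they can be paired up directly. A small sketch in base R; the data frame name and column names are my own choices.

```r
# state.name and state.abb are built-in vectors in the same order,
# so they can be paired up position by position.
states <- data.frame(State = state.name, Abbreviation = state.abb)
head(states, 3)

# Look up a single abbreviation by state name.
state.abb[match("Pennsylvania", state.name)]
```

A lookup table like this is handy when one data set identifies states by full name and another uses abbreviations.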
With just the names of the states and a data point for each one, we could create what’s called a choropleth map, one made up of colored regions that indicate the data level. Whereas an election map will show each state as either red or blue, a choropleth map usually uses shades of a single color to indicate a level: the more ‘red’ a state is, the more homicides per capita happened there, for instance. This leads us to the most common geo-spatial error: not accounting for population. If I told you there were more gun deaths in California in 2020 than in Mississippi, I could show data like this:
State | Firearm_Deaths |
---|---|
California | 3,449 |
Mississippi | 818 |
(Data from the CDC)
Does California have a bigger gun problem than Mississippi? That would be the incorrect conclusion. The two states have very different populations, so to compare them we have to calculate a rate - in this case, the rate of firearm deaths per capita. Because the per-person number is tiny, it is usually multiplied by 100,000 and reported as ‘deaths per 100,000 people.’
firearm_deaths / population = rate
The same would go for countries, cities, or any other data related to a particular location.
So where can we get population data? It’s usually quite easy to find; for U.S. states, the Census Bureau publishes it. We could, in theory, merge state population data with our firearm_deaths data frame - the only requirement for merging is that each data frame has a shared column (in this case, it would probably be ‘State’). Once merged, we could use the mutate function to calculate the rate (deaths / population).
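Here is a minimal sketch of that merge-then-calculate step. To avoid extra packages it uses base R’s merge() and ordinary column assignment rather than mutate. The death counts come from the table above; the population figures are an assumption on my part (2020 Census resident counts), and the column names are illustrative.

```r
# Firearm deaths from the CDC table above.
firearm_deaths <- data.frame(
  State = c("California", "Mississippi"),
  Firearm_Deaths = c(3449, 818)
)

# Assumption: 2020 Census resident population counts.
population <- data.frame(
  State = c("California", "Mississippi"),
  Population = c(39538223, 2961279)
)

# merge() joins the two data frames on their shared 'State' column.
merged <- merge(firearm_deaths, population, by = "State")

# Crude rate per 100,000 people: deaths / population * 100,000.
merged$Rate_per_100k <- merged$Firearm_Deaths / merged$Population * 100000
merged
```

With dplyr loaded, mutate(merged, Rate_per_100k = Firearm_Deaths / Population * 100000) would do the same job. This is a crude rate, so it may not match a published figure exactly if the publisher applies adjustments (age adjustment, for example), but it is enough to show that the ranking of the two states flips once population is taken into account.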
In the case of this data, though, the CDC has already calculated a rate based on population (deaths per 100,000 people). Looking at their table shows that California ranks 44th in firearm death rate, and Mississippi is 1st:
State | Death_Rate |
---|---|
California | 8.5 |
Mississippi | 28.6 |
The point being, the most common error in geo-spatial analysis is not taking population into account.
9.2 Geo-Spatial Data Types
Geo-spatial data can come in a number of formats, the most common being:
- geoJSON, or JSON (JavaScript Object Notation) with a ‘geometry’ column.
- A CSV, where columns such as ‘geometry,’ ‘latitude,’ and ‘longitude’ (or their equivalents) hold locations that R can recognize.
- A Shapefile, which ‘draws’ boundaries by connecting lat/long points
Geo-spatial plotting also requires a base layer - the map itself - along with data that is superimposed on top of the map. This data is either built into the dataset you’re using, or must be merged together with your data.
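As a sketch of the CSV case, the sf package’s st_as_sf() function can turn ordinary latitude and longitude columns into a ‘geometry’ column. The two rows below reuse coordinates that appear in this chapter; the CSV text and column names are hypothetical, and the example assumes sf is installed.

```r
library(sf)

# Hypothetical CSV contents, read from a string to keep the example self-contained.
csv_text <- "name,latitude,longitude
Golden Gate Bridge,37.819988,-122.478534
Birthplace of R,-36.852,174.768"
places <- read.csv(text = csv_text)

# coords takes the x column (longitude) first, then the y column (latitude).
# crs = 4326 is the standard latitude/longitude (WGS 84) reference system.
places_sf <- st_as_sf(places, coords = c("longitude", "latitude"), crs = 4326)
places_sf
```

Note the order inside coords: longitude comes first. Swapping the two is exactly the kind of reversal error described at the start of the chapter.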
9.3 Geo-Spatial R Packages
There are loads of different approaches to visualizing geo-spatial data in R. For consistency and ease, we’ll use Leaflet, which lets us chain steps together with the ‘%>%’ pipe operator, just like the tidyverse. Let’s also load a package called ‘sf,’ or ‘simple features,’ which makes it very easy to import geoJSON files (and Shapefiles) as data frames; another called ‘sp,’ an older package of spatial data classes that some Shapefile workflows still rely on; and the ‘maps’ package, which helps with our map ‘projection,’ or how our map deals with the curvature of the Earth.
install.packages('leaflet')
install.packages('sf')
install.packages('sp')
install.packages('maps')
library(leaflet)
library(sf)
library(sp)
library(maps)
Here’s the example code from the Leaflet for R documentation:
m <- leaflet() %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R")
m  # Print the map
What does that say? First of all, we’re creating a variable, m, that holds the map, so to see the map we just need to type the name of the variable. We tell R to use leaflet(), then add a background layer of map tiles - these are supplied by OpenStreetMap - then add a marker at a specified longitude and latitude, with a popup on the marker saying that this is where R was created.
Great! If that works, let’s change the values and make our own map. I’ll center mine around the Golden Gate Bridge, and add a marker for it as well.
What’s the latitude/longitude pair for this location? I like to use Google Maps to get that info. If I go to Google Maps and enter ‘Golden Gate Bridge,’ I get this: https://www.google.com/maps/place/Golden+Gate+Bridge/@37.8199286,-122.4804438,17z/
…and a lot of other gibberish - but the latitude and longitude are right there in the URL, just after the ‘@’ symbol, although they can be hard to spot among everything else. Another way to get the latitude/longitude pair is to click on the map itself to create a marker; a popup at the bottom of the screen should show the pair.
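If you’d rather not pick the numbers out by eye, a regular expression can do it. This is only a sketch in base R: it assumes the URL keeps the ‘@latitude,longitude’ layout shown above, which Google could change at any time.

```r
# The URL from the text; the pattern assumes Google Maps keeps the
# "@latitude,longitude" layout, which could change.
url <- "https://www.google.com/maps/place/Golden+Gate+Bridge/@37.8199286,-122.4804438,17z/"

# Capture the two numbers that follow the '@' symbol.
parts <- regmatches(url, regexec("@(-?[0-9.]+),(-?[0-9.]+)", url))[[1]]

lat <- as.numeric(parts[2])  # first number is the latitude
lng <- as.numeric(parts[3])  # second number is the longitude
c(lat = lat, lng = lng)
```

Google lists latitude first, while Leaflet’s addMarkers() example passes lng first, so it is worth naming the arguments explicitly when you copy the values over.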
Let’s try re-using the above Leaflet code, but changing the location and popup content:
m <- leaflet() %>%
  addTiles() %>%
  addMarkers(lng=-122.478534, lat=37.819988, popup="Golden Gate Bridge")
m
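With a single marker, Leaflet chooses the view on its own. To set the center and zoom level explicitly, one option is Leaflet’s setView() function. A sketch, assuming the leaflet package is installed; the zoom level of 13 is an arbitrary choice of mine.

```r
library(leaflet)

# Coordinates are the ones used in the text; zoom = 13 is an arbitrary choice.
m <- leaflet() %>%
  addTiles() %>%
  setView(lng = -122.478534, lat = 37.819988, zoom = 13) %>%
  addMarkers(lng = -122.478534, lat = 37.819988, popup = "Golden Gate Bridge")
m
```

Higher zoom numbers bring the map in closer; lower numbers pull it back.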
Okay, we are able to reproduce the demo code with a new location. Now let’s get into plotting data on maps.
9.4 Adding External Data
Now that we have the (very) basics of leaflet down, let’s try to load some geoJSON and plot it. geoJSON is the easiest geo-spatial format to work with because, to oversimplify a bit, once loaded it acts as both a regular data frame and a spatial object. So we can use dplyr’s data-cleaning tools on it - filter, summarize, mutate, and so on. We can also merge it with other data sets, be they spatial or non-spatial.
Let’s get geoJSON that defines the county boundaries of California:
https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-county-boundaries/explore
Click on the ‘download’ icon (a cloud with an arrow) and choose to download the GeoJSON file.
To load the file into R, we’ll use the read_sf() function of the sf package. Please note that the code here does not include the path to the file; if, for instance, the downloaded file is in your ‘Downloads’ folder on a Mac, the path would be:
“~/Downloads/California_County_Boundaries.geojson”
The tilde ( ~ ) translates as ‘starting in the User’s home folder.’ You could replace ‘Downloads’ with ‘Desktop’ or anything else as needed. If you run into issues, check the chapter on Errors.
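Before trying to load the file, it can help to confirm that R can actually see it. A small base R sketch, using the example path from above; adjust the folder and file name to match where your download landed.

```r
# The path from the text; adjust the folder and file name to match your machine.
path <- "~/Downloads/California_County_Boundaries.geojson"

# path.expand() shows what the tilde stands for on this computer.
path.expand(path)

# file.exists() returns TRUE if R can see a file at that path, FALSE otherwise.
file.exists(path)
```

If file.exists() returns FALSE, the problem is the path or the file name, not the geo-spatial code.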
ca <- read_sf("California_County_Boundaries.geojson")
Define a color palette
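# colorNumeric() takes a palette and a domain; a third unnamed argument
# would be read as na.color, so the stray 58 is dropped here.
pal <- colorNumeric("viridis", domain = ca$OBJECTID)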
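colorNumeric() builds a function that maps numbers onto a continuous color scale, here the ‘viridis’ palette. OBJECTID is just a row identifier, so the colors in the map below only tell the counties apart; they don’t measure anything. The Leaflet for R documentation has a page on colors that covers the other palette functions.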
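To see what a palette actually does, it helps to call it directly. The sketch below assumes the leaflet package is installed and uses a stand-in for ca$OBJECTID; I am assuming the IDs run from 1 to 58, one per California county, which may not match the real file.

```r
library(leaflet)

# Stand-in for ca$OBJECTID: assumes the IDs run from 1 to 58, one per county.
ids <- 1:58
pal <- colorNumeric("viridis", domain = ids)

# pal is a function: give it numbers, get back hex color codes.
pal(c(1, 29, 58))
```

Values at opposite ends of the domain come back as colors from opposite ends of the palette, which is what lets a choropleth show low and high values at a glance.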
Let’s plot it!
leaflet(ca) %>%
  addTiles() %>%
  addPolygons(stroke = FALSE,
              smoothFactor = 0.9,
              fillOpacity = 0.7,
              fillColor = ~pal(OBJECTID),  # color each county by its ID
              label = ~COUNTY_NAME         # show the county name on hover
  )