I’m happy to announce that I’ve recently published the velox R package. velox allows performing common raster operations in R much faster than alternative packages. The following plot provides a brief overview of the speed-up associated with using velox, as opposed to the well-known raster package:

Benchmark: velox vs. the raster package

velox is on CRAN, so to install you can simply type

install.packages("velox")

in the R console. For more information, check out the velox website.
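To give a flavor of the interface, here is a minimal usage sketch written from memory (please check the velox documentation for the authoritative API): velox wraps a raster object in a VeloxRaster, whose methods, such as extract(), then do the heavy lifting in C++.

library(velox)
library(raster)
library(sp)

## Toy raster (extent 0-1 in both directions) and a single square polygon covering part of it
rst <- raster(matrix(runif(100 * 100), nrow=100, ncol=100))
poly <- SpatialPolygons(list(Polygons(list(
  Polygon(cbind(c(0.2, 0.8, 0.8, 0.2), c(0.2, 0.2, 0.8, 0.8)))), "1")))

## Wrap the raster in a VeloxRaster and extract the mean raster value under the polygon
vx <- velox(rst)
vx$extract(poly, fun=mean)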

In this blog post I explore the question of how to color maps in a way that ensures optimal readability, while using a small set of colors. I also introduce the new MapColoring R package, which provides a practical solution to this problem in a user-friendly manner.

The Problem

In my work, I often encounter situations where I need to plot a map of polygon features (such as countries, administrative units, or ethnic groups) simply to demonstrate the extent and quality of the spatial information. In other words, I don’t need to display any attribute data, but simply draw a (non-choropleth) map. How should I color the polygons? There are two easy, but often suboptimal options. First, I could use the same color for all features. This approach has the disadvantage that the individual regions are often difficult to tell apart, especially if there are islands or enclaves. Consider, for instance, the following map of Swiss cantons:

Spot the Cantons

Clearly, it is very difficult to distinguish between small cantons and enclaves (of which there are some). Second, I could use a distinct color for each feature. This makes distinguishing between areas easier, but often looks just terrible:

Swiss cantons, each drawn in a distinct color

A more effective solution would be one where we use a small number of colors, but ensure all regions are clearly and easily identifiable. This leads to an interesting question: How do we color a map with as few colors as possible while ensuring that no two adjacent features are of the same color?

Map Coloring and Graph Theory

It turns out that this problem has a fairly long history. Wikipedia informs us that British cartographer Francis Guthrie described the issue in 1852 when mapping English counties, and proposed what is known as the Four-Color-Theorem. Essentially, the latter states that (under certain conditions, discussed below) any map can be colored according to the above-described rules with at most four colors. Interestingly, the theorem was only proven in 1976 by mathematicians Kenneth Appel and Wolfgang Haken.

An important aspect of map coloring is that it is a specific case of a more general problem called graph coloring. Here, the goal is to color all vertices of a graph such that no vertices sharing an edge have the same color. In this context, the minimal number of colors with which a graph can be colored is called its chromatic number.

Our map-related problem can easily be restated in terms of graph theory: every polygon feature is a vertex, and edges between vertices represent adjacent polygons on the map. If transformed in this manner, most maps can be represented as planar graphs, that is, graphs that can be drawn such that no edges cross each other. In fact, the Four-Color-Theorem introduced above applies to all planar graphs, and states that the chromatic number of any planar graph is at most four.

Does this mean that in practice, all we ever need to color a map are four colors? Unfortunately not. The transformation from maps to planar graphs only works under certain conditions. First, we need to define adjacency to mean that two regions share at least one border segment of positive length. If we assume that regions are adjacent if they share only a point (or corner), then maps don’t automatically generate planar graphs. To see that this is the case, note that a hypothetical map of five regions that all share a common border point would result in the complete graph of five vertices, which is not planar. Second, we have no guarantee of obtaining a planar graph if the map we transform contains enclaves. We may again invoke a counterexample to prove that this is the case. Assume a map of five regions, where each region has an exclave in every other region. Again, this results in the (non-planar) complete graph of five vertices.

As an illustration that real-world maps don’t necessarily translate to planar graphs, see the following depiction of Swiss cantons as a network. Here I use the “standard” adjacency definition, under which only shared border segments of positive length qualify as adjacent:

Swiss cantons represented as an adjacency graph

Visual inspection suggests that there are crossing edges in central Switzerland, owing to the fact that the territory of Obwalden is not contiguous.

It follows that in order to color a wide range of maps, we require a coloring method for arbitrary graphs. Unfortunately, there is no equivalent to the Four-Color-Theorem for arbitrary graphs. Indeed, finding an arbitrary graph’s chromatic number and coloring it accordingly is NP-hard. Fortunately, however, there are a number of fast algorithms that yield satisfactory (albeit sometimes suboptimal) solutions to the graph coloring problem.
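One such heuristic is the DSATUR algorithm (more on it below): it repeatedly colors the uncolored vertex with the largest number of distinctly colored neighbors (its “saturation”), breaking ties by degree. Here is a minimal sketch of the idea for a logical adjacency matrix; this is my own illustration, not the implementation used by the package introduced below.

## Greedy DSATUR coloring for a logical adjacency matrix 'adj'
## (symmetric, FALSE on the diagonal); returns an integer color index per vertex
dsatur <- function(adj) {
  n <- nrow(adj)
  colors <- rep(NA_integer_, n)
  deg <- rowSums(adj)
  for (step in seq_len(n)) {
    uncolored <- which(is.na(colors))
    # Saturation: number of distinct colors among already-colored neighbors
    sat <- sapply(uncolored, function(v) length(unique(na.omit(colors[adj[v, ]]))))
    # Pick the uncolored vertex with the highest saturation, break ties by degree
    v <- uncolored[order(-sat, -deg[uncolored])[1]]
    # Assign the smallest color index not used by any neighbor
    colors[v] <- min(setdiff(seq_len(n), colors[adj[v, ]]))
  }
  colors
}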

The MapColoring R Package

In the following, I describe an R package I created that uses the greedy DSATUR graph coloring algorithm by Daniel Brélaz to assign colors to a collection of polygons. In addition to simply solving the map coloring problem, the package also allows selecting colors in a manner that ensures maximal visual contrast between adjacent features. Specifically, if provided a vector of N candidate colors (for example from a palette) and a set of polygon features, the “getOptimalContrast()” function selects K of these colors (where K is the number of colors needed to color the map) and assigns them to the features such that the average contrast ratio between adjacent features is maximized. The underlying optimization is performed with the CEoptim function in the package of the same name.

The new MapColoring package’s core functionality is contained in the three functions

  • getNColors
  • getColoring
  • getOptimalContrast

To demonstrate these functions, I use the sp package to create a toy map in the shape of a chess board:


## Make chess board
library(sp)
gt <- GridTopology(c(0,0), c(1,1), c(8,8))
sg <- SpatialGrid(gt)
board <- as(as(sg, "SpatialPixels"), "SpatialPolygons")

The getNColors function uses the greedy DSATUR algorithm to calculate the number of colors required to color the provided SpatialPolygons* object such that no two adjacent features have the same color. Note that if provided a SpatialPolygons* object, the getNColors function (as well as the getColoring and getOptimalContrast functions) constructs an adjacency matrix based on the assumption that only features sharing at least one border segment of positive length are adjacent – features sharing only a corner are thus not considered neighbors. If you’d like to adopt a different adjacency rule, you can directly pass a logical adjacency matrix (instead of a SpatialPolygons* object) to the MapColoring functions. In the next code segment, I pass the chess board, which is a SpatialPolygons object, to the getNColors function.

 

## Get estimate for chromatic number
Ncol <- getNColors(board)

As we would expect, this returns a value of 2.
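If we instead wanted corner-touching cells to count as adjacent (“queen” contiguity), we could build the adjacency matrix ourselves and pass it to the same function. A sketch using the spdep package (my own addition, not required by MapColoring):

## Queen contiguity: cells that share only a corner also count as neighbors
library(spdep)
nb <- poly2nb(board, queen=TRUE)    # neighbor list including corner neighbors
adj <- nb2mat(nb, style="B") > 0    # convert to a logical adjacency matrix
getNColors(adj)                     # >= 4: every 2x2 block of cells is now a clique

Under this rule the chess board corresponds to the so-called king graph, whose chromatic number is four rather than two.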

The getColoring function employs the greedy DSATUR algorithm to “color” the features of a SpatialPolygons* object (or an adjacency matrix). I write “color” because the function does not return actual colors, but an integer vector with a color index for each of the features in the provided map. Hence, if we pass the chess board to the getColoring function…

 

## Get coloring indices
coloring <- getColoring(board)

… we get an integer vector of length 64 containing the values 1, 2, 1, 2, 1, etc.
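These indices can be mapped to actual colors by hand, for instance:

## Map the color indices to a two-color palette of your choice
plot(board, col=c("grey80", "grey30")[coloring])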

Finally, as explained above, the getOptimalContrast function goes a step further, and instead of only assigning color indices, it returns actual colors. Provided a map and a vector of candidate colors, it selects and allocates these colors in a manner such that they match the indexing returned by the getColoring function, and ensures that the average contrast between adjacent polygons is maximized. Color contrasts are measured using their Contrast Ratio, as defined here.
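For reference, the contrast ratio of two colors is defined via their relative luminance. A minimal sketch of the computation (my own helper following the WCAG definition, not an exported MapColoring function):

## WCAG contrast ratio between two R colors
contrast.ratio <- function(col1, col2) {
  rel.lum <- function(col) {
    chan <- col2rgb(col)[, 1] / 255
    # Linearize the sRGB channels, then take the luminance-weighted sum
    chan <- ifelse(chan <= 0.03928, chan / 12.92, ((chan + 0.055) / 1.055)^2.4)
    sum(c(0.2126, 0.7152, 0.0722) * chan)
  }
  lums <- sort(c(rel.lum(col1), rel.lum(col2)), decreasing=TRUE)
  (lums[1] + 0.05) / (lums[2] + 0.05)
}

contrast.ratio("yellow", "blue")  # about 8 – a very high contrast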

In the following code segment, I pass the chess board together with 20 colors from R’s Rainbow palette to the getOptimalContrast function:


## Get Optimal contrast colors
cand.colors <- rainbow(20)
opt.colors <- getOptimalContrast(x=board, col=cand.colors)

## Plot
plot(board, col=opt.colors)

The result looks as follows:

The colored chess board

Interestingly, of the colors passed to the function, the blue-yellow combination appears to have the highest contrast ratio, and indeed, the cells are very clearly distinguishable.

Finally, as a more realistic example, I use the getOptimalContrast function to plot a world map obtained from the cshapes package. Instead of relying on the adjacency definition employed above, I define two countries to be adjacent if their borders are within 4 decimal degrees of each other. One reason to do this is to avoid confusion over the status of islands: under the stricter adjacency definition used above, an island and a nearby mainland region never share a border, and may thus be assigned the same color even though they belong to different countries. As the set of candidate colors, I use a palette from the RColorBrewer package.


## Country borders in 2012
library(cshapes)
world.spdf <- cshp(as.Date("2012-06-30"))

## Loose adjacency definition: buffer each country by 2 degrees, so two
## countries count as adjacent if their borders are within 4 degrees of each other
library(rgeos)
adj.mat <- gIntersects(gBuffer(world.spdf, width=2, byid=TRUE), byid=TRUE)

## Get Optimal contrast colors
library(RColorBrewer)
cand.colors <- brewer.pal(11, "RdGy")
opt.colors <- getOptimalContrast(x=adj.mat, col=cand.colors)

## Plot
plot(world.spdf, col=opt.colors)

The result of this exercise looks like this:

World map colored with the RdGy palette

 

It’s been a while since my last post, but finally there’s something to report! I’ve created a new GitHub repository where I maintain tools for map classification in Python.

Check it out here.

As much as I like working with R, coding in Python has proven a much faster and more convenient alternative for working with image data, primarily because it allows using the excellent OpenCV toolkit. Currently, the new GitHub repository hosts a loosely connected set of Python modules that are useful for automatic map classification. Among them is a PixelClassifier class, which essentially implements a Python version of the supervised classification documented here, as well as some preliminary tools for character detection and recognition. I’m currently working on additional tools for text removal, as well as line and shape detection, which I’ll make available once they work reliably.

 

There are numerous excellent packages to process and analyze spatial data in R. However, when a project involves multiple and/or large spatial data sets, it often makes sense to make use of a spatial database for data management.

I use a PostGIS database (which is an extension to the PostgreSQL system) for storing and organizing spatial data for my projects. Unfortunately, transferring spatial vector data between R and PostGIS may be a bit of a hassle if you’re not working on a Linux machine.

If you are working with a Linux distribution, it appears that you can use the OGR “PostGIS” driver that comes with rgdal, see here. Unfortunately, the OGR “PostGIS” driver doesn’t seem to be prepackaged in the rgdal installation for Windows or Mac (and likely never will be), so you’ll either have to install the driver manually (which I didn’t manage to do…), or rely on a different approach.

One alternative to using the OGR driver is to convert the spatial data into text and transfer the data using a non-spatial PostgreSQL driver for R. I’ve written two very simple functions, one to upload spatial vector data from R to a PostGIS database, and one to do the opposite, based on this method. Since they’ve served me well until now, I thought they may also be helpful for others.

The dbWriteSpatial function converts the spatial information into the WKT format using the writeWKT function from the rgeos package, transfers the data to the DB using the PostgreSQL driver from the RPostgreSQL package, and then re-creates the geometries on the PostGIS database.

require(rgeos)
require(RPostgreSQL)

dbWriteSpatial <- function(con, spatial.df, schemaname="public", tablename, replace=FALSE) {
 
  # con:          A PostgreSQL connection (from RPostgreSQL)
  # spatial.df:   A Spatial Data Frame object
  # schemaname:   Target schema name
  # tablename:    Target table name
  # replace:      Replace the target table if it already exists
 
  # Create well known text and add to spatial DF
  spatialwkt <- writeWKT(spatial.df, byid=TRUE)
  spatial.df$wkt <- spatialwkt
 
  # Add temporary unique ID to spatial DF
  spatial.df$spatial_id <- 1:nrow(spatial.df)
 
  # Set column names to lower case
  names(spatial.df) <- tolower(names(spatial.df))
 
  # Upload DF to DB
  data.df <- spatial.df@data
  rv <- dbWriteTable(con, c(schemaname, tablename), data.df, overwrite=replace, row.names=FALSE)
 
  # Create geometry column and clean up table
  schema.table <- paste(schemaname, ".", tablename, sep="")
  query1 <- paste("ALTER TABLE ", schema.table, " ADD COLUMN the_geom GEOMETRY;", sep="")
  query2 <- paste("UPDATE ", schema.table, " SET the_geom = ST_GEOMETRYFROMTEXT(t.wkt) FROM ", schema.table, " t  WHERE t.spatial_id = ", schema.table, ".spatial_id;", sep="")
  query3 <- paste("ALTER TABLE ", schema.table, " DROP COLUMN spatial_id;", sep="")
  query4 <- paste("ALTER TABLE ", schema.table, " DROP COLUMN wkt;", sep="")
  er <- dbSendQuery(con, statement=query1)
  er <- dbSendQuery(con, statement=query2)
  er <- dbSendQuery(con, statement=query3)
  er <- dbSendQuery(con, statement=query4)
 
  return(TRUE)
}

The dbReadSpatial function converts the vector data into the WKT format on the DB, transfers the data into a data frame using the driver from the RPostgreSQL package, saves the data frame in CSV format in the temporary directory of the current R session on the local machine, and then imports the CSV as a Spatial Data Frame object using the appropriate OGR driver in the rgdal package. I’ve implemented the detour via the CSV file because the OGR CSV driver is very fast in converting the WKT information into the appropriate sp object. The alternative, using the readWKT function from the rgeos package, is much slower in most cases, since the readWKT function only accepts one string at a time, and thus necessitates a tedious looping structure.

require(rgeos)
require(rgdal)
require(RPostgreSQL)

dbReadSpatial <- function(con, schemaname="public", tablename, geomcol="the_geom", idcol=NULL) {
 
  # con:          A PostgreSQL connection (from RPostgreSQL)
  # schemaname:   Source schema name
  # tablename:    Source table name
  # geomcol:      Name of the geometry column in the source table (the table may not have more than one geometry column!)
  # idcol:        Name of the column with unique IDs to be used in the ID slot of the spatial objects (not relevant for point data)
 
  ## Build query and fetch the target table
  # Get column names
  q.res <- dbSendQuery(con, statement=paste("SELECT column_name FROM information_schema.columns WHERE table_name ='", tablename, "' AND table_schema ='", schemaname, "';", sep=""))
  schema.table = paste(schemaname, ".", tablename, sep="")
  q.df <- fetch(q.res, -1)
  # Some safe programming
  if (!(geomcol %in% q.df[,1])) {stop(paste("No", geomcol, "column in specified table."))}
  if (!is.null(idcol)) {
    if (!(idcol %in% q.df[,1])) {stop(paste("Specified idname '", idcol, "' not found.", sep=""))}
  }
  # Get table, returning the geometry column as WKT text (aliased to its original name)
  query <- paste("SELECT", paste(q.df[,1][q.df[,1] != geomcol], collapse=", "), paste(", ST_ASTEXT(", geomcol, ") AS ", geomcol, " FROM", sep=""), schema.table, ";")
  t.res <- dbSendQuery(con, statement=query)
  t.df <- fetch(t.res, -1)
 
  ## Get geometry ID column number
  if (!is.null(idcol)) {
    idcolnum <- which(names(t.df) == idcol)
  } else {
    t.df$id.new <- 1:nrow(t.df)
    idcolnum <- which(names(t.df) == "id.new")
  }
 
  ## Get geometry column number
  geomcolnum <- which(names(t.df) == geomcol)
 
  ## Build spatial data frame using OGR
  write.df <- t.df[,geomcolnum,drop=FALSE]
  names(write.df) <- "WKT"
  filename <- paste("vector_", as.character(format(Sys.time(), "%H_%M_%S")), sep="")
  filename.csv <- paste(filename, ".csv", sep="")
  write.csv(write.df, paste(gsub("[\\]", "/", tempdir()), "/", filename.csv, sep=""), row.names=TRUE)
  down.spdf <- readOGR(dsn=paste(gsub("[\\]", "/", tempdir()), "/", filename.csv, sep=""), layer=filename, verbose=FALSE)
  rv <- file.remove(paste(gsub("[\\]", "/", tempdir()), "/", filename.csv, sep=""))
  data.df <- data.frame(t.df[,-geomcolnum])
  names(data.df) <- names(t.df)[-geomcolnum]  

  # For Spatial Points Data Frame  
  if (grepl("POINT", t.df[1,geomcolnum])) {
    spatial.df <-  SpatialPointsDataFrame(down.spdf@coords, data.df, match.ID=FALSE)
  }
  # For Spatial Polygons/Lines Data Frame    
  if (grepl("POLYGON", t.df[1,geomcolnum]) | grepl("LINE", t.df[1,geomcolnum])) {
    spatial.df <- down.spdf
    spatial.df@data <- data.df
    spatial.df <- spChFIDs(spatial.df, paste(t.df[,idcolnum]))
  }
  return(spatial.df)
}

A session might look something like this:

# Get connection
con <- dbConnect(dbDriver("PostgreSQL"), dbname="postgis", port=5432, host="localhost", user="postgres", password="pw")

# Write myspatialdata.spdf to PostGIS DB
dbWriteSpatial(con, myspatialdata.spdf, schemaname="myschema", tablename="myspatialdata", replace=FALSE)

# Get myschema.myspatialdata from PostGIS DB
myspatialdata.spdf <- dbReadSpatial(con, schemaname="myschema", tablename="myspatialdata", geomcol="the_geom")

The functions do not handle projection information at this time, though I might add this later… Let me know if you detect any bugs or have any suggestions for improvement!
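In the meantime, if you know the coordinate reference system of your data, you can re-attach it manually after reading; a minimal sketch, assuming the data is stored as WGS84:

## Re-attach projection information by hand (WGS84 assumed here)
library(sp)
proj4string(myspatialdata.spdf) <- CRS("+proj=longlat +datum=WGS84")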

For a project hosted at the ETH Risk Center, a fellow PhD student and I are currently trying to extract and vectorize a large number of cartographic features from a collection of scanned maps. Though the project is still in an early phase, we were already able to do a lot of interesting stuff, and I was particularly surprised about how far we got using R (which is not known for its image processing capabilities) as our primary programming tool.

One of the steps for which R turned out to be a particularly suited option is image classification, i.e., separating the pixels representing the feature of interest (e.g. roads) from everything else. In this first post, I’ll try to demonstrate how easily one can perform some basic image classification in R. To this end, I’ll use the example of extracting pixels representing oil pipelines from a map of oil infrastructure in Iraq (which is not directly part of the project mentioned above, but still relevant to my research).

Image Classification

The problem we’re dealing with here is essentially the following: on the basis of a raster image (which is what a scanned map ultimately is), categorize each pixel into two or more classes. This is a well-known problem in remote sensing, where typically multispectral satellite imagery is processed and pixels are assigned to land-cover classes (different types of vegetation, built-up areas, etc.). Luckily, for our application here, we’re facing a very simple variant of this: our base material (the scanned map) is already highly standardized (satellite imagery usually doesn’t come with a legend…), and the information we use for classification is only three-dimensional (the three color bands making up the RGB image).

Image classification procedures usually fall into one of two categories. In supervised classification, we first fit a model to manually pre-classified training data (i.e., the model is initially provided the “right” answers, thus “supervised”), which we then use to classify all other pixels in the image. In unsupervised classification, data is clustered into categories with little or no prior knowledge about what the latter should look like. For our pipeline problem, we are clearly going for supervised classification, since the categories of interest are perfectly well defined from the outset (pipeline-pixels versus the rest), and, as we will see below, creating training data to fit the model is trivial.

Support Vector Machines

The remote sensing literature suggests a wide array of well-established methods for performing supervised classification, but in this post I’ll demonstrate one of the more recent alternatives. Specifically, I will use support vector machines (SVMs) for classification. SVMs are linear classifiers with very desirable properties, and are apparently one of the hottest topics in machine learning. Admittedly, I only have basic knowledge of the method, and I’m still not fully comfortable with the underlying math, but the intuition behind SVMs is essentially to place a separating hyperplane into the feature space such that there is a maximal margin between the hyperplane and the data points of either class. Additionally, there’s some kernel magic going on that allows for non-linear classification, but I’m not competent enough to discuss the details. If you’re interested in the topic, I recommend the relevant chapter in Hastie et al.’s machine learning bible (see http://stanford.io/hmBvL), which features a thorough introduction to SVMs, and is particularly readable if you have a background in econometrics.
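For completeness, the standard soft-margin formulation that this intuition corresponds to is:

$$\min_{w,\,b,\,\xi}\;\frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\left(w^\top \phi(x_i)+b\right) \ge 1-\xi_i,\qquad \xi_i \ge 0,$$

where $\phi$ is the feature map implied by the kernel and $C$ controls the trade-off between a wide margin and misclassified training points.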

Implementation in R

So let’s try and implement this in R. The image to be classified is the following map of Iraqi oil infrastructure in 2003, which I obtained from the Perry-Castaneda library map collection (http://bit.ly/xlvSN), and was apparently first published in the 2003 CIA country profile of Iraq:

Iraqi oil infrastructure in 2003

Quite obviously, for this image in particular, simply vectorizing the pipelines by hand would be much faster than relying on an automated approach.  For maps with higher resolution and more detailed features, however, automation pays off quickly.

Loading the scanned map (which is in JPEG format) and getting a look at it is easily achieved using the raster package:

# Load and plot the scanned map
require(raster)
iraq.img <- brick("iraq_oil_2003.jpg")
plotRGB(iraq.img)

Further, apart from the original image, we need some pre-classified samples of pipeline pixels and background pixels for the initial fitting of the SVM. In order to obtain these, I simply opened the original map in GIMP and created two JPEG images, one containing some randomly selected pipeline sections, and one containing some patches of background. These, too, can be loaded into R using the raster package:

# Load images with the (pre-classified) training data
training.ppl.img <- brick("iraq_pipelines_train.jpg")
training.noppl.img <- brick("iraq_nopipelines_train.jpg")

And this is what the training data looks like:

Raster images with training data

In the next step, we transform these pixel images into properly formatted data frames so that they can be used as training data. For this purpose, I get the three-dimensional pixel values with the getValues function from the raster package, pack them into a data frame, and remove all rows representing white background pixels. Furthermore, we add a classification variable “pipeline” (converted to a factor further below):

# Put training data into data frame
training.ppl.df <- data.frame(getValues(training.ppl.img))
names(training.ppl.df) <- c("r", "g", "b")
# Remove white background pixels
training.ppl.df <- training.ppl.df[(training.ppl.df$r < 254 & training.ppl.df$g < 254 & training.ppl.df$b < 254),]
# Create new variable indicating pipeline pixels
training.ppl.df$pipeline <- "P"
# Do the same for the non-pipeline image
training.noppl.df <- data.frame(getValues(training.noppl.img))
names(training.noppl.df) <- c("r", "g", "b")
training.noppl.df <- training.noppl.df[(training.noppl.df$r < 254 & training.noppl.df$g < 254 & training.noppl.df$b < 254),]
# Label non-pipeline pixels (any label other than "P" will do)
training.noppl.df$pipeline <- "N"

At this point, it is useful to have a look at the newly created data frames with the training data. It turns out that while we have only 120 classified pipeline pixels, we have over 20’000 background pixels. This isn’t a problem per se, but for the sake of computational performance, I will only perform model fitting using a random sample of 5’000 background pixels (which does not seem to affect the quality of the end result):

# Combine data frames and subset only 5000 random values from the non-pipeline training data
training.df <- rbind(training.ppl.df, training.noppl.df[sample(nrow(training.noppl.df), 5000),])
# Turn classification variable into factor
training.df$pipeline <- as.factor(training.df$pipeline)

Now, before we fit an SVM to all the training data, we might want to have some indication about how well image classification is going to work with this setup. For this purpose, I divide the training data into a training subset and a testing subset, fit an SVM to the training subset, and calculate the model’s predictive accuracy using out-of-sample predictions on the testing subset. To fit the SVM, I rely on the e1071 package, which is said to feature the fastest SVM implementation for R.  I use the default settings (which use radial-basis kernels to allow for non-linear classification), and perform a grid-search for the gamma and cost parameters in order to pick the best model by minimizing classification error via 10-fold cross-validation. As mentioned before, we classify purely on the basis of pixel color, hence using the RGB bands as predictor variables:

# Divide training data into a train-subset and a test-subset
train <- sample(nrow(training.df), round((nrow(training.df) - 1) / 2, 0))
test <- c(1:nrow(training.df))[!(c(1:nrow(training.df)) %in% train)]
trainset.df <- training.df[train,]
testset.df <- training.df[test,]
# Fit best SVM using tuning
require(e1071)
svm.fit <- best.svm(pipeline~., data = trainset.df, gamma = 10^(-6:-1), cost = 10^(-1:1))
# Fit predictions and print error matrix
svm.pred <- predict(svm.fit, testset.df[,1:3])
svm.tab <- table(pred = svm.pred, true = testset.df[,4])
print(svm.tab)

Out-of-sample error matrix

The resulting error matrix (which may vary due to the random sampling above) shows that the SVM does not perform perfectly, but acceptably: there are no false positives, but a fair number of false negatives, indicating that the model is rather conservative in coding pipelines.
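To put a number on this, one can compute precision and recall for the pipeline class directly from the error matrix above (my own addition; recall that rows of svm.tab are predictions and columns are the true labels):

## Precision and recall for the pipeline class ("P")
tp <- svm.tab["P", "P"]
fp <- sum(svm.tab["P", ]) - tp   # background pixels predicted as pipeline
fn <- sum(svm.tab[, "P"]) - tp   # pipeline pixels predicted as background
c(precision = tp / (tp + fp), recall = tp / (tp + fn))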

So let’s get to the real thing: First, we fit an SVM to the entire training data set, and then we use the fitted SVM for predicting the class of each pixel in the original map. In order to visualize the result, we create a copy of the raster object containing the original map, and assign zeros and ones (instead of the original RGB bands) to the pixels on the basis of the pipeline/background predictions obtained by the SVM:

# Fit tuned SVM to entire training set
svm.fit <- best.svm(pipeline~., data = training.df, gamma = 10^(-6:-1), cost = 10^(-1:1))
# Prepare Iraq map for predictions
iraq.df <- data.frame(getValues(iraq.img))
names(iraq.df) <- c("r", "g", "b")
# Assign predicted values to target map
iraq.pred <- predict(svm.fit, iraq.df)
iraq.class <- ifelse(iraq.pred == "P", 1, 0)
classified.img <- iraq.img[[1]]
values(classified.img) <- iraq.class

Finally, I plot the classified image using the image function (from the graphics package), and slightly change the color settings in order to optimize visibility of the pipeline pixels. I use image instead of the plot function from the raster package for aesthetic reasons (see http://bit.ly/Y23odn):
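The exact plotting call is not included above; a minimal version might look like this (the color choice is my own assumption):

# Plot the binary classification: background in white, pipeline pixels in dark green
image(classified.img, col=c("white", "darkgreen"))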

Classified Iraqi pipelines

The result surely isn’t perfect, but given the difficulty of the problem (classifying green pipelines in a map where essentially everything is green), it’s more than acceptable. Though the SVM appears to systematically underestimate the number of pipeline pixels, as already indicated by the non-trivial number of false negatives in the error matrix generated above, the pipelines are clearly visible, and further image processing should very well be possible.