Blogs

Markov Clustering and Voronoi

Kenny Darrell

May 25, 2015

Introduction

I have used the Capital Bikeshare data to investigate a few ideas in the past and it has worked out quite well. It is a pretty useful set of data for exploring algorithms and ideas. One thing I have been interested in lately is how to cluster sequences. What I mean by this is: how do I take something that moves from one state or place to another over some range of time and find groups of locations that entities oscillate between frequently?

I found an interesting package that does this using Markov chains, which I was already thinking would be an interesting approach. In the process of googling around for info to get started, I found that a package has already done this. The package is called MCL and you can find more info about it here.

So let's dive in. Read the file and keep the relevant data. I really like how easy dplyr makes it to clean data. It is fast, minimal, and intuitive.

options(stringsAsFactors = FALSE)
library(dplyr)
library(lubridate)
library(ggplot2)
library(MCL) 
library(xml2)

'2014-Q2-Trips-History-Data.csv' %>% 
  read.csv %>%
  select(sdate = Start.date, edate = End.date, 
         sterm = Start.terminal, eterm = End.terminal, 
         bike  = Bike., type  = Subscriber.Type) %>% 
  filter(type == 'Registered') -> bike 

In order to run this algorithm we need to convert this data into a different form. We currently have it in a transactional form.

head(bike)
##             sdate         edate sterm eterm   bike       type
## 1 6/30/2014 23:59 7/1/2014 0:12 31254 31623 W21070 Registered
## 2 6/30/2014 23:56 7/1/2014 0:00 31264 31232 W20474 Registered
## 3 6/30/2014 23:56 7/1/2014 0:04 31201 31509 W20754 Registered
## 4 6/30/2014 23:55 7/1/2014 0:16 31111 31631 W21057 Registered
## 5 6/30/2014 23:55 7/1/2014 0:01 31233 31203 W21192 Registered
## 6 6/30/2014 23:54 7/1/2014 0:03 31227 31268 W20220 Registered

You can see we have when and where the trip started and ended. We also have which bike it was and the type of account the user has. The first thing we need to do is get a list of the terminals in the data set.

c(bike$sterm, bike$eterm) %>% unique %>% na.omit -> term

With this list of terminals and the trip data we need to create a transition matrix. This means we need a square matrix with each terminal as a row and column. The simplest way to do this, though not the most efficient, is to initialize a matrix of all zeros and loop through each terminal, adding the count for each destination given that starting terminal. This will take a minute.

# This initializes the transition matrix
t_mat <- matrix(0, ncol = length(term), nrow = length(term))

# Loop to get count of all station to station trips.
for (i in 1:length(term)) {
  # Subset local dataset for speed.
  tmp <- bike[bike$sterm == term[i], ]
  t_mat[i, ] <- sapply(term, function(x) sum(tmp$eterm == x, na.rm = T))
}
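As an aside, the same counts can be computed without the loop using base R's table(); here is a minimal sketch on made-up toy trips (the station IDs below are invented for illustration), using factor() with a shared level set so the result stays square even when a terminal only ever appears on one side.

```r
# Toy trip data: start and end terminals for five trips (invented IDs)
sterm <- c(31000, 31000, 31001, 31002, 31000)
eterm <- c(31001, 31001, 31000, 31000, 31002)

# table() cross-tabulates start vs end in one call; sharing the
# level set between both factors keeps the matrix square
levs  <- sort(unique(c(sterm, eterm)))
t_mat <- table(factor(sterm, levels = levs),
               factor(eterm, levels = levs))

t_mat
```

On the real data this replaces the whole loop with a single call, at the cost of being a little less transparent about what is being counted.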

Now we are ready to use the MCL package.

clus <- mcl(t_mat, addLoops = FALSE)

We get three things from this function. We get the number of clusters it creates, which makes it much more useful than something like k-means clustering, where we need some idea beforehand as to how many clusters there should be in the data. We also get the number of iterations it took to converge and the cluster to which each terminal belongs.
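Under the hood, MCL alternates two steps on a column-stochastic matrix: expansion (a matrix power, which spreads flow along paths) and inflation (an element-wise power followed by re-normalization, which sharpens strong connections and weakens weak ones). Here is a minimal base-R sketch of one iteration on a made-up block matrix; this is an illustration of the idea, not the package's actual internals:

```r
# Toy adjacency: two pairs of strongly connected nodes
adj <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 1, 1,
                0, 0, 1, 1), nrow = 4)

# Make each column sum to 1 (column-stochastic)
normalize <- function(m) sweep(m, 2, colSums(m), "/")

expand  <- function(m)    m %*% m             # expansion: matrix squaring
inflate <- function(m, r) normalize(m ^ r)    # inflation with parameter r

m  <- normalize(adj)
m2 <- inflate(expand(m), 2)
round(m2, 2)
```

Iterating these two steps until the matrix stops changing leaves flow concentrated within clusters, which is where the Cluster assignment above comes from.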

clus
## $K
## [1] 17
## 
## $n.iterations
## [1] 23
## 
## $Cluster
##   [1]  4  1  4  4  4  4  4  1  1  4  4  4  4  1  1  1  0  4  1  1  4  4  3
##  [24]  1  4  4  7  4  4  4  4  4  4  1  4  4  9  7  7  4  1  4  1  1  1  4
##  [47]  4  4  4  4  1  7  4  4  4  3  4  4  1  4  6  4  4  4  4  4  4  1  1
##  [70]  5  7  1  1  1  1  1  6  1  7  4  1  7  4  4  9  4  4  4  1  4  4  4
##  [93]  9  4  1  9  1 12  1  4  3  3  1  1  7  4 12  4  1  1  3  4  1  4  1
## [116]  4  1  1  1  4  1  9  4 11  1 12  1  1  4  1  1  1  1  9  4  8  3 11
## [139]  1  7  1  1  4  9  4  8  3  4  9  1  1  3  1  1 10  4  4  7  4  1  9
## [162]  1  1  1 10  4  1 10  6  1  1  7  9 12  9  4  7  6  4  4  7  4  4  1
## [185]  4 10  9  1  3  1 11 10  3  9  4  5  4  9  9  3  1 10  1  5  4  7  5
## [208]  3  1  3  4 11  6 11  1  1  7  3  6  5  9  7 10 10  9  3 10  3 15  1
## [231]  6  1  4 13 14  5  4 10  7  5 12  3  6 14  3 14  7 11 14  7  7 12 11
## [254]  5  3  6  8 12  8  3  6  5  7  1 13  7 12  6  1  1  3 14  6  7 14 12
## [277] 14  1  5  7  7  5  3 12  1  3 12 11 17 14 10  6  6  6  5 15 16 13  1
## [300]  3  7  6  5  7 16 12 14  7  6 14 17 14  1 17 15 17 14 14 14 14  0

Let's add this new piece of information to the terminals.

station <- data.frame(terminalName = as.character(term), 
                      clus_id = clus$Cluster)

We are also quite lucky here for a few reasons. Many things with this type of sequence, where entities bounce around in a manner akin to a Markov chain, are hard to plot. There are a few ways, but mostly I have seen it done by coloring the nodes of the Markov chain itself. Since these are locations, we can plot them on a map. We are also lucky that there is a source that will give us the coordinates needed to plot each terminal on a map. Here is a function that gets the live data from each terminal, including how many bikes are at each among other useful info, but for now we just need the latitude and longitude of each terminal.

get_locs <- function() {
  # Download the live station feed and pull out each station node
  'http://www.capitalbikeshare.com/data/stations/bikeStations.xml' %>%
    read_xml %>%
    xml_find_all('.//station') -> stations
  
  # Functional sequence: given a tag name, extract that field from
  # every station node as text
  . %>% paste0('.//', .) %>% xml_find_all(stations, .) %>% xml_text -> get_vals
  
  # Pull the fields we care about, drop the test station, and
  # coerce the coordinates to numeric
  c('terminalName', 'lat', 'long', 'nbBikes', 
    'nbEmptyDocks', 'lastCommWithServer') %>%
    sapply(get_vals) %>% 
    data.frame %>% 
    filter(terminalName != "8D OPS test") %>%
    mutate(lat = as.numeric(lat), long = as.numeric(long))
}

We can join this data with our above cluster data in order to plot it.

get_locs() %>% 
  select(terminalName, lat, long) %>%
  inner_join(station, by = 'terminalName') %>%
  na.omit() -> stations 

This stations data set is in a pretty tidy format now and is ready to be visualized.

head(stations)
##   terminalName      lat      long clus_id
## 1        31000 38.85610 -77.05120       7
## 2        31001 38.85725 -77.05332       7
## 3        31002 38.85640 -77.04920       7
## 4        31003 38.86017 -77.04959       7
## 5        31004 38.85787 -77.05949       7
## 6        31005 38.86230 -77.05994       7

We can use the geom_point function and color by cluster.

ggplot(stations, aes(long, lat)) + geom_point(aes(colour = factor(clus_id)))

From a qualitative perspective this looks pretty good. Regions that are further away from each other tend to create their own clusters. You can also see some of the roads that people travel on. I think a cool source of data to add to this would be metro locations. Are clusters centered around these? Do people get off the metro and bike to their jobs?

This data has another feature that I thought would be interesting to exploit. The main data file covers a three-month period in 2014, while the location data comes from a live source that includes terminals that did not exist then. We thus have new stations.

# Which ones have been added over the last year.
geo <- get_locs()[, 1:5]
geo[!geo$terminalName %in% stations$terminalName, ]
##     terminalName      lat      long nbBikes nbEmptyDocks
## 321        31314 38.90358 -77.06779       8            9
## 322        31081 38.80438 -77.06087       8            7
## 323        31079 38.89494 -77.09169       6            9
## 324        31078 38.86944 -77.10450       5            5
## 325        31274 38.89840 -77.02428      14            8
## 326        31080 38.89761 -77.08085       6            8
## 327        31275 38.90176 -77.05108       3           16
## 328        31082 38.80111 -77.06895       9            5
## 329        31083 38.82175 -77.04749       9            2
## 330        31084 38.80268 -77.06356       5           14
## 331        31085 38.82006 -77.05762       4            7
## 332        31086 38.82595 -77.05854       9            6
## 333        31087 38.82093 -77.05310       9            5
## 334        31088 38.83308 -77.05982       5           10
## 335        31089 38.89061 -77.08480       8           13
## 336        32049 38.98674 -77.00003       8            7
## 337        31090 38.86470 -77.04867       7            7
## 338        31315 38.96497 -77.07595       9           10
## 339        31276 38.90381 -77.03493       0           22
## 340        31277 38.89841 -77.03962      11           19
## 341        32050 38.99765 -77.03450       5           10
## 342        31278 38.91265 -77.04183       6           12
## 343        31091 38.85925 -77.06328       9            5
## 344        31517 38.90801 -76.99698       9           14
## 345        31092 38.89593 -77.10525       5           10
## 346        31093 38.89898 -77.07832       5           10
## 347        31094 38.89538 -77.09713      10            5
## 348        31279 38.89841 -77.04318      10           13
## 349        31095 38.86247 -77.06824      10            5
## 350        31280 38.91376 -77.02702       7           16
new <- merge(geo, station, all = T)
new$new <- ifelse(is.na(new$clus_id), 1, 0)

Here we can get a look at where the new stations were added.

ggplot(new, aes(long, lat)) + geom_point(aes(colour = factor(new)))

Mike Bostock wrote a great post that gave me plenty of inspiration for this. Could we see how effectively the new stations are placed? We can create a Voronoi diagram from this and see if the spread of the tile areas decreases with the new stations. I found two different packages that do this in R, so I figured I would try both of them out.

library(deldir)
library(tripack)

par(mfrow = c(1, 2))

stations %>% select(x = long, y = lat) -> before
get_locs() %>% select(x = long, y = lat) -> after

plot(tile.list(deldir(before$x, before$y)), close = TRUE)
plot(tile.list(deldir(after$x, after$y)), close = TRUE)
plot(voronoi.mosaic(before$x, before$y, duplicate = "error"))
plot(voronoi.mosaic(after$x, after$y, duplicate = "error"))

We can only see slight differences in both of them. After seeing these I have two thoughts.

The first is that this approach may not be entirely valid, because stations really should not be uniformly distributed; there are places with greater demand, even if they already have a station fairly close. The second is that both of these plots look pretty rough. Even though it is not needed, this is a plot that I think would work much better in d3.

To do this I need to write the data out to a csv file.

norm <- function(v) (v - min(v)) / (max(v) - min(v))
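As a quick sanity check of the rescaling (repeating the function definition so the snippet stands alone), min-max normalization maps the smallest value to 0 and the largest to 1:

```r
# Min-max rescaling to the unit interval [0, 1]
norm <- function(v) (v - min(v)) / (max(v) - min(v))

norm(c(2, 4, 6))
# → 0.0 0.5 1.0
```

Normalizing both coordinates this way makes it easy for a d3 scale to map them onto pixels.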

tile.list(deldir(before$x, before$y)) %>%
  lapply(function(x) data.frame(x = x$x, y = x$y)) %>%
  bind_rows %>%
  unique %>%
  mutate(x = norm(x), y = norm(y)) %>%
  write.csv(file = 'before.csv', row.names = F)

Each region encloses the area in which any point is closer to that region's station than to any other. Regions that are larger are therefore areas with fewer stations nearby. If you move the mouse around you can see how much placing a station at that point would impact the surrounding areas.
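To make the comparison of area distributions quantitative rather than visual, the tile areas can be computed directly: tile.list gives each tile's vertex coordinates, and the shoelace formula turns vertices into an area. A minimal base-R sketch, checked here on a made-up unit-square tile:

```r
# Shoelace formula: area of a simple polygon from its vertices,
# taken in order around the boundary
poly_area <- function(x, y) {
  n <- length(x)
  j <- c(n, 1:(n - 1))  # index of the previous vertex, wrapping around
  abs(sum(x[j] * y - x * y[j])) / 2
}

# A unit square standing in for one Voronoi tile
poly_area(c(0, 1, 1, 0), c(0, 0, 1, 1))
# → 1

# Applied to the real tiles it would look something like:
# sapply(tile.list(deldir(before$x, before$y)),
#        function(t) poly_area(t$x, t$y))
```

Comparing summary(areas_before) with summary(areas_after), or their variances, would say directly whether the new stations evened out the coverage.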

I think another method is needed to determine the best placement of new stations. One thought was that you could use the live data to see where there is the most demand, such as stations that are always empty. That may also just mean you need to add more bikes at that location. This part of the problem needs more thought, but I think the clustering worked quite well.