1 Questions

One man’s wilderness is another man’s theme park. —— Author unknown

New York City is heaven for many, but hell for some, especially those who were affected by the seven “sins”:

Safety is one of the most fundamental human needs. This report investigats 7 sins, a.k.a, felonies, in New York City occurred in the past ten years (2006-2015), and focuses on answering the following questions:

  1. Is NYC becoming a safer city over the last 10 years?

  2. Which months in one year can be considered as unsafe?

  3. Which days in a week can be considered as unsafe?

  4. Which hours in a day can be considered as unsafe?

  5. Which places are more unsafe than others?

2 Dataset

The NYPD 7 Major Felony Incidents dataset contains Seven Major Felonies that is quarterly updated at the incident level. It was made public at Dec 29, 2015, and is available here.

According to the NYPD Incident Level Data Footnotes:

The first point indicates that the number of actual incidents is larger than that in the dataset. Since we know nothing about which types of offenses are typically associated together in incidents of multiple offenses, we make no assumption on it. The second point affects the accuracy of incident locations, but at the scale of borough or city level, the inaccuracy in longitude and latitude will not have major impact on overall distribution of incidents on map, since we are more interested in the overall metric, such as total number and density of murder in Manhattan.

3 Preprocess

The dataset is 194M in size, and contains around 1.2 million rows with 22 variables.

# install.packages("ff")
library(ff)
library(dplyr)

setwd("/Users/sundeepblue/Bootcamp/projects/explore_dataset")

# since the csv has around 1.1 million lines, we use ffdf for fast loading
t = read.csv.ffdf(file="NYPD_7_Major_Felony_Incidents.csv", header=TRUE, 
                  VERBOSE=TRUE, first.rows=10000, next.rows=50000, colClasses=NA)
## read.table.ffdf 1..10000 (10000)  csv-read=0.872sec ffdf-write=0.188sec
## read.table.ffdf 10001..60000 (50000)  csv-read=5.658sec ffdf-write=0.837sec
## read.table.ffdf 60001..110000 (50000)  csv-read=4.469sec ffdf-write=1.006sec
## read.table.ffdf 110001..160000 (50000)  csv-read=4.382sec ffdf-write=1.272sec
## read.table.ffdf 160001..210000 (50000)  csv-read=4.641sec ffdf-write=2.104sec
## read.table.ffdf 210001..260000 (50000)  csv-read=4.509sec ffdf-write=1.781sec
## read.table.ffdf 260001..310000 (50000)  csv-read=4.613sec ffdf-write=1.222sec
## read.table.ffdf 310001..360000 (50000)  csv-read=4.851sec ffdf-write=3.271sec
## read.table.ffdf 360001..410000 (50000)  csv-read=4.848sec ffdf-write=2.014sec
## read.table.ffdf 410001..460000 (50000)  csv-read=5.465sec ffdf-write=1.488sec
## read.table.ffdf 460001..510000 (50000)  csv-read=4.97sec ffdf-write=2.076sec
## read.table.ffdf 510001..560000 (50000)  csv-read=4.763sec ffdf-write=2.369sec
## read.table.ffdf 560001..610000 (50000)  csv-read=5.066sec ffdf-write=3.544sec
## read.table.ffdf 610001..660000 (50000)  csv-read=5.097sec ffdf-write=2.462sec
## read.table.ffdf 660001..710000 (50000)  csv-read=4.875sec ffdf-write=1.858sec
## read.table.ffdf 710001..760000 (50000)  csv-read=6.023sec ffdf-write=2.019sec
## read.table.ffdf 760001..810000 (50000)  csv-read=5.603sec ffdf-write=3.054sec
## read.table.ffdf 810001..860000 (50000)  csv-read=4.376sec ffdf-write=2.494sec
## read.table.ffdf 860001..910000 (50000)  csv-read=4.665sec ffdf-write=4.012sec
## read.table.ffdf 910001..960000 (50000)  csv-read=5.996sec ffdf-write=3.073sec
## read.table.ffdf 960001..1010000 (50000)  csv-read=5.204sec ffdf-write=3.051sec
## read.table.ffdf 1010001..1060000 (50000)  csv-read=4.518sec ffdf-write=2.729sec
## read.table.ffdf 1060001..1110000 (50000)  csv-read=4.816sec ffdf-write=2.56sec
## read.table.ffdf 1110001..1123465 (13465)  csv-read=1.19sec ffdf-write=1.669sec
##  csv-read=111.47sec  ffdf-write=52.153sec  TOTAL=163.623sec
# convert ffdf type to data.frame for later conversion
t = as.data.frame(t)

# convert some columns with unnecesary factor type into character type
t$Identifier = as.character(t$Identifier)
t$Occurrence.Date = as.character(t$Occurrence.Date)
t$Location.1 = as.character(t$Location.1)

# we are only interested in offenses in last 10 years [2006, 2015]
t_2006_to_2015 = filter(t, Occurrence.Year >= 2006)
# ignore observations with missing value
t_2006_to_2015 = na.omit(t_2006_to_2015)

# re-level factor variable "Day.of.Week"
t_2006_to_2015$Day.of.Week = factor(t_2006_to_2015$Day.of.Week, 
                                    levels=c("Monday", "Tuesday", "Wednesday", 
                                             "Thursday", "Friday", "Saturday", "Sunday"));

# re-level factor variable "Occurrence.Month"
t_2006_to_2015$Occurrence.Month = factor(t_2006_to_2015$Occurrence.Month, 
                                  levels=c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", ""));

# define 3 helper functions for string manipulation
trim_spaces = function(x) { gsub("^\\s+|\\s+$", "", x) }
trim_first_char = function(x) { gsub('^.', '', x) }
trim_last_char = function(x) { gsub('.$', '', x) }

# parse the column "Location.1"
loc_strings = t_2006_to_2015$Location.1
L = sapply(loc_strings, strsplit, split=',')
L = sapply(L, unlist)
loc_x = sapply(L[1,], trim_first_char)
loc_y = sapply(L[2,], trim_last_char)

# add two columns "loc_x" and "loc_y"
t_2006_to_2015$loc_x = as.numeric(loc_x)
t_2006_to_2015$loc_y = as.numeric(loc_y)

# write.table(t_2006_to_2015, file="NYPD_7_FELONIES_2006_2015.csv", sep=",")

After preprocessing, we load the preprocessed dataset as follows. First we load several libraries.

library(ff)
library(ggplot2)
library(dplyr)
library(grid)
library(gridExtra)
library(leaflet)
library(ggmap)
require(cowplot)

And run below to see the detailed information about variables.

# tb = read.csv('NYPD_7_FELONIES_2006_2015.csv', header=T, sep=',',
# stringsAsFactors=F)
tb = t_2006_to_2015
str(tb)
## 'data.frame':    1117256 obs. of  22 variables:
##  $ OBJECTID              : int  258 259 260 261 262 263 264 265 266 267 ...
##  $ Identifier            : chr  "13b6949b" "c9cc0e6d" "b6c31b82" "2598f1c7" ...
##  $ Occurrence.Date       : chr  "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" "01/09/2006 12:00:00 AM" ...
##  $ Day.of.Week           : Factor w/ 7 levels "Monday","Tuesday",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Occurrence.Month      : Factor w/ 13 levels "Jan","Feb","Mar",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Occurrence.Day        : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Occurrence.Year       : int  2006 2006 2006 2006 2006 2006 2006 2006 2006 2006 ...
##  $ Occurrence.Hour       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CompStat.Month        : int  8 1 2 1 1 1 6 8 1 1 ...
##  $ CompStat.Day          : int  16 9 22 9 19 9 28 31 23 9 ...
##  $ CompStat.Year         : int  2006 2006 2006 2006 2006 2006 2006 2007 2006 2006 ...
##  $ Offense               : Factor w/ 7 levels "BURGLARY","FELONY ASSAULT",..: 3 3 3 1 1 1 3 3 4 1 ...
##  $ Offense.Classification: Factor w/ 1 level "FELONY": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sector                : Factor w/ 30 levels "","0","1","2",..: 19 16 14 12 14 21 16 19 18 17 ...
##  $ Precinct              : int  102 1 24 49 49 52 72 75 104 10 ...
##  $ Borough               : Factor w/ 7 levels "BRONX","BROOKLYN",..: 4 3 3 1 1 1 2 2 4 3 ...
##  $ Jurisdiction          : Factor w/ 25 levels "DEPT OF CORRECTIONS",..: 6 6 6 6 6 6 6 6 6 13 ...
##  $ XCoordinate           : int  1029007 979426 992387 1020746 1028121 1016306 984147 1018380 1014406 985325 ...
##  $ YCoordinate           : int  194256 199624 228450 246675 251461 257760 175275 184864 198825 215233 ...
##  $ Location.1            : chr  "(40.6997596520001, -73.8385879319999)" "(40.7146054090001, -74.017402787)" "(40.7937230380001, -73.9706144319999)" "(40.843673874, -73.868096037)" ...
##  $ loc_x                 : num  40.7 40.7 40.8 40.8 40.9 ...
##  $ loc_y                 : num  -73.8 -74 -74 -73.9 -73.8 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:17] 17334 30414 1112868 1112883 1113049 1113064 1113233 1113320 1113400 1113682 ...
##   .. ..- attr(*, "names")= chr [1:17] "17334" "30414" "1112868" "1112883" ...

4 Exploration

I focus on two perspectives to uncover unknown from this dataset: the when and the where. The “when” is related to revealing how did years, months, days of week, and hours correlate to offenses. The “where” is related to where did the incidents occur.

4.1 Analysis of the When

4.1.1 Analysis by year

t_total_count_by_year_and_offense = tb %>% group_by(Occurrence.Year, Offense) %>% 
    summarise(count = n())

ggplot(data = t_total_count_by_year_and_offense, aes(x = as.numeric(Occurrence.Year), 
    y = count)) + ggtitle("Number of offense insidents over the last 10 years (2006-2015)") + 
    xlab("Year from 2006 to 2015") + ylab("Number of incidents") + geom_point(aes(color = Offense), 
    size = 5) + geom_line(aes(color = Offense), size = 2)

We can learn from the figure that:

  • Grand larceny is the most frequent offense out of 7. The number of indidents caused by grand larceny went down in 2015.

  • There is no clear trend showing felony assult is decling.

  • The number of all the rest offenses are declining.

Overall, NYC is becoming safer than before.

4.1.2 Analysis by month

Now let us see how does the offense incidents distribute across 12 months.

ggplot(data = tb, aes(x = Occurrence.Month, color = Offense)) + geom_bar(aes(fill = Offense)) + 
    ggtitle("Number of offense insidents across 12 months of all 10 years") + 
    xlab("Month (from January to December") + ylab("Number of incidents") + 
    theme(legend.position = "bottom") + scale_x_discrete(labels = c(1, 2, 3, 
    4, 5, 6, 7, 8, 9, 10, 11, 12)) + facet_wrap(~Offense, nrow = 2)

What we can learn:

  • Suprisingly, the number of incidents in April is smallest for all offenses except murder.

  • It seems that Auguest has the second smallest overall incidents for many offenses.

4.1.3 Analysis by day of week

Now we draw the bar view of incidents per day of week.

ggplot(data = tb, aes(x = Day.of.Week, color = Offense)) + geom_bar(aes(fill = Offense)) + 
    ggtitle("Number of offense insidents across days of week in 10 years") + 
    xlab("Day of week") + ylab("Number of incidents") + theme(legend.position = "bottom") + 
    scale_x_discrete(labels = c("Mo", "Tu", "We", "Th", "Fr", "Sa", "Su")) + 
    facet_wrap(~Offense, nrow = 1)

What we find?

  • On Friday, burglary, grand larceny, larceny of motor vehicle, and robery occur most frequently. The feature is visually very perceivable.

  • Murder and rape occurs more on weekends (the figure is a little hard to differentiate)

4.1.4 Analysis by hour

Now we visualize the number of incidents by hour in day.

# This graph shows tons of insights.  TODO: add label for each subplot (add
# 'middle night', 'early morning', 'morning', 'noon', 'afternoon',
# 'evening', 'late night')
ggplot(data = tb, aes(x = Occurrence.Hour)) + geom_density(aes(color = Offense)) + 
    theme(aspect.ratio = 1) + ggtitle("The density of offense insidents across 24 hours of day in 10 years") + 
    xlab("Hour") + ylab("Density of incidents") + theme(legend.position = "bottom") + 
    facet_wrap(~Offense, nrow = 1)

But in order to get more detailed information, we can also look at the histogram.

ggplot(data = tb, aes(x = Occurrence.Hour)) + geom_histogram(aes(fill = Offense), 
    binwidth = 1) + ggtitle("The histogram of offense insidents across 24 hours of day in 10 years") + 
    xlab("Hour") + ylab("Number of offenses") + theme(legend.position = "bottom") + 
    facet_wrap(~Offense, nrow = 1)

It shows that:

  • Burgalary happens most often during 7am - 8am, and least often on 5am.

  • Feloney assult occurrs most often during middle night, and least often during 5am-6am.

  • Grand larceny occurs most often at 12pm - 3pm, and least often on 5am.

  • Larceny of motor vehicle occurs most often at middle night, and least often on 5am.

  • Rape occurs most often at middle night, and least often in the morning.

  • Robbery occurs most often in the afternoon, especially at 3pm, and least often at 7am

  • Murder occurs most often at 11pm and 3am, and least often at 8am.

4.2 Analysis of the Where

According to the footnote pdf file mentioned above, as for this dataset:

  • Crimes occurring anywhere other than at an intersection are represented by a midpoint X coordinate and a midpoint Y coordinate (center of block)

  • Rape offenses are geo-coded as occurring at the police station house within the precinct of occurrence

  • Offenses that lack an X coordinate and Y coordinate are geo-coded as occurring at the police station house within the precinct of occurrence.

  • Offenses occurring in open areas such as parks or beaches may be geo-coded as occurring on streets or intersections bordering the area.

4.2.1 Analysis by borough in 2015

Now we analysis the offense by borough in 2015. Since we will use the library “ggmap” to visualize map, we need to first add two columns: ‘lon’ and ‘lat’:

tb = tb %>% mutate(lon = loc_y, lat = loc_x)

And the graph is shown below.

ggplot(data = (tb %>% filter(Occurrence.Year == 2015)), aes(x = Offense)) + 
    geom_bar(aes(fill = Offense)) + ggtitle("The histogram of offense insidents by borough") + 
    xlab("Offenses") + ylab("Number of offenses") + theme(legend.position = "bottom") + 
    scale_x_discrete(labels = c("Bu", "FA", "GL", "GM", "RP", "RO", "M")) + 
    facet_wrap(~Borough, nrow = 1)

We can see that in 2015:

  • Manhattan has the most number of grand larceny.

  • Brooklyn has the most number of burglary, felony assault, grand larceny of motor vehicle, robbery, and rape.

4.2.2 Interative map

I provide an interactive map for better visualization.

t_MANHATTAN_2015 = tb %>% filter(Borough=="MANHATTAN", Occurrence.Year=="2015")

leaflet(tb, height=300, width=900) %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addMarkers(lng=t_MANHATTAN_2015$loc_y, lat=t_MANHATTAN_2015$loc_x,
             clusterOptions = markerClusterOptions())

4.2.3 Detailed offenses by borough in 2015

Let’s first define a function to draw density map of offenses for different boroughs.

draw_density_map_of_offenses_for_borough = function(area_map_obj, tb) {
    ggmap(area_map_obj, base_layer = ggplot(aes(x = lon, y = lat), data = tb)) + 
        stat_density2d(aes(x = lon, y = lat), data = tb, color = "blue", alpha = 0.5) + 
        stat_density2d(aes(x = lon, y = lat, fill = ..level.., alpha = ..level..), 
            bins = 7, geom = "polygon", alpha = 0.3, contour = T, data = tb) + 
        scale_fill_gradient(low = "green", high = "red") + facet_wrap(~Offense, 
        nrow = 2)
}

Now the density map for Manhattan is plotted below.

area_map_obj = get_map("MANHATTAN NYC", zoom = 13, maptype = "toner-background", 
    source = "stamen")
t_manhattan = tb %>% filter(Borough == "MANHATTAN", Occurrence.Year == "2015") %>% 
    mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_manhattan)

The density map for Queens:

area_map_obj = get_map("QUEENS NYC", zoom = 11, maptype = "toner-background", 
    source = "stamen")
t_queens = tb %>% filter(Borough == "QUEENS", Occurrence.Year == "2015") %>% 
    mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_queens)

The density map for Bronx:

area_map_obj = get_map("BRONX NYC", zoom = 12, maptype = "toner-background", 
    source = "stamen")
t_bronx = tb %>% filter(Borough == "BRONX", Occurrence.Year == "2015") %>% mutate(lon = loc_y, 
    lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_bronx)

The density map for Brooklyn:

area_map_obj = get_map("Brooklyn NYC", zoom = 12, maptype = "toner-background", 
    source = "stamen")
t_brooklyn = tb %>% filter(Borough == "BROOKLYN", Occurrence.Year == "2015") %>% 
    mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_brooklyn)

The density map for Staten Island:

area_map_obj = get_map("Staten Island NYC", zoom = 12, maptype = "toner-background", 
    source = "stamen")
t_staten_island = tb %>% filter(Borough == "STATEN ISLAND", Occurrence.Year == 
    "2015") %>% mutate(lon = loc_y, lat = loc_x)
draw_density_map_of_offenses_for_borough(area_map_obj, t_staten_island)

5 Insights

Now the answers of the proposed questions are as follows.

Q1. Is NYC becoming a safer city over the last 10 years?

Anser: Yes. New York City is becoming safer than before.

Q2. Which months in one year can be considered as unsafe?

Answer: All 7 felony offenses occur on all months of a year, but it seems that April and August are the two least “unsafe” months.

Q3. Which days in a week can be considered as unsafe?

Answer: Oveall, Friday is the unsafest weekday since 5 out 7 felonies occur most often on Friday. On weekend, felony assult and murder occur more often than other felonies.

Q4. Which hours in a day can be considered as “unsafe”?

Answer: Depending on the felony type, many hours within a day are unsafe.

Fortunately, 5am seems to be the only hour within a day that is the least unsafe.

Q5. According to historical data, which borough are more unsafe than others?

Answer: Brooklyn