Thursday, October 25, 2012

Using R to Analyze Hadoop Jobs

Using summarized data from the Indeed job board, I tried my hand at some R graphs. Considering the hot nature of Big Data, I chose job openings in the United States for Hadoop.

Here is a summary of the top jobs from Indeed:

City                     State           JobPostings
San Francisco, CA        California      568
New York, NY             New York        402
Seattle, WA              Washington      179
Sunnyvale, CA            California      174
Palo Alto, CA            California      158
San Jose, CA             California      134
Mountain View, CA        California      124
Boston, MA               Massachusetts   119
Annapolis Junction, MD   Maryland        119
Reston, VA               Virginia        107
San Mateo, CA            California      105
Chicago, IL              Illinois        95
Redwood City, CA         California      94
Los Angeles, CA          California      89



It appears that Indeed didn't give me a complete summary of all of the Hadoop jobs, just the top fourteen cities. Oh well, let's look at those.

To visualize the data, I played with a variety of R plot options, but ultimately settled on a Cleveland dot plot.

The graph shows that eight California cities posted a large number of jobs asking for Hadoop experience (Indeed summarized the cities with at least 89 postings). Of those eight, the largest volume of opportunities, almost 600 postings, was in San Francisco (not labeled on the graph, but you can spot it easily in the original data table).
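The per-state counts behind that observation can be checked directly in R. This is a minimal sketch, not part of the original workflow: it rebuilds the table from this post as a data frame (the name `jobs` is just an illustrative choice) and uses base R's `table()` and `tapply()` to count cities and total the postings per state.

```r
# Rebuild the table from this post as a data frame.
# Values are copied from the Indeed summary above; "jobs" is an illustrative name.
jobs <- data.frame(
  City = c("San Francisco, CA", "New York, NY", "Seattle, WA", "Sunnyvale, CA",
           "Palo Alto, CA", "San Jose, CA", "Mountain View, CA", "Boston, MA",
           "Annapolis Junction, MD", "Reston, VA", "San Mateo, CA", "Chicago, IL",
           "Redwood City, CA", "Los Angeles, CA"),
  State = c("California", "New York", "Washington", "California", "California",
            "California", "California", "Massachusetts", "Maryland", "Virginia",
            "California", "Illinois", "California", "California"),
  JobPostings = c(568, 402, 179, 174, 158, 134, 124, 119, 119, 107, 105, 95, 94, 89),
  stringsAsFactors = FALSE
)

# How many of the listed cities fall in each state
cities_per_state <- table(jobs$State)

# Total postings per state, largest first
postings_per_state <- sort(tapply(jobs$JobPostings, jobs$State, sum),
                           decreasing = TRUE)

print(cities_per_state)
print(postings_per_state)
```

Printing `cities_per_state` should show California with eight entries and every other state with one.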

Among these top US locations, California is the only state with more than one city on the list; every other state appears just once. As you might guess, those single entries include happening places such as NYC, Boston, and Seattle.

The largest clusters of Hadoop jobs were in San Francisco and New York City. Behind them were Seattle and California tech hot spots such as Sunnyvale, Palo Alto, San Jose, and Mountain View.

Working with R is slightly different from working with other programming languages. Instead of writing a program that you run once to get results, you interact with R inside a workspace and examine the results as you go along.

To produce this graph, I first created the tab-delimited file of Indeed job postings you saw above. Then, I had to load that data into the R workspace's memory. Here are the commands for that:

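# Point R at the folder that holds the data file
setwd("C:/Users/Doug/My Documents/RLibrary/")
# City names contain spaces, so name the tab separator explicitly
HadoopJobs <- read.table("HadoopJobs2012Oct.txt", header=TRUE, sep="\t")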


The first command sets my R working directory. The second creates an object called "HadoopJobs" in memory, which now contains the job posting counts. With that done, I just needed to produce the dot plot graph (showing job posting counts grouped by US state) and put a title on top:

# groups must be a factor; factor() covers R versions that read State as plain text
dotchart(HadoopJobs$JobPostings, groups=factor(HadoopJobs$State))
title("Hadoop Jobs by State (2012 Oct)")
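Since the chart above leaves the individual cities unlabeled, here is a hedged sketch of a labeled variant. It is not the chart from this post: it assumes the same data rebuilt inline as a data frame named `jobs` (an illustrative name), and relies on `dotchart()`'s standard `labels`, `groups`, `xlab`, and `main` arguments.

```r
# Same data as the table in this post, rebuilt inline so the sketch is self-contained.
jobs <- data.frame(
  City = c("San Francisco, CA", "New York, NY", "Seattle, WA", "Sunnyvale, CA",
           "Palo Alto, CA", "San Jose, CA", "Mountain View, CA", "Boston, MA",
           "Annapolis Junction, MD", "Reston, VA", "San Mateo, CA", "Chicago, IL",
           "Redwood City, CA", "Los Angeles, CA"),
  State = c("California", "New York", "Washington", "California", "California",
            "California", "California", "Massachusetts", "Maryland", "Virginia",
            "California", "Illinois", "California", "California"),
  JobPostings = c(568, 402, 179, 174, 158, 134, 124, 119, 119, 107, 105, 95, 94, 89),
  stringsAsFactors = FALSE
)

# Order rows by posting count so points within a state appear in sorted order
jobs <- jobs[order(jobs$JobPostings), ]

# labels= puts the city name next to each dot; groups= must be a factor
dotchart(jobs$JobPostings,
         labels = jobs$City,
         groups = factor(jobs$State),
         xlab = "Job postings",
         main = "Hadoop Jobs by State (2012 Oct)")
```

With fourteen labels the default text size may feel crowded; passing a smaller `cex` value to `dotchart()` is one way to tighten it.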


I find it impressive that R can do all of this work in just four simple statements. In the interest of full disclosure, I did have to add a couple more. The statements I just showed you draw the graph on the screen for me to see; to save it to a JPEG file so that you could view it too, I had to reissue the graph commands sandwiched between the following two R functions:

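jpeg(filename="HadoopJobs.jpg")   # open a JPEG file as the graphics device
dotchart(HadoopJobs$JobPostings, groups=factor(HadoopJobs$State))
title("Hadoop Jobs by State (2012 Oct)")
dev.off()                         # close the device so the file is written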
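To try the loading step without first creating a file on disk, the tab-delimited text can be passed in directly. This is a minimal sketch under a few assumptions: only the first three rows of the table are included, the names `raw` and `jobs` are illustrative, and `read.delim()` is used because it defaults to `header = TRUE` and `sep = "\t"`, which keeps city names with spaces in a single column.

```r
# A few rows from the table in this post, with columns separated by tabs (\t).
raw <- "City\tState\tJobPostings
San Francisco, CA\tCalifornia\t568
New York, NY\tNew York\t402
Seattle, WA\tWashington\t179"

# read.delim() defaults to header = TRUE and sep = "\t";
# text= supplies the data as a string instead of a file name.
jobs <- read.delim(text = raw, stringsAsFactors = FALSE)

str(jobs)    # inspect column names and types
nrow(jobs)   # number of rows read
```

Swapping `text = raw` for a file name reads the same layout from disk.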


If you don't have a copy of R, be sure to download a free open-source copy at: http://www.r-project.org.

We may have to wait a while before demand for Big Data file repositories comes to Midwestern cities like Cincinnati, Ohio (in case you are interested, there are six Hadoop job postings here in town). 
