Thursday, October 25, 2012

Using R to Analyze Hadoop Jobs

Using summarized data from the Indeed job board, I tried my hand at some R graphs. Considering the hot nature of Big Data, I chose job openings in the United States for Hadoop.

Here is a summary of the top jobs from Indeed:

City                      State            JobPostings
San Francisco, CA         California       568
New York, NY              New York         402
Seattle, WA               Washington       179
Sunnyvale, CA             California       174
Palo Alto, CA             California       158
San Jose, CA              California       134
Mountain View, CA         California       124
Boston, MA                Massachusetts    119
Annapolis Junction, MD    Maryland         119
Reston, VA                Virginia         107
San Mateo, CA             California       105
Chicago, IL               Illinois         95
Redwood City, CA          California       94
Los Angeles, CA           California       89

It appears that Indeed didn't give me a complete summary of all of the Hadoop jobs, just the top fourteen cities. Oh well, let's look at those.

To visualize the data, I played with a variety of R plot options, but ultimately stopped with a graph using the Cleveland Dot Plot.

The graph shows eight California cities posting a large number of jobs for Hadoop experience (Indeed's summary only included cities with at least 89 postings). Of those eight, the largest volume of opportunities, almost 600 postings, was in San Francisco (the city is not labeled on the graph, but you can spot it easily in the original data table).

Within these top US locations, California was the only state with more than one city on the list. In every other state, the Hadoop opportunities were concentrated in a single city. As you might guess, these are happening places such as NYC, Boston, and Seattle.

The largest clusters of Hadoop jobs were in San Francisco and New York City. Behind them came Seattle and the California tech hot spots such as Sunnyvale, Palo Alto, San Jose, and Mountain View.

Working with R is slightly different from working with other programming languages. Instead of writing a program that you run once to get results, in R you work interactively within a workspace and examine the results as you go along.
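As a small illustration of that interactive style, here is a sketch that types the Indeed table straight into the workspace and then examines it one step at a time. The object names `rows`, `jobs`, and `by_state`, and the inline `text=` argument, are just for illustration; they are not part of the commands I ran for the graph.

```r
# Illustrative only: the 14 rows typed inline (tab-separated), so no file is needed.
rows <- c(
  "City\tState\tJobPostings",
  "San Francisco, CA\tCalifornia\t568",
  "New York, NY\tNew York\t402",
  "Seattle, WA\tWashington\t179",
  "Sunnyvale, CA\tCalifornia\t174",
  "Palo Alto, CA\tCalifornia\t158",
  "San Jose, CA\tCalifornia\t134",
  "Mountain View, CA\tCalifornia\t124",
  "Boston, MA\tMassachusetts\t119",
  "Annapolis Junction, MD\tMaryland\t119",
  "Reston, VA\tVirginia\t107",
  "San Mateo, CA\tCalifornia\t105",
  "Chicago, IL\tIllinois\t95",
  "Redwood City, CA\tCalifornia\t94",
  "Los Angeles, CA\tCalifornia\t89"
)
jobs <- read.table(text = rows, header = TRUE, sep = "\t", quote = "",
                   stringsAsFactors = FALSE)

# Step 1: look at the structure of what was just loaded.
str(jobs)

# Step 2: count how many listed cities each state has.
table(jobs$State)

# Step 3: total the postings per state and show the largest first.
by_state <- aggregate(JobPostings ~ State, data = jobs, FUN = sum)
by_state[order(-by_state$JobPostings), ]
```

Each statement prints its result immediately, so you can check a claim (such as California accounting for eight of the fourteen cities) before moving on to the next step.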

To produce this graph, I first created the tab-delimited file of Indeed job postings you saw above. Then, I had to load that data into the R workspace's memory. Here are the commands for that:

setwd("C:/Users/Doug/My Documents/RLibrary/")
# sep="\t" keeps city names that contain spaces in a single column
HadoopJobs <- read.table("HadoopJobs2012Oct.txt", header=TRUE, sep="\t")

The first command sets my R working directory. The second creates an object called "HadoopJobs" in memory which now contains the job posting counts. With that done, I just needed to produce the dot plot graph (showing job posting counts grouped by US states) and put a title on the top:

# groups must be a factor; factor() is harmless if State already is one
dotchart(HadoopJobs$JobPostings, groups=factor(HadoopJobs$State))
title("Hadoop Jobs by State (2012 Oct)")
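Since the chart leaves the individual cities unlabeled, here is one possible variation that adds them through dotchart's `labels` argument. The data frame is rebuilt inline so the snippet runs on its own; the name `jobs`, the sorting step, and the `xlab`, `pch`, and `cex` settings are additions for this sketch rather than part of my original statements.

```r
# Illustrative only: the same Indeed counts, built inline instead of read from a file.
jobs <- data.frame(
  City = c("San Francisco, CA", "New York, NY", "Seattle, WA", "Sunnyvale, CA",
           "Palo Alto, CA", "San Jose, CA", "Mountain View, CA", "Boston, MA",
           "Annapolis Junction, MD", "Reston, VA", "San Mateo, CA",
           "Chicago, IL", "Redwood City, CA", "Los Angeles, CA"),
  State = c("California", "New York", "Washington", "California",
            "California", "California", "California", "Massachusetts",
            "Maryland", "Virginia", "California",
            "Illinois", "California", "California"),
  JobPostings = c(568, 402, 179, 174, 158, 134, 124, 119, 119, 107, 105, 95, 94, 89),
  stringsAsFactors = FALSE
)

# Sort by posting count so the dots within each state group appear in order.
jobs <- jobs[order(jobs$JobPostings), ]

# labels= puts the city name next to each dot; groups= still clusters by state.
dotchart(jobs$JobPostings,
         labels = jobs$City,
         groups = factor(jobs$State),
         xlab = "Job postings",
         pch = 19,
         cex = 0.8)
title("Hadoop Jobs by State (2012 Oct)")
```

With the labels in place, San Francisco can be read directly off the chart instead of being looked up in the data table.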

I find it impressive that R is able to do all of this work in just four simple statements. For full disclosure, I did have to add a couple of other statements. The ones I just showed you put the results on the screen for me to see; in order to save the results to a JPEG picture file so that you could also view it, I had to reissue the graph commands sandwiched between the following two R functions:

jpeg(file="HadoopJobs.jpg")
# ...the dot chart commands again...
dev.off()
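To make the sandwich concrete, the following sketch wraps a small chart between a call that opens a JPEG graphics device and a call to `dev.off()` that closes it. It uses only the top three cities and writes to a temporary folder; the names `top3` and `out_file`, the output file name, and the width and height values are made up for this example. It also assumes your copy of R was built with JPEG support.

```r
# Illustrative only: a three-row stand-in for the full Indeed table.
top3 <- data.frame(
  City = c("San Francisco, CA", "New York, NY", "Seattle, WA"),
  State = c("California", "New York", "Washington"),
  JobPostings = c(568, 402, 179),
  stringsAsFactors = FALSE
)

# Hypothetical output location inside R's per-session temporary folder.
out_file <- file.path(tempdir(), "HadoopJobsTop3.jpg")

jpeg(filename = out_file, width = 640, height = 480)  # open the JPEG device
dotchart(top3$JobPostings, labels = top3$City, groups = factor(top3$State))
title("Hadoop Jobs by State (2012 Oct)")
dev.off()                                             # close it so the file is written

file.exists(out_file)
```

Nothing appears on the screen while the JPEG device is open, because the plotting commands draw into the file instead. That is why the graph commands have to be issued a second time.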

If you don't have a copy of R, it is free and open source; you can download it from the R Project website.

We may have to wait a while before demand for Big Data file repositories comes to Midwestern cities like Cincinnati, Ohio (in case you are interested, there are six Hadoop job postings here in town). 

About Me

With over 20 years of industry experience, Doug Lautzenheiser has provided business intelligence services for well-known organizations such as Procter & Gamble, JPMorgan Chase, Omnicare, Wendy’s International, the State of Indiana, and the State of Oklahoma. ComputerWorld recognized one of Doug's projects with honors for innovative use of technology.  Doug is a featured blogger on BI software at Smart Data Collective.

With his broad knowledge of technologies, business processes, and industry best practices, Doug provides client value by performing strategic advisory services; leading tactical BI application development projects; and enabling dramatic reductions in time, cost, and risks through his unique automated BI consolidation application.

Doug has hands-on experience with a variety of enterprise applications. He holds a degree, summa cum laude, in Information Systems from the University of Cincinnati. An experienced trainer and mentor, Doug has provided educational services to organizations such as National Semiconductor, Ford Motor Company, Northwest Airlines, Principal Financial Group, and Target Stores. Doug is the General Manager of Partner Intelligence.

Talk to Doug before manually performing a large BI initiative. Doug will show you how other smart companies saved time and money by following proven methodologies and automating BI processes instead of letting somebody "wing it" with a manual approach.


B2B software vendor leadership. BI implementations, standardization, and consolidation; data warehousing; WebFOCUS; iWay; BI vendors (Cognos, Business Objects/Crystal Reports, Microstrategy, Actuate, Hyperion/Brio, SAS); ERP; and full SDLC.