Thursday, October 25, 2012

Using R to Analyze Hadoop Jobs

Using summarized data from the Indeed job board, I tried my hand at some R graphs. Considering the hot nature of Big Data, I chose job openings in the United States for Hadoop.

Here is a summary of the top jobs from Indeed:

City                    State          JobPostings
San Francisco, CA       California     568
New York, NY            New York       402
Seattle, WA             Washington     179
Sunnyvale, CA           California     174
Palo Alto, CA           California     158
San Jose, CA            California     134
Mountain View, CA       California     124
Boston, MA              Massachusetts  119
Annapolis Junction, MD  Maryland       119
Reston, VA              Virginia       107
San Mateo, CA           California     105
Chicago, IL             Illinois        95
Redwood City, CA        California      94
Los Angeles, CA         California      89

It appears that Indeed didn't give me a complete summary of all of the Hadoop jobs, just the top fourteen cities. Oh well, let's look at those.

To visualize the data, I played with a variety of R plot options, but ultimately settled on a Cleveland dot plot.

The graph shows there were eight cities within California posting a large number of jobs for Hadoop experience (Indeed's summary covered cities with at least 89 postings). Of those eight, San Francisco had the largest volume of opportunities, with almost 600 postings (not labeled on the graph, but you can spot it easily in the data table above).

Among these top US locations, every state other than California had its Hadoop opportunities concentrated in a single major city. As you might guess, these are happening places, such as New York City, Boston, and Seattle.

The largest clusterings of Hadoop jobs were in San Francisco and New York City. Behind them were the California tech hot spots: Sunnyvale, Palo Alto, San Jose, and Mountain View.
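
Those per-city counts can be rolled up by state to make the clustering concrete. A minimal sketch, recreating a few illustrative rows of the Indeed table inline so it runs on its own (the full analysis would use the loaded data frame instead):

```r
# A few rows of the Indeed table, recreated inline (illustrative subset).
HadoopJobs <- data.frame(
  City        = c("San Francisco, CA", "Sunnyvale, CA", "New York, NY", "Seattle, WA"),
  State       = c("California", "California", "New York", "Washington"),
  JobPostings = c(568, 174, 402, 179)
)

# Total the postings per state.
totals <- aggregate(JobPostings ~ State, data = HadoopJobs, FUN = sum)
print(totals)
```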

Working with R is slightly different from working with other programming languages. Instead of writing a program that you run to get results, in R you interact with a workspace and examine the results as you go along.

To produce this graph, I first created the tab-delimited file of Indeed job postings you saw above. Then, I had to load that data into the R workspace's memory. Here are the commands for that:

setwd("C:/Users/Doug/My Documents/RLibrary/") 
HadoopJobs<-read.table("HadoopJobs2012Oct.txt", header=TRUE)

The first command sets my R working directory. The second creates an object called "HadoopJobs" in memory, which now contains the job posting counts. With that done, I just needed to produce the dot plot graph (showing job posting counts grouped by US state) and put a title on the top:

dotchart(HadoopJobs$JobPostings, groups=HadoopJobs$State)
title("Hadoop Jobs by State (2012 Oct)")
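
One optional refinement: dotchart() plots rows in the order they appear, so sorting by posting count first puts the busiest cities at the top of each state's group. A sketch with a few illustrative rows recreated inline:

```r
# Illustrative subset of the Indeed table, recreated inline.
HadoopJobs <- data.frame(
  City        = c("San Francisco, CA", "Sunnyvale, CA", "New York, NY", "Seattle, WA"),
  State       = c("California", "California", "New York", "Washington"),
  JobPostings = c(568, 174, 402, 179)
)

# Sort ascending: dotchart() draws the first row at the bottom,
# so the largest counts land at the top of each group.
sorted <- HadoopJobs[order(HadoopJobs$JobPostings), ]
dotchart(sorted$JobPostings, labels = sorted$City,
         groups = factor(sorted$State))
title("Hadoop Jobs by State (2012 Oct)")
```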

I find it impressive that R is able to do all of this work in just four simple statements. For full disclosure, I did have to add a couple of other statements. The ones I just showed you put the results on the screen for me to see; in order to save the results to a JPEG picture file so that you could also view it, I had to reissue the graph commands sandwiched between two other R functions: jpeg(), which opens the picture file, and dev.off(), which closes it:

jpeg(file="HadoopJobs.jpg")
dotchart(HadoopJobs$JobPostings, groups=HadoopJobs$State)
title("Hadoop Jobs by State (2012 Oct)")
dev.off()

If you don't have a copy of R, be sure to download a free open-source copy from the R Project website (www.r-project.org).

We may have to wait a while before demand for Big Data file repositories comes to Midwestern cities like Cincinnati, Ohio (in case you are interested, there are six Hadoop job postings here in town). 


About Me


I am a project-based software consultant, specializing in automating transitions from legacy reporting applications into modern BI/Analytics to leverage Social, Cloud, Mobile, Big Data, Visualizations, and Predictive Analytics using Information Builders' WebFOCUS. Based on scores of successful engagements, I have assembled proven Best Practice methodologies, software tools, and templates.

I have been blessed to work with innovators from firms such as: Ford, FedEx, Procter & Gamble, Nationwide, The Wendy's Company, The Kroger Co., JPMorgan Chase, MasterCard, Bank of America Merrill Lynch, Siemens, American Express, and others.

I was educated at Valparaiso University and the University of Cincinnati, where I graduated summa cum laude. In 1990, I joined Information Builders and for over a dozen years served in regional pre- and post-sales technical leadership roles. Also, for several years I led the US technical services teams within Cincom Systems' ERP software product group and the Midwest custom software services arm of Xerox.

Since 2007, I have provided enterprise BI services such as: strategic advice; architecture, design, and software application development of intelligence systems (interactive dashboards and mobile); data warehousing; and automated modernization of legacy reporting. My experience with BI products include WebFOCUS (vendor certified expert), R, SAP Business Objects (WebI, Crystal Reports), Tableau, and others.