GitHub Language Correlation for Jan 1 2014 - Feb 1 2015

GitHub seems like a great place to view some statistics regarding development languages.

I wanted to see what I could find out there with regard to GitHub data.

I had a couple of choices: use their open API (which only allows a limited number of requests per day) or use Google's BigQuery engine, which happens to have a ton of GitHub data available in its list of public data sets.

So I went with Google's BigQuery engine. And it's pretty slick. Despite a lot of intermittent failures when running queries or downloading data, I was able to do some pretty cool querying.

I found some data in there (old data, from 2012) regarding language correlation. I knew I wanted to do something like that, but definitely needed something more recent than ancient 2012.

And thanks to Data Hacker MD's post on Language Use On GitHub I was able to get started.

The steps were pretty simple:

1) Get a free 60-day trial account on Google's BigQuery

2) Filter out the data I wanted, and shove that into a new "table"

3) Export the table to a very large CSV file

4) Run Data Hacker MD's Python script over the data (which uses Spearman's rank correlation to determine the correlation between any 2 languages)

5) Export the results as an SVG file (Scalable Vector Graphics) that you can see below
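To make step 4 a bit more concrete, here is a minimal sketch of how Spearman's rank correlation between languages could be computed from an exported CSV. This is not Data Hacker MD's script, just an illustration using only the Python standard library. The column names (user, language, events) and the sample rows are made up for the example; the real export will look different.

```python
import csv
import io
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical sample in an assumed "user,language,events" layout.
SAMPLE = """user,language,events
alice,JavaScript,10
alice,CSS,8
bob,JavaScript,3
bob,CSS,2
carol,JavaScript,1
carol,Python,9
dave,Python,4
dave,CSS,1
"""


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho is the Pearson correlation of the ranked values."""
    return pearson(average_ranks(xs), average_ranks(ys))


def language_correlations(csv_text):
    """Return {(lang_a, lang_b): rho} over per-user activity counts."""
    counts = defaultdict(lambda: defaultdict(int))
    languages = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["user"]][row["language"]] += int(row["events"])
        languages.add(row["language"])
    users = sorted(counts)
    result = {}
    for a, b in combinations(sorted(languages), 2):
        xs = [counts[u][a] for u in users]  # a missing language counts as 0
        ys = [counts[u][b] for u in users]
        result[(a, b)] = spearman(xs, ys)
    return result


if __name__ == "__main__":
    for pair, rho in language_correlations(SAMPLE).items():
        print(pair, round(rho, 3))
```

Each user contributes one data point per language (their activity count, or 0 if they never touched it), and the correlation is computed on the ranks of those counts rather than the raw numbers, so a handful of extremely active users should not dominate the result.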

What we have here is an answer to the question:

"If person A writes code in Language X, are they also likely to write code in Language Y?"

Or in other words:

"If someone writes a lot of JavaScript, does that mean they also write a lot of CSS?" (Answer: not surprisingly, the chart below shows a moderately strong correlation.)

Some interesting conclusions:

  • There is a fairly strong positive correlation between PowerShell and TypeScript...very weird
  • C# users aren't likely to avoid other languages (I don't see any negative correlations in the C# column...well except for Puppet)
  • But Visual Basic users are a bit more insular and seem to favor certain languages over others.
  • Plus, very few languages had a strong negative correlation (XSLT and Julia...)

But overall, and maybe it's just me, comparing my chart for 2014 - Feb 2015 to Data Hacker MD's chart seems to show that software developers are developing across multiple languages more and more (i.e. less reddish coloring).

2014-2015 GitHub Language Correlation