Language usage over time
I started programming a long, long time ago. The oldest source code I can find in the archives is from 2004. Back then, I was seventeen years old, using Visual Basic and something that I believed to be C++. There were even older projects, but I can't figure out if I kept them anywhere. It was in 2009 that I started using version control. Originally, I was creating a local repository for every project. Later on, I merged all of them into a single repository.
For the past few years, I have been tempted to check what useful data I could get from ten thousand commits over eleven years. After all, there is a lot of stuff there, and there should be plenty of opportunities to do some data analysis (sorry, I meant business intelligence, no, wait, big data, or whatever fashionable term is being used right now).
Among other things, I was wondering which languages I had been using over time. I had several hypotheses:
- As I moved to a Linux stack around 2014, I stopped using C# for personal projects, and started writing a lot of Python, Bash, and JavaScript.
- More recently, I started using C++ for Arduino projects, but it remained marginal compared to the other languages.
- The use of PHP should not be as important as C# from 2010 to 2014, and should completely disappear in the later years.
- There should be a pretty heavy use of SQL before 2014, when I was relying a lot on Microsoft SQL Server, but not so much after 2014, when I moved to MongoDB.
I therefore wanted a picture of how my usage of the different languages changed over time. Usually, such comparisons are based on the number of files in a repository, or the number of commits affecting those files, which gives only a rough and often erroneous idea. I tried to get a clearer view by computing, for each language, the number of lines of code added or removed in each commit.
Methods
The results are based on more than ten thousand commits in my personal repositories. For every diff, I detected the language being used, based on the extension of the file. When the file had no extension, the first line of the file was used to detect a possible shebang. I ignored some file types: configuration files, images, and anything else that isn't source code. For instance, XML files and DNS configuration files are not part of the results.
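The detection step described above could be sketched as follows. This is a minimal illustration, not the script actually used for the chart: the mapping tables, the list of ignored extensions, and the name `detect_language` are all assumptions made for the example.

```python
import os

# Hypothetical mappings: the real lists used for the chart are not published.
EXTENSIONS = {
    ".py": "Python",
    ".sh": "Bash",
    ".cs": "C#",
    ".cpp": "C++",
    ".js": "JavaScript",
    ".php": "PHP",
    ".sql": "SQL",
}
INTERPRETERS = {
    "python": "Python",
    "python3": "Python",
    "bash": "Bash",
    "sh": "Bash",
    "node": "JavaScript",
}
# Assumed examples of file types treated as "not source code".
IGNORED_EXTENSIONS = {".xml", ".png", ".jpg", ".zone"}


def detect_language(path, first_line=""):
    """Return a language name, or None when the file should be skipped."""
    extension = os.path.splitext(path)[1].lower()
    if extension in IGNORED_EXTENSIONS:
        return None
    if extension:
        return EXTENSIONS.get(extension)
    # No extension: look for a shebang such as "#!/usr/bin/env python3".
    if first_line.startswith("#!"):
        parts = first_line[2:].split()
        if not parts:
            return None
        command = os.path.basename(parts[0])
        if command == "env" and len(parts) > 1:
            command = parts[1]
        return INTERPRETERS.get(command)
    return None
```

Returning `None` for both ignored and unknown files keeps the caller simple: anything that is not recognized as source code just drops out of the statistics.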
Once the language was detected for a given file, I computed the number of lines of code which were added and removed in the commit. Duplicates, that is, lines which were moved within a given file or to a different file within the same commit, were ignored. Note, however, that there is no copy-paste detection (i.e. a block of code copied from elsewhere is counted as new code), nor do I keep track of the changes between the commits: a rollback of another commit is therefore counted as ordinary added or removed code.
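One possible way to ignore moved lines is to cancel out every line that appears both as an addition and as a removal in the same commit. The sketch below assumes the commit is available as a unified diff in the `git diff` format, and that two lines are "the same" when they match after trimming surrounding whitespace; both are assumptions, as is the name `count_changes`. For brevity it counts the whole commit at once rather than per language.

```python
from collections import Counter


def count_changes(diff_text):
    """Return (added, removed) for one commit's unified diff, ignoring moved lines."""
    added, removed = Counter(), Counter()
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            in_hunk = False  # a new file starts; its "---"/"+++" headers follow
        elif line.startswith("@@"):
            in_hunk = True  # hunk header: content lines come next
        elif in_hunk and line.startswith("+"):
            added[line[1:].strip()] += 1
        elif in_hunk and line.startswith("-"):
            removed[line[1:].strip()] += 1
    # A line present on both sides was moved, not written or deleted.
    moved = added & removed
    return sum((added - moved).values()), sum((removed - moved).values())
```

Tracking whether the parser is inside a hunk avoids mistaking the `---` and `+++` file headers for content, and also avoids dropping a removed SQL comment, which shows up in a diff as a line starting with `---`.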
Lines which contain only whitespace characters are ignored. On the other hand, I do not exclude lines which contain just basic symbols such as a closing curly bracket, even if such lines can hardly be considered code. This has an important repercussion when comparing languages which have a more verbose syntax, such as C#, with the ones which don't, such as Python. To avoid making the algorithm too complex, I didn't make any exception for comments either: a line containing only a comment is counted as an ordinary line of code.
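To give an idea of how much the brace-only lines weigh, here is a small sketch of the counting rule applied to two equivalent snippets. The function name `countable_lines` and the two snippets are made up for the illustration.

```python
def countable_lines(source):
    """Return the lines that would be counted: all but whitespace-only lines."""
    return [line for line in source.splitlines() if line.strip()]


# The same logic written in C# and in Python.
CSHARP = """
if (total > 0)
{
    Console.WriteLine(total);
}
"""

PYTHON = """
if total > 0:
    print(total)
"""

# Brace-only lines are counted, so the C# version weighs twice as much.
print(len(countable_lines(CSHARP)), len(countable_lines(PYTHON)))  # prints "4 2"
```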
The lines of code were then aggregated to form a chart, where one line in the chart represents n days of commits. Those aggregated results are displayed using a logarithmic scale, which is necessary given the huge gap between the weeks which had practically no activity and the ones where a lot of code was added or removed.
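The aggregation could look roughly like the sketch below. The bucket size of seven days, the tuple layout of `commits`, and the choice of `log10(1 + x)` for the scale are assumptions made for the example; the post only says that n days are grouped together and that the scale is logarithmic.

```python
import math
from collections import defaultdict


def aggregate(commits, days_per_bucket=7):
    """Sum changed lines per language over buckets of `days_per_bucket` days.

    `commits` is an iterable of (commit_date, language, changed_lines) tuples,
    where commit_date is a datetime.date.
    """
    totals = defaultdict(int)
    for commit_date, language, changed_lines in commits:
        # Integer division of the day number puts nearby days in one bucket.
        bucket = commit_date.toordinal() // days_per_bucket
        totals[(bucket, language)] += changed_lines
    return dict(totals)


def log_height(changed_lines):
    """Map a line count to a chart height; log10(1 + x) keeps zero at zero."""
    return math.log10(1 + changed_lines)
```

With such a scale, a quiet week with 9 changed lines and a busy one with 9999 end up at heights of about 1 and 4, so both remain visible on the same chart.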
Results
The interactive chart below shows the LOCs for some of the languages. In gray are all the languages combined.
- C++
- Bash
- Python
- C#
- EJS
- Java
- JavaScript
- LESS
- PHP
- SQL
- XSLT
Figure 1 Number of LOCs added and removed over time.
Discussion
The data makes it possible to validate the original hypotheses:
As I moved to a Linux stack around 2014, I stopped using C# for personal projects, and started writing a lot of Python, Bash, and JavaScript.
True, except for JavaScript. I was surprised to see that the usage of JavaScript actually decreased after 2014. The peak in 2014 corresponds to the development of the original blog engine in Node.js. The activity from 2009 to 2014 most likely reflects the large number of websites and web applications I was building. Later on, I was focusing on REST services, which don't have any JavaScript, as well as on web applications such as the Bookshelf app, which contain nearly no code running in the browser.
More recently, I started using C++ for Arduino projects, but it remained marginal compared to the other languages.
Not that marginal. As I use C++ in only one project, the commits affecting C++ code are not regular; however, there are at least three periods where a lot of C++ was changed over several weeks.
The use of PHP should not be as important as C# from 2010 to 2014, and should completely disappear in the later years.
True. While I was writing a lot of PHP in 2008 and 2009, I used it only occasionally from 2010 on, and stopped completely after 2014. I hope I'll get back to it, now that it has PSRs, Laravel, and some cool new features.
There should be a pretty heavy use of SQL before 2014, when I was relying a lot on Microsoft SQL Server, but not so much after 2014, when I moved to MongoDB.
Mostly true. I had, however, overlooked 2018, when I was creating an offsite backup tool which relies on PostgreSQL.
Visualizations such as this one make it easy to identify some patterns: the most used languages, gaps, shifts to different technologies. They have their limitations, as does anything based on lines of code. For instance, it looks like I was writing more code before 2014, but I would imagine that this has to do with C# using more lines of code than Python. Overall, it could be an interesting tool for anyone who wants to understand a bit better which languages they used in the past and use now.