“Why does the immersive program teach data science in Python instead of R?” is a frequent question from students who are considering joining our immersive program. I have a lot of experience writing both Python and R, ranging from small one-off projects and analyses to large efforts involving many developers, and I have formed some well-balanced opinions on Python vs. R. I believe that Python is the correct choice for an educational program in data science.
My general opinion was best expressed by John Cook:
“I’d rather do math in a general-purpose language than try to do general-purpose programming in a math language.”
Python is a general-purpose programming language that has developed the ability to do statistical and scientific programming thanks to libraries like numpy, scipy, pandas, and scikit-learn (sklearn). The main focus of the Python core developers is to create a clean and usable programming experience for a wide variety of users. In contrast, R is a statistical language first and a general-purpose programming language second. The majority of contributors to R are most interested in advancing the statistical capabilities of the language.
Statistical and machine learning models are an important part of data science, and if viewed only through this lens, R does have the upper hand over Python. It is certainly true that the R user has more powerful data analysis tools available than the Python user (though that gap is closing, at least with respect to the most important and established methods).
The counterpoint is that statistical and machine learning models are only part of data science. In my career, expertise in statistics and machine learning has certainly been important, but what I have found to be more important is my ability to do data analysis and modeling in the context of rigorous and clean work products. This involves interacting with databases, the operating system, the web, and other programs (big data tools, code written by co-workers) cleanly and clearly, and in this domain Python dominates R. Python embodies modern ideas about clean program design much more clearly than R. I can certainly say that when inheriting another data scientist’s work product, my outlook is much more positive if the project was executed in Python than in R.
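As a rough illustration of the kind of glue work described above, here is a minimal sketch that reads from a database, does a little statistics, and writes a report to disk, using only the Python standard library. The table name `scores`, its columns, and the helper names (`build_demo_db`, `summarize_scores`, `write_report`) are hypothetical and chosen only for this example; a real project would likely reach for pandas or a similar library.

```python
import json
import sqlite3
import statistics
import tempfile
from contextlib import closing
from pathlib import Path


def build_demo_db(db_path):
    """Create a small SQLite database with made-up example data."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute("CREATE TABLE scores (grp TEXT, score REAL)")
        conn.executemany(
            "INSERT INTO scores VALUES (?, ?)",
            [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)],
        )
        conn.commit()


def summarize_scores(db_path):
    """Return the mean and sample standard deviation of score per group."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute("SELECT grp, score FROM scores ORDER BY grp").fetchall()

    # Collect the scores belonging to each group.
    groups = {}
    for grp, score in rows:
        groups.setdefault(grp, []).append(score)

    return {
        grp: {"mean": statistics.mean(vals), "stdev": statistics.stdev(vals)}
        for grp, vals in groups.items()
    }


def write_report(summary, report_path):
    """Write the summary as JSON so other programs can consume it."""
    report_path = Path(report_path)
    report_path.write_text(json.dumps(summary, indent=2))
    return report_path


if __name__ == "__main__":
    # Work in a temporary directory so the demo leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "demo.db"
        build_demo_db(db_path)
        summary = summarize_scores(db_path)
        report = write_report(summary, Path(tmp) / "report.json")
        print(report.read_text())
```

The point is not the statistics, which are trivial here, but that the database access, the file handling, and the analysis all sit comfortably in one small, readable program.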
An important part of our Galvanize data science program is growing strong and disciplined programmers, because at the end of the day, data science projects are communicated in code. We focus heavily on tools like Git and Unix (command-line fluency), as these allow our students to automate, share, and organize their work. Python is culturally and practically a much better choice for communicating these lessons.
I do believe that an excellent data scientist will eventually be a competent user of both Python and R. My argument above is that Python is the correct first language to learn in the data science domain. A good Python programmer will absorb lessons from the language and its culture, and those lessons will carry over to becoming a good R programmer more readily than they would if the languages were learned in the reverse order.
That’s my take on the whole thing: both languages are exceedingly useful, but as a first language, Python creates better long-term outcomes.