Tidy Carrots

Gurudev Ilangovan

September 21, 2017


Thoughts on R and Machine Learning…

I started coding in R in 2015. I was pretty new to coding and especially to data science. I had heard of packages like ggplot2 and dplyr, but I assumed they were packages you picked up once you had learned the basics of R. ggplot2 seemed tough, with words like geoms and aes, and so did dplyr, with its dangerous-looking %>%. For a year and a half, I never made an effort to learn new tools, no matter how much I struggled with base R (and to be honest, I never really learnt that properly either).

That was a costly year and a half to waste. If I had started with r4ds (R for Data Science) back then, my data science journey would have been so much faster and so much more focused. The thing with the tools of the tidyverse is that they feed off each other. You learn one tool and you automatically have a head start when you learn another, because they are designed to work together. With no other tool but dplyr, I could take care of my daily data wrangling, and with ggplot2, I could actually compose plots. It almost felt like a superpower. Then came the tidyr functions spread and gather and, more importantly, the idea of tidy data.
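Here is a minimal sketch of what that looks like in practice, assuming the tidyr and dplyr packages are installed. The exam scores table is made up purely for illustration. (Newer tidyr releases also offer pivot_longer and pivot_wider for the same reshaping.)

```r
library(dplyr)
library(tidyr)

# Hypothetical "wide" table: one column per subject.
wide <- tibble(
  student = c("a", "b"),
  math    = c(90, 75),
  art     = c(60, 85)
)

# gather() stacks the subject columns into key/value pairs,
# giving one row per student-subject observation (tidy form).
long <- wide %>%
  gather(key = "subject", value = "score", -student)

# Once the data is tidy, dplyr verbs apply directly.
means <- long %>%
  group_by(subject) %>%
  summarise(mean_score = mean(score))

# spread() is the inverse: back to one column per subject.
back <- long %>%
  spread(key = subject, value = score)
```

The long form is the one that plays well with group_by and ggplot2; the wide form is usually only needed for display.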

I still clung to the apply family of functions. Not that they are inferior or that map is superior. It's just that map makes it easier to organize your thoughts in an analytical workflow. With map, I truly saw the potential of functional programming, where a few lines of code achieve what would otherwise take a clunky for loop, in a way I never did with apply. It also paved the way for easy data frames with list columns. I used this paradigm heavily in approxmapR, a package that I wrote. The thing with data frames is that they fit into a nice semantic model: it is easier to organize your thinking in a tabular fashion than when it is scattered across separate objects.
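To make the map and list-column idea concrete, here is a small sketch using the built-in mtcars data. It assumes dplyr, tidyr (version 1.0 or later, for the nest(data = ...) syntax) and purrr are installed; the model formula and column names are just illustrative choices.

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per cylinder count; all other columns are tucked
# into a list column named "data".
by_cyl <- as_tibble(mtcars) %>%
  nest(data = -cyl)

# map() fits one linear model per row and stores it in a new
# list column; map_dbl() pulls a single number out of each fit.
models <- by_cyl %>%
  mutate(
    fit   = map(data, ~ lm(mpg ~ wt, data = .x)),
    slope = map_dbl(fit, ~ coef(.x)[["wt"]])
  )
```

The data, the fitted models and the extracted slopes all sit side by side in one table, which is exactly the tabular way of thinking described above. The same thing with a for loop would need pre-allocated lists and manual bookkeeping of which model belongs to which group.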

However, R’s machine learning ecosystem is so disorganized in comparison with Python’s beautifully centralized scikit-learn package. NOT! I came across caret recently (okay, that’s a lie: I finally gathered myself to learn caret recently). And what a beautiful package it is! Learning caret alongside Applied Predictive Modeling, the book by Max Kuhn (the package author), has opened up a world of machine learning possibilities for me. It is incredibly powerful, and that’s coming from a person who barely knows 20% of the functionality in caret.

I saw a post recently that combines the best of both these incredible worlds: pur-r-ify-your-carets. It shows how much becomes possible in R when you combine powerful tools.

To a person who’s starting out in data science and who’s chosen R, I’d like to share a few suggestions in terms of tools and thinking:

  1. Tidy data (tidyr)
  2. group_by and mutate/summarize (dplyr)
  3. Compose your plots (ggplot2)
  4. purrr magic (Tibbles and list columns)
  5. caret for ML
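As a taste of items 2 and 3, here is a short sketch on the built-in mtcars data, assuming dplyr and ggplot2 are installed. The column names mean_mpg and mpg_centered are made up for this example.

```r
library(dplyr)
library(ggplot2)

# group_by + summarise: collapse each group to a single row.
per_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())

# group_by + mutate: keep every row, but compute within groups.
centered <- mtcars %>%
  group_by(cyl) %>%
  mutate(mpg_centered = mpg - mean(mpg)) %>%
  ungroup()

# Compose a plot layer by layer: data and aesthetics first,
# then points, then a per-group linear trend line.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight", y = "Miles per gallon", colour = "Cylinders")
```

Printing p draws the plot. The difference between summarise and mutate after a group_by is worth internalizing early: the first shrinks the table, the second only adds columns.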

Cheers! Happy hacking!
