Have you ever wondered what people use R packages for? Unfortunately, there is no exact answer. A poll might be one way to find out, but Crantastic, a CRAN community site intended for star ratings and comments, has failed to collect enough votes to offer much insight.

Ever since RStudio began publishing the log files of its 0-Cloud repository, many blog posts have shed light on user activity. Tal Galili, among others, ranked the popularity of CRAN packages by download counts. The log files no doubt reflect the degree of interaction between users and packages, but they need cleaning up, because some packages are downloaded only as dependencies of others. Criticism of sample representativeness aside, this post focuses on reducing the bias introduced by required packages, correcting the download ranking table, and hopefully shedding more light on the question.

Specifically, using the dependencies described by the R package packdep, consecutively downloaded packages are identified for each ip_id (i.e. a daily unique ID per IP address), and the records for packages that were pulled in as dependencies are then dropped, day by day. The period of analysis is 2014-04-01 to 2014-04-30.
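The cleaning step can be sketched roughly as follows. This is a minimal illustration in Python, not the code used for this post. It assumes the log has already been loaded as a list of records with date, ip_id and package fields, and that the dependency information is available as a plain mapping (the name depends_on and the toy dependency lists are hypothetical stand-ins for what packdep provides). As a simplification, it treats every download by the same ip_id on the same day as one session rather than checking that the downloads were strictly consecutive.

```python
from collections import defaultdict


def drop_dependency_downloads(records, depends_on):
    """Remove records for packages that were pulled in as dependencies.

    records    : list of dicts with keys "date", "ip_id", "package"
    depends_on : dict mapping a package to the set of packages it requires
                 (hypothetical stand-in for the dependency data from packdep)
    """
    # Collect the set of packages downloaded per day and per ip_id.
    sessions = defaultdict(set)
    for r in records:
        sessions[(r["date"], r["ip_id"])].add(r["package"])

    cleaned = []
    for r in records:
        downloaded = sessions[(r["date"], r["ip_id"])]
        # A record counts as dependency-driven if another package in the
        # same session lists this package as a requirement.
        pulled_in = any(
            r["package"] in depends_on.get(other, set())
            for other in downloaded
            if other != r["package"]
        )
        if not pulled_in:
            cleaned.append(r)
    return cleaned


# Toy data; the dependency lists are illustrative, not the real CRAN metadata.
depends_on = {"ggplot2": {"plyr", "reshape2", "scales"}, "reshape2": {"plyr"}}
records = [
    {"date": "2014-04-01", "ip_id": 1, "package": "ggplot2"},
    {"date": "2014-04-01", "ip_id": 1, "package": "plyr"},
    {"date": "2014-04-01", "ip_id": 1, "package": "scales"},
    {"date": "2014-04-01", "ip_id": 2, "package": "plyr"},
    {"date": "2014-04-02", "ip_id": 1, "package": "plyr"},
]
cleaned = drop_dependency_downloads(records, depends_on)
```

In the toy data, plyr and scales are dropped for ip_id 1 on 2014-04-01 because ggplot2 was downloaded in the same session. The other two plyr downloads are kept, since nothing in those sessions requires plyr.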

First, a comparison of the top 15 downloaded packages in the cleaned and original data is given below.

Next, the processed data show different trends over the 30-day period. ggplot2 stands out as the most downloaded package.

Now, let’s look at the cleaned log files only. The peak days fall in the middle of the week.

More interestingly, a glimpse at the geographic distribution suggests that active downloaders are predominantly from the US. Far behind, but next in line, are Germany, China, Great Britain and Japan. For a more sophisticated ranking system, please refer to Rapporter’s population-weighted Global score.

It should also be noted that we’ve filtered out manic downloading activity. Any downloader whose download count lies more than 1.5 standard deviations from the mean is excluded. Such activity is interpreted as random and reflects little on the question this post is interested in.
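One possible reading of this filter is sketched below in Python. Several details are assumptions, since the post does not spell them out: downloads are counted per ip_id per day, only the upper side of the distribution is cut (heavy downloaders), and the sample standard deviation is used. The function name and the toy counts are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def filter_heavy_downloaders(records, k=1.5):
    """Drop all records of (date, ip_id) pairs with unusually many downloads.

    Assumption: "beyond 1.5 standard deviations" means a download count
    above mean + k * sd, computed over the per-day, per-ip_id counts.
    """
    counts = Counter((r["date"], r["ip_id"]) for r in records)
    values = list(counts.values())
    if len(values) < 2:
        # The sample standard deviation is undefined for fewer than 2 points.
        return list(records)
    threshold = mean(values) + k * stdev(values)
    return [r for r in records if counts[(r["date"], r["ip_id"])] <= threshold]


# Toy data: four ordinary downloaders and one that fetched 50 packages.
daily_counts = {1: 1, 2: 2, 3: 2, 4: 3, 5: 50}
records = [
    {"date": "2014-04-01", "ip_id": ip, "package": f"pkg{i}"}
    for ip, n in daily_counts.items()
    for i in range(n)
]
kept = filter_heavy_downloaders(records)
```

With these toy counts the mean is 11.6 and the threshold is roughly 43.8, so only ip_id 5 is excluded. Note that a single extreme downloader inflates both the mean and the standard deviation, which is one reason a median-based rule might be worth considering as an alternative.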

Back to the question that opened this post: the modified data speak to what users ultimately do with R rather than to what is critical or useful for them.