frie.codes: curl vs RCurl or: how to choose a package

The text message

Today’s text message is from my good friend Pablo. Pablo is currently in the last months of his PhD in Survey Research at the University of Salamanca, Spain. I know him from my Erasmus year at the University of Essex where we were flatmates and both took classes in survey research. Originally an SPSS / Stata guy, he has been using R more and more over the last few years and I’ve been his personal “R guru”. Which is probably my dream job, tbh.

Anyway, to the text message (excuse the weird highlighting, still figuring that one out):

Pablo Cabrera Alvarez, [17.05.19 12:42]
Hi Frie

Pablo Cabrera Alvarez, [17.05.19 12:43]
I'm desperate with something I need your help

Pablo Cabrera Alvarez, [17.05.19 12:43]
😭😭😭😭

Frie, [17.05.19 12:44]
Oh no what

Frie, [17.05.19 12:44]
Is happening

Pablo Cabrera Alvarez, [17.05.19 12:45]
look, I have this webpage from which I want to download content: download.files() That's ok

[SOME UNHELPFUL BANTER FROM MY SIDE]

Pablo Cabrera Alvarez, [17.05.19 12:45]
My problem is that the webpage needs "authentication"

Frie, [17.05.19 12:45]
Oh OK

Pablo Cabrera Alvarez, [17.05.19 12:45]
I have the credentials

Frie, [17.05.19 12:45]
Yes

Frie, [17.05.19 12:45]
Ah

Frie, [17.05.19 12:45]
Mh

Pablo Cabrera Alvarez, [17.05.19 12:45]
I have tried with Rcurl

Frie, [17.05.19 12:45]
And?

Pablo Cabrera Alvarez, [17.05.19 12:46]
but it looks like the SSL protocol is different

Pablo Cabrera Alvarez, [17.05.19 12:46]
look, this si the error

Frie, [17.05.19 12:46]
Yes

Frie, [17.05.19 12:46]
Can you send me the command?

Frie, [17.05.19 12:46]
I have a bit time to look into it

Pablo Cabrera Alvarez, [17.05.19 12:46]
x <- getURL("https://THISWEBSITE/THISFILE.zip", userpwd="USER:PASSWORD6", httpauth = 4)
Error in function (type, msg, asError = TRUE)  : 
  error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

In summary, Pablo wanted to use R to download a zip file from the Internet. Of course, he could’ve just downloaded it manually via the browser and put it into his data directory. But doing this in code is actually nice because it increases reproducibility and at the same time documents where the data is coming from.

Usually you can achieve this in R by simply using download.file. However, when the file is in any way protected, things get a little bit more complicated. In this case, the file was protected with so-called “basic auth”. Basic authentication just means plain old username and password. If you have ever had an ugly-looking popup asking you for username and password, that was probably Basic Auth. In those cases, you often have to use a curl wrapper in R. curl is, broadly speaking, software for “transferring data in various protocols” (Wikipedia). It consists of a C library called libcurl and a command-line tool called curl.
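For the unprotected case, a minimal sketch could look like this. The helper name fetch_file is made up for illustration, and the URL in the usage comment is just the placeholder from Pablo's message:

```r
# Hypothetical helper: download an unprotected file and return the local path.
fetch_file <- function(url, destfile) {
  # mode = "wb" writes the file in binary mode, which matters for zip archives.
  status <- download.file(url, destfile, mode = "wb", quiet = TRUE)
  # download.file() signals success with a status code of 0.
  if (status != 0) stop("download failed with status ", status)
  invisible(destfile)
}

# Usage with a placeholder URL (not run):
# fetch_file("https://THISWEBSITE/THISFILE.zip", "THISFILE.zip")
```

As soon as the server asks for credentials, a plain call like this is no longer enough, which is where the curl wrappers come in.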

Enough background info. Let’s get to how I solved it.

My answer

(If you want to skip the story, go straight to the solution.) My initial reaction was: “oh boy, this looks nasty.” I had never seen any error like this before. I knew that a tlsv1 alert protocol version error was probably not coming from a simple mistake that would be easy to a) debug and b) fix. At least not for me.

What I did know was that the last time I personally had used the RCurl package had been in 2014. Since then, I had managed with just using httr. But I also remembered that there was a newer R package called curl.

In the end, my debugging strategy was:

1. Try with command line curl to rule out server-side errors or errors at the system library level.
2. If command line curl is successful, use R package curl.
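The first step could be sketched from within R roughly as follows. This assumes a curl binary is available on the PATH; the helper name curl_cli_args is hypothetical, and the URL and credentials are the placeholders from Pablo's message:

```r
# Hypothetical helper: build the arguments for a command-line curl call.
curl_cli_args <- function(url, destfile, user, password) {
  c("--user", paste0(user, ":", password),  # basic auth credentials
    "--output", destfile,                   # write the response to this file
    url)
}

args <- curl_cli_args("https://THISWEBSITE/THISFILE.zip", "THISFILE.zip",
                      "USER", "PASSWORD")

# Not run here because the URL is a placeholder.
# shQuote() protects special characters in the password from the shell.
# system2("curl", shQuote(args))
```

If this call succeeds while the R code fails, the problem is more likely in the R package than on the server.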

As this conversation happened right at the end of my lunch break (hi, boss, if you ever read this 👋) and I did not have much time left, I decided to skip 1) and go straight to 2).

(Editing Frie: The following is how I think my process was. Maybe it was totally different?!?! Next time, I’ll screen-record.)

I installed the curl R package on my machine. Next up was probably googling “curl R package” which led me to its website. Right at the start is a summary of the most important functions:

curl_fetch_memory() saves response in memory
curl_download() or curl_fetch_disk() writes response to disk
curl() or curl_fetch_stream() streams response data
curl_fetch_multi() (Advanced) process responses via callback functions

It took me a few minutes of not very careful reading to comprehend that what I needed was curl_download. After I had realized this, I headed back to RStudio and typed ?curl::curl_download in the console to open the help.

From the Description:

Libcurl implementation of C_download (the “internal” download method) with added support for https, ftps, gzip, etc. Default behavior is identical to download.file, but request can be fully configured by passing a custom handle.

“fully configured” sounded good, so I had a look at the Usage section:

From this, it was clear to me where I would need to insert the URL (url) and how I could specify the destination file (destfile). What was not so clear to me was how I could pass the username and password required for basic authentication. But by process of elimination, it became clear to me that it probably had to go into the handle argument:

url: probably the URL we want to download from
destfile: probably the file we want to write to
quiet: no idea but a boolean will not work for username/password. Plus, “quiet” has nothing to do with authentication
mode: from looking at the default argument ("wb"), probably something with the file mode.

So, handle was the only one left. Plus, I vaguely remembered configuring so-called handle objects back when using RCurl.

What I had found out so far:
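In code, that intermediate state might have looked roughly like this. It is a sketch under assumptions: the URL and file name are the placeholders from Pablo's message, and the handle is still an unconfigured one.

```r
library(curl)

# Arguments of curl_download(), matching the ones discussed above.
names(formals(curl_download))

# Placeholders taken from Pablo's message; not a real address.
url <- "https://THISWEBSITE/THISFILE.zip"
destfile <- "THISFILE.zip"

# Still unconfigured: the credentials have to go in here somehow.
h <- new_handle()

# Not run yet, because the credentials are still missing:
# curl_download(url, destfile, handle = h)
```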

I went back to Firefox to find out more about the handle, specifically how to pass basic authentication details to it. Because I couldn’t find the needed information on the detailed project website just by skimming (why read carefully if you can just jump around?), I tried the project’s GitHub page. Still, no luck as the “Hello World” examples only covered setting HTTP request headers but not authentication. So finally, I took the time to read the package website more carefully and, lo and behold, there was a section on “Configuring a handle”.

Creating a new handle is done using new_handle. After creating a handle object, we can set the libcurl options and http request headers.

Use the curl_options() function to get a list of the options supported by your version of libcurl. The libcurl documentation explains what each option does. Option names are not case sensitive.
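As a small illustration of what that section describes, here is a sketch of creating and configuring a handle. The verbose option and the User-Agent value are arbitrary choices for the example, not something from the original debugging session:

```r
library(curl)

# Create an empty handle and configure it step by step.
h <- new_handle()
handle_setopt(h, verbose = TRUE)                    # a libcurl option
handle_setheaders(h, "User-Agent" = "my-r-script")  # an HTTP request header (made-up value)

# Options can also be passed directly when the handle is created.
h2 <- new_handle(verbose = TRUE)
```

Nothing is sent over the network at this point; the handle only takes effect once it is passed to a function such as curl_download.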

“Curl options” sounded good: Over the course of the last 1.5 years, I have written a lot of curl requests in the terminal, e.g. to do quick checks on databases. From this experience, I know that there are command line options for setting basic authentication in the terminal curl command, so there should be underlying libcurl equivalents because after all, terminal curl relies on libcurl. Does this even make sense?

Anyway, I got the options:

[1] 247
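The count above presumably came from a call along these lines. The exact number depends on the libcurl version installed on your system, so yours may differ:

```r
library(curl)

# Named vector of all options supported by the installed libcurl.
opts <- curl_options()

# How many are there?
length(opts)
```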

Of course, I entered curl::curl_options() to see all the options. But because there are quite a lot and I want to save you from endlessly scrolling, I have added the length for the purpose of this blog post. Getting all options printed out is left as an exercise to the reader. 😉 Because I didn’t have time to read through roughly 250 options, I decided to take the Google route again and try to find the name of the option on the Internet:

Nice! Especially the CURLOPT_USERPWD immediately appealed to me because in his original RCurl command, Pablo had a userpwd argument as well. Without even checking the links, I headed back to R to find out whether there were any options matching those I found:

         use_ssl        useragent         username          userpwd 
             119            10018            10173            10005 
         verbose    wildcardmatch        writedata    writefunction 
              41              197            10001            20011 
xferinfofunction   xoauth2_bearer 
           20219            10220
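Output like the one above can be produced in more than one way. Two possibilities, as a sketch (which one was used originally is an assumption on my part):

```r
library(curl)

# The last ten options in the list.
tail(curl_options(), 10)

# Or only the options whose name contains "user".
user_opts <- curl_options(filter = "user")
user_opts
```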

Bingo for userpwd!

Final solution

Now I was ready to set up my handle. From the package website, I knew that setting options was done with curl::handle_setopt:
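Reconstructed as a sketch, the solution looks roughly like this. The helper name download_with_basic_auth is made up for this example, and the URL and credentials in the usage comment are the placeholders from Pablo's message:

```r
library(curl)

# Hypothetical helper wrapping the approach described in this post.
download_with_basic_auth <- function(url, destfile, user, password) {
  h <- new_handle()
  # userpwd is the option found above; it expects "user:password".
  handle_setopt(h, userpwd = paste0(user, ":", password))
  # Pass the configured handle so the credentials are sent with the request.
  curl_download(url, destfile, handle = h)
}

# Usage with placeholders (not run):
# download_with_basic_auth("https://THISWEBSITE/THISFILE.zip", "THISFILE.zip",
#                          "USER", "PASSWORD")
```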

I crossed my fingers and executed the command. And it just worked - not something that usually happens to me. I saved the code in a file and sent it to Pablo, still not sure it’d work on his computer as well. But it did! How cool!

Frie, [17.05.19 13:01]
well does it work for starters? ;)

Frie, [17.05.19 13:01]
(as it depends on system library, could also not work on your machine)

Pablo Cabrera Alvarez, [17.05.19 13:01]
I owe you more than one dinner, believe me

Pablo Cabrera Alvarez, [17.05.19 13:01]
yes yes, I just tried

Pablo Cabrera Alvarez, [17.05.19 13:02]
it's perfect

After approximately 15 minutes, issue solved. 💪

Non-technical knowledge or: how to choose a package

However, there was still an open question:

Pablo Cabrera Alvarez, [17.05.19 13:01]
how did you know?? I have been three hours visiting forums and stuff

By that time, I really had to get back to work so my answer was a bit short and curt. But it’s a good question that points to the importance of what I like to call “non-technical knowledge”. What I mean by this is having the knowledge to answer questions like:

what packages exist for solving problem z?
which package do I use for solving z? x or y?
is this Stack Overflow answer worth trying out?
how do I google my problem?
where can I find good information?
…

Of course, technical skills help with answering those questions but it is not quite the same.

While I could talk about each of those questions for ages, let’s focus on the first two for the moment: How did I know about the curl package and why did I prefer it over RCurl?

For me personally, the answer to the first question boils down to keeping up with the latest developments in R. I use Twitter for that purpose because the R community is quite active there (under the hashtag #rstats, not #R!) and I follow many many R users and developers. For all people who do not want to ruin their phone usage statistics, Maëlle Salmon has written a good blog post on “Keeping up to date with R news”. Among her recommendations are mailing lists, news aggregators like R-Bloggers or R Weekly, attending meetups and conferences and much more.

As for the second question (“do I use package x or y?”), I think the following “rules” feed into my decision:

1. use the tidyverse (or rOpenSci) version if there is one: The tidyverse is probably the biggest change the R language has experienced in the last ~5 years. Thanks to the core developers being actually employed for doing this work by RStudio, tidyverse packages, the official ones in particular, are very well maintained and up to date. Similarly, the non-profit initiative rOpenSci maintains a list of packages that are “carefully vetted, staff- and community-contributed R software tools that lower barriers to working with scientific data sources and data that support research applications on the web.” So, if I have to choose between a package that is part of tidyverse or rOpenSci and one that is not, I’ll always choose the former.
2. use the more popular package (e.g. CRAN downloads): There are almost 15,000 R packages on CRAN¹, a massive number. Of course, each package has its value but in general, the more downloads a package has, the higher the probability it’ll work in my experience. Another indicator of importance / popularity is the number of GitHub stars. Popular packages are just too important to be left without updates and bug fixes (at this point, let’s have a round of applause for all the open source developers who put a lot of work and heart - often in their free time - into developing R packages! 👏👏👏).
3. use the newer package / don’t use an unmaintained package: Newer is not always better but if the publication date of the package I’ve encountered during my Google search is a few years back, I’ll try to google again. You can find the publication date of a package on its CRAN page. Especially given that Stack Overflow answers go back over 10 years, I find it worth checking the date of the answer and the publication date of the recommended package. Update 2019-05-22, 20:37: This is particularly relevant because old, unmaintained packages can have serious security issues which can be disastrous. It is not just a matter of it working or not working. Even if it works, it could still be the case that it is not properly protected against newer types of vulnerabilities.
4. use the package with the better documentation: This is just out of convenience. I am not the biggest fan of using the built-in help because it often does not provide enough context for me to get started. This is why I really love me a good GitHub Readme or even package website like https://rplumber.io (again, round of applause for those writing docs 👏👏👏). If in doubt, I’ll choose the package with more / better documentation. This does not mean that a package with less docs is necessarily worse at doing its job. But it’s just easier to start out with an example from the Readme than to be left alone with ?.
5. use packages from people that I trust to be good developers: This final “rule” feeds back nicely to “staying updated”. If I see that a package has been developed by someone I “know” from Twitter, I’m more likely to trust that it is good. Which is a bit silly because someone without Twitter could be as good a developer as this Twitter person with 10,000 followers. For me personally, it just serves as an additional way of establishing trust in the quality of the package.

Those “rules” are roughly in order of importance although I guess the order and relative importance of them differs depending on the specific case. Sometimes, there is a “popular” package as measured by the number of package downloads but it is just popular because it has been around forever. Sometimes, people with a lot of followers on Twitter produce shitty packages. And sometimes, although very rarely nowadays, those “rules” just fail and I end up using a package with 10 downloads from 5 years ago. ¯\_(ツ)_/¯
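For the publication date check from the rules above, there is also a way to look it up locally for packages you already have installed. A small sketch, with a made-up helper name, assuming the package was installed from CRAN (packages from other sources may not carry this field):

```r
# Hypothetical helper: publication date of an installed package, or NA if unknown.
publication_date <- function(pkg) {
  desc <- utils::packageDescription(pkg)
  # packageDescription() returns NA (with a warning) if the package is not installed.
  if (!is.list(desc)) return(NA_character_)
  # "Date/Publication" is a field CRAN adds to the DESCRIPTION file.
  date <- desc[["Date/Publication"]]
  if (is.null(date)) NA_character_ else date
}

# Usage (results depend on what is installed on your machine):
# publication_date("curl")
# publication_date("RCurl")
```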

For the RCurl vs curl case described above, it was a combination of 4. and 5. I knew from Twitter that there was a new package for curl operations from Jeroen. I had also heard a lot of praise about his work which I could only agree with after having used his openssl and jose packages for developing sealr. The curl package also had a nice project website + GitHub Readme and it was easy for me to check that Jeroen was still actively working on the package. In contrast, as I mentioned above, I had not used RCurl since 2014 and it does not have a nice GitHub repository, only an old-school looking website that I actually only found after checking again for this blog post (nothing against old school but yeah).

Update 2019-05-22, 20:37: After posting about this post on Twitter, Jeroen was kind enough to quote-tweet my tweet, confirming my suspicion about RCurl being an outdated package:

I'm probably biased, but imo all #rstats users should make the switch to ‘curl’/‘httr’ asap. The old ‘RCurl’ pkg has been unmaintained for years and is broken beyond repair. It's unfortunate this is unknown to many new users. https://t.co/kmlWTicYHN
— Jeroen Ooms (@opencpu) May 22, 2019

So we can add “rule” number 3. (which I updated as well to emphasize the security reasons) to the list.

Finally, that error looked really nasty and I just didn’t want to have that on my screen. 😂

The end

Well, this escalated into quite a long post. Let me know on Twitter if I should try to keep it shorter or whether this is fine.

I still hope it was interesting for you and you could take something away from this – and if this “something” is that I probably spend too much time on Twitter…you’re right.

Until next time: keep coding. ❤️

¹ Source: https://cran.r-project.org/web/packages/
