I hope you are enjoying the “Learning R Programming for Free” series; here are links to the previous segments (Step One, Step Two, Step Three, Step Four, Step Five, Step Six, Step Seven, Step Eight, Step Nine) to provide some helpful background.
In the previous installment, we learned about ggplot. In this installment, we will discuss setting up R on RedHat Linux 6.
On a recent client project, I was asked to build a PLSQL program to load data from an incoming file to a database table. Sounds simple enough, and I have built this probably 100 times in my career.
To get started, I asked the client to send me a sample of the file we need to read. I was told it is an Excel (XLS) file and that the owners of the file will not supply it as CSV. If you have (or haven’t) written much PLSQL, you’ll know that Oracle doesn’t supply code to read from XLS or XLSX, and that to write that code is several hundred lines of code (that someone will have to maintain). And no, we cannot buy any tools to convert the file.
So, I had to think about what I know in my bag of tricks that could help with this…and the answer was…R! I can write a small shell script wrapper that calls R’s “rio package” to convert my file from XLS to CSV, and then call “sqlldr” to load the CSV data to the target table in the database. Since this such a common task, I will share the entire process from start to finish.
My project environment is Linux this time. First, I’ll teach how to install R on Linux and then I’ll demonstrate how to add the rio package.
To install R on RedHat Linux, use “yum.” You will need root access or work with your system administrator.
yum install R
We now have the “baes-R” installed. Run a test script to check the installation. Type “R” to startup the R interpreter in Linux, then we will print “Hello World!”:
> aString = “Hello World!”
> print (aString)
 “Hello World!”
What does rio do? From the help page:
rio: A Swiss-Army Knife for Data I/O
Streamlined data import and export by making assumptions that the user is probably willing to make: ‘import()’ and ‘export()’ determine the data structure from the file extension, reasonable defaults are used for data import and export (e.g., ‘stringsAsFactors=FALSE’), web-based import is natively supported (including from SSL/HTTPS), compressed files can be read directly without explicit decompression, and fast import packages are used where appropriate. An additional convenience function, ‘convert()’, provides a simple method for converting between file types.
To get rio installed on my particular Redhat server, we needed two more RPMs:
yum install libcurl
yum install libcurl-devel
So now you might be thinking, “Wow, this was easy!”. Not so fast. It would have been nice if we could have used “package.install” to install rio but no such luck. The server returns an error when trying to connect to any CRAN mirror.
Instead, download the source code from CRAN and then install the source using R. Rio has quite a few dependencies. Just like in C programming, all the prerequisites have to be satisfied before you can get the final rio package to build. Here are all the package sources to download from CRAN; they’re listed in order of how they need to be installed to save you some time:
Rcpp_0.12.18.tar.gz zip_1.0.0.tar.gz openxlsx_4.1.0.tar.gz crayon_1.3.4.tar.gz assertthat_0.2.0.tar.gz cli_1.0.1.tar.gz rlang_0.2.2.tar.gz fansi_0.3.0.tar.gz utf8_1.1.4.tar.gz pillar_1.3.0.tar.gz tibble_1.4.2.tar.gz curl_3.2.tar.gz data.table_1.11.6.tar.gz magrittr_1.5.tar.gz forcats_0.3.0.tar.gz BH_1.66.0-1.tar.gz R6_2.2.2.tar.gz pkgconfig_2.0.2.tar.gz hms_0.4.2.tar.gz readr_1.1.1.tar.gz haven_1.1.2.tar.gz rematch_1.0.1.tar.gz cellranger_1.1.0.tar.gz readxl_1.1.0.tar.gz rio_0.5.10.tar.gz
To install each package from the source “tar.gz” files, run the installs as a regular non-root user:
install.packages(“Rcpp_0.12.18.tar.gz”, repos = NULL , type = “source”)
We are telling R:
Install this package: “Rcpp_0.12.18.tar.gz”
Do not use a mirror: repos = NULL
The code is source code: type = “source”
For example, I placed all of my source files in: /home/auser/bin/R
install.packages(“/home/auser/bin/R/rio_0.5.10.tar.gz”, repos = NULL , type = “source”)
Repeat this for all 25 sources. Resolve any dependency errors that might appear. The RPMs on your system may be different from the ones I have.
Test the rio installation in RedHat Linux. If everything went all right with the package installation, take a few test rows like these and add them to Excel:IDFullNameHometownInterest1Mouse, MickeyTugboatMinnie2Duck, DonaldSwim PondBugs3Mouse, MinnieDisney StudiosMickey4Bunny, BugsRabbit HoleCarrots
Save this as an XLS file and move it to the Linux server: “myfile.xlsx.”
Let’s convert it to CSV using “Rscript” (command line R):
Rscript -e “rio::convert(‘myfile.xlsx’,’myfile.csv’)”
[riotest]$ ls -1
Now, let’s look at our CSV file:
[riotest]$ cat *.csv
2,”Duck, Donald”,Swim Pond,Bugs
3,”Mouse, Minnie”,Disney Studios,Mickey
4,”Bunny, Bugs”,Rabbit Hole,Carrots
Looks perfect. Now we can call “sqlldr” to load the data to an Oracle table.
In our next installment, we will discuss the rio package in more detail since it is such a powerful tool to have on hand. The rio package is a great tool to convert data from one format to another format.
Additional blog posts on more complex R concepts to follow; please contact firstname.lastname@example.org if you have any questions or need further help!