How to automate R reports on Linux with cron jobs

Linux R

11 November, 2021



You can find the code that we work through in this example here.

The motivation might be something like this: You have some data that is updated at regular intervals (say, daily). The data is held in some database, or perhaps it's an Excel file on SharePoint, or someone manually updates a Google Sheet every day. You've written some data processing pipeline in R that pulls the data; transforms it; creates some sort of report (maybe a PDF or HTML file in R Markdown, or perhaps you're just outputting cleaned data files); and then uploads the report to some server, or attaches it in an email to the client.

Since you're a programmer (or an analyst or a data scientist or whatever you want to call yourself), you're perfectly happy spending hours on end tinkering with something in order to save yourself a few minutes of time down the track. So, you want to automate the whole thing. That's where cron jobs on Linux can help.

This whole process is fairly straightforward, but there are definitely a few details that can catch you out. Before we start, it would be silly not to mention the very useful R packages cronR (for Unix) and taskscheduleR (for Windows). Instead of using these packages, we're going to be creating cron jobs manually. Either method is fine; I just like the idea of becoming more acquainted with cron, because that'll make automating any other task on Linux much easier. And we can pretend to be sysadmins.

Setting up

If you're reading this blog post you're probably on a Linux machine, so open the terminal and paste in the following:

mkdir ~/cron-reporting-example
cd ~/cron-reporting-example
mkdir data && mkdir docs && mkdir output && mkdir scripts
echo "write.csv(iris, 'output/new_iris.csv')" > scripts/new_iris.R
touch master.R
rstudio

This creates a project directory cron-reporting-example in your home path; a bunch of project-related folders that I tend to use for everything; a minimal example script new_iris.R that writes the iris dataset into an output folder; and a master script that will source the iris-writing script. We then open RStudio. I'm assuming you have RStudio installed and it's run by rstudio; if you're on a server then do what you gotta do. Finally, create a new R project file for that directory.

Basic logging

Let's open up the R script master.R. The first thing we're going to add is some logging! It's easy to forget this basic step, but you'll be glad you set it up, trust me.

We're going to make our cron job gather up any messages printed to the R console and output them to a log file. So, anything we print will be captured. I prefer to use cat rather than print for this, as the latter will wrap everything in quotes. We can start with something like this:

cat(paste0(Sys.time(), " Starting cron job...\n"))

Pretty simple: print the date and time, along with a brief message. The \n at the end is a newline; cat doesn't add one for you, so without it consecutive messages would run together on one line of the log. I like to add messages like this before every new piece of the puzzle. For example, I might have a message that says we're about to source the script that downloads data; a message that says we're about to source the R Markdown script that produces the report; and so on. Doesn't have to be too technical, just little markers along the way so that if something goes wrong, you have a better idea of where that might have happened.

Sourcing scripts

We can source a script in the usual way. Add this to your master.R script:

cat(paste0(Sys.time(), " Sourcing scripts/new_iris.R...\n"))
source("scripts/new_iris.R")
cat(paste0(Sys.time(), " Finished running scripts/new_iris.R.\n"))

If we now call Rscript master.R in the terminal, we'll see the log messages printed, and there should be a new_iris.csv file in the output folder.

One final thing to note: cron jobs run from your home directory (~), not from the project directory. This means that file paths written relative to the R project will not resolve, so we need to set the working directory explicitly. To do this, I check whether the script is being run interactively, and set the directory if it isn't:

if (!interactive()) setwd("~/cron-reporting-example")

This goes at the top of master.R.
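
An alternative, if you'd rather not hard-code the path inside master.R, is to change directory in the shell before Rscript is called. This is only a sketch of that idea, not what the rest of this post uses, and run_in_dir is a hypothetical helper name rather than anything cron or R provides:

```shell
# run_in_dir is a hypothetical helper, not part of cron or R.
# It runs a command from inside the given directory. The parentheses
# start a subshell, so the caller's own working directory is unchanged.
run_in_dir() {
  dir=$1
  shift
  ( cd "$dir" && "$@" )
}

# Example with a harmless command: prints the directory the command ran in.
run_in_dir / pwd
```

With the project from this post, the call would be run_in_dir "$HOME/cron-reporting-example" Rscript master.R. The same effect can be had by writing cd ~/cron-reporting-example && Rscript master.R directly in the crontab entry. The setwd approach above works just as well; it's a matter of where you prefer the path to live.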

Basic crontab

This post assumes that you know what a crontab is; I highly recommend this post if you're interested in a comprehensive introduction. And for help with cron schedule expressions, crontab guru is amazing.

So let's edit our crontab by running crontab -e in the terminal. If you're a peasant like me and you use nano, the crontab will open in that editor.

There are a few key things we need to include in our cron expression. Let's go through them one by one.

So this is what our final crontab will look like:

* * * * * Rscript ~/cron-reporting-example/master.R >> ~/cron-reporting-example/master.log 2>&1

Paste that into your crontab, save and close it, and... wait one minute. Take a look at the modified time of output/new_iris.csv; if everything is working, it should update every minute! You should also have a log file at ~/cron-reporting-example/master.log, which will gain a fresh set of the timestamped messages from master.R every minute.
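
The two redirections at the end of that cron line are what fill the log. >> appends standard output to the file rather than overwriting it, and 2>&1 sends standard error (where R writes its messages, warnings and errors) to the same place. Here is a small sketch of that behaviour that doesn't involve R or cron at all; noisy and the temporary file are stand-ins invented for the demonstration:

```shell
# Stand-in for a script that writes to both stdout and stderr.
noisy() {
  echo "normal output"
  echo "an error message" >&2
}

# Temporary file playing the role of master.log.
log_file=$(mktemp)

# >> appends instead of overwriting; 2>&1 points stderr at the same file.
# Running it twice mimics two consecutive cron runs.
noisy >> "$log_file" 2>&1
noisy >> "$log_file" 2>&1

cat "$log_file"
line_count=$(grep -c "" "$log_file")   # count the lines in the log
rm -f "$log_file"
```

After two runs the file holds four lines, two from each run, which is why master.log keeps growing for as long as the cron job is active. The order of the redirections matters: 2>&1 has to come after >> so that standard error follows standard output into the file.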

Okay! At this point I'd recommend removing that line from your crontab, so that you aren't continually calling a pointless R script (and appending useless information to a log file) every minute...
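
One caveat when cleaning up: crontab -r removes your entire crontab, not just one entry. If you have other jobs scheduled, a gentler option is to filter out the single line. The sketch below assumes your cron implementation accepts crontab - to read a new crontab from standard input (common, but worth checking in man crontab). drop_job is a hypothetical helper, and the backup.sh entry is made up purely to have a second line in the sample:

```shell
# drop_job is a hypothetical helper: it copies stdin to stdout,
# leaving out any line that contains the given fixed string.
drop_job() {
  # grep exits non-zero when no lines remain; "|| true" treats that as fine.
  grep -vF -- "$1" || true
}

# Demonstration on sample text rather than a real crontab.
# The backup.sh entry is invented for illustration.
sample='0 6 * * * /usr/bin/backup.sh
* * * * * Rscript ~/cron-reporting-example/master.R >> ~/cron-reporting-example/master.log 2>&1'

remaining=$(printf '%s\n' "$sample" | drop_job 'cron-reporting-example/master.R')
echo "$remaining"
```

Against a real crontab this would be crontab -l | drop_job 'cron-reporting-example/master.R' | crontab -. It's worth saving a copy first with crontab -l > crontab.bak, and running crontab -l afterwards to confirm what is left. Of course, opening crontab -e and deleting the line by hand does the same job.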

Next steps

The next thing you're probably wondering is how to wire this up with an R Markdown script. That's also pretty straightforward, but there are some surprises in how the file paths work when you're rendering a .Rmd file from an R script... But we'll leave that for a future update.


