The Automation of GitHub Processes in RStudio

The GitHub Process on RStudio

GitHub is a vital tool for any Data Scientist, and is incredibly useful when properly integrated into the RStudio Integrated Development Environment (IDE). By default, RStudio lets you manage code with Git through user-friendly Stage, Commit, Push, Pull, and History buttons. This workflow is deliberate: it walks the user through the Git process so that appropriate Change Management, Code Merging, and Version Control practices are followed.

However, for a user who already knows these processes and the industry best practices well, this workflow can feel somewhat cumbersome. For example, assume that the user has correctly installed Git, enabled Git in the RStudio IDE, and pulled an appropriate Repo for development. Once the developer has made the necessary changes to a document, they need to follow these steps:

  1. Press the Commit button on the Git window (usually found in the Environment Pane on the top-right of the RStudio IDE). This will open the Review Changes window.
  2. Type in the details of the changes made in the Commit message box.
  3. Press the Commit button.
  4. Wait for the Commit process to complete.
  5. Press the Push button.
  6. Wait for the Push process to complete.
  7. Close the Review Changes window.

This process typically takes around 15 seconds, but it can take several minutes if large data files need to be uploaded during the Push. This vignette outlines how to automate the process and reduce the time the developer spends on it.

Introducing the git2r Package

The git2r package (written by numerous authors) provides an “interface to the libgit2 library, which is a pure C implementation of the Git core methods”, and “provides access to Git repositories to extract data and running some basic Git commands” (as detailed on CRAN).

Usage of the git2r Package

The basic usage of git2r follows these key steps:

  1. Install the package (only needs to be done the first time it is used);
  2. Load the package (only needs to be done when re-opening an RStudio Workspace);
  3. Stage the changes;
  4. Commit the changes;
  5. Pull the changes; and
  6. Push the changes.

1. Install and Load git2r package

This step does not need to be completed every time the files need to be committed and pushed to Git.

# Install and load the git2r package
install.packages("git2r")
library(git2r)
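Because installation is only needed once, it can be skipped when the package is already present. The following is a minimal sketch of that check; the helper names missing_packages() and install_if_missing() are hypothetical and are not part of git2r.

```r
# Hypothetical helper: return the subset of `packages` that is not yet installed.
missing_packages <- function(packages) {
  installed <- rownames(installed.packages())
  packages[!packages %in% installed]
}

# Hypothetical helper: install only what is missing, and return what was installed.
install_if_missing <- function(packages) {
  to_install <- missing_packages(packages)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Base packages are always present, so nothing is reported as missing here.
print(missing_packages(c("stats", "utils")))
```

In this vignette's setting, calling install_if_missing("git2r") before library(git2r) would replace the unconditional install.packages() call.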

Note: This Vignette assumes the usage of the GitHub.R file, which is saved in the same directory as this Vignette.

# Set up the file to be used in the below command.
FileConnection <- file("GitHub.R")
writeLines( paste0("#This is a test script. Run at: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
          , FileConnection
          )
close(FileConnection)
rm(FileConnection)

2. Stage the Changes

The add() function takes the repo path as its first argument, and the filename path as the second argument, then stages that file to be ready for the Commit command.

Since the repo is in the same location as this script file, the getwd() function can be used to get the path of the current folder.

# Add the file
add( repo = getwd()
   , path = "GitHub.R"
   )

3. Commit the Changes

The commit() function takes the repo as its first argument and the commit message as its second argument, then commits the staged changes so that they are ready for the Push command.

# Commit the file
commit( repo = getwd()
      , message = paste0("Update as at: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S"))
      )
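To try the Stage and Commit steps without touching a real project, they can be run against a throwaway repository. This is a minimal sketch, assuming git2r is installed; the temporary directory and the author name and email are placeholders chosen for illustration.

```r
library(git2r)

# Create a throwaway repository in a fresh temporary directory.
repo_path <- tempfile("git2r-demo-")
dir.create(repo_path)
repo <- init(repo_path)

# A commit needs an author; these values are placeholders.
config(repo, user.name = "Example User", user.email = "user@example.com")

# Write a file, stage it, then commit it with a timestamped message.
writeLines("# This is a test script.", file.path(repo_path, "GitHub.R"))
add(repo, "GitHub.R")
commit(repo, message = paste0("Update as at: ", format(Sys.time(), "%Y-%m-%d %H:%M:%S")))

# Inspect the result: the commit history and the current status.
print(commits(repo))
print(status(repo))
```

Nothing here contacts GitHub, so no credentials are involved at this stage.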

4. Pull the Updates

It would be safe to assume that you are not the only collaborator working on this file. Therefore, after this file is committed, and before it is pushed to the Repo, Git needs to check that there are no discrepancies between the Committed file, and the Master branch. In order to validate, Git must pull the changes from the Master branch to the local Repo.

Note the following caveats:

  1. If this is a forked branch and you are the only contributor, then Git will always return the message "Already up to date.".
  2. If this branch is part of a Repository that multiple people are contributing to, but no one has made any changes to the Master Branch, then Git will always return the message "Already up to date.".
  3. If this branch is part of a Repository that multiple people are contributing to, and the Master Branch has been updated by someone else, and you try to pull the changes to your local repo, then Git will return the message "CONFLICT (content): Merge conflict in GitHub.R \n Automatic merge failed; fix conflicts and then commit the result.". When this happens, follow the process outlined in section 4.1 below.

NOTE: Be careful about supplying your credentials!

  • If you use the RStudio IDE process, you WILL NOT need to supply your credentials to access Git (because they are already stored by the Authentication used by RStudio);
  • If you use this git2r process, you WILL need to supply your credentials to access Git. Be careful about doing this! If you use a Shared Repo, then EVERYONE ELSE can see your credentials also. If you use a Public Repo, then YOUR CREDENTIALS ARE EXPOSED TO THE PUBLIC!

# Pull the Repo again
pull( repo = getwd()
    , credentials = cred_user_pass( username = "YourUserName" # BE CAREFUL!!
                                  , password = "YourPassword" # BE CAREFUL!!
                                  ) # NEVER EVER PUSH YOUR CREDENTIALS TO ANY REPOSITORY!!!!
    )
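One way to keep credentials out of the script is to read them from environment variables. The sketch below is an assumption-laden illustration, not a required setup: the variable names GITHUB_USERNAME and GITHUB_PAT and the helper require_env_vars() are hypothetical, while cred_env() is the git2r function that takes the names of the environment variables holding the username and password. Note that GitHub no longer accepts account passwords for Git operations over HTTPS, so the password value would be a Personal Access Token.

```r
# Hypothetical helper: stop early if any named environment variable is unset or empty.
require_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  missing <- vars[!nzchar(values)]
  if (length(missing) > 0) {
    stop("Missing environment variable(s): ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Placeholder values for demonstration only. In practice these would be set
# outside the repository (for example in ~/.Renviron), never in a tracked script.
Sys.setenv(GITHUB_USERNAME = "YourUserName", GITHUB_PAT = "YourToken")
require_env_vars(c("GITHUB_USERNAME", "GITHUB_PAT"))

# cred_env() stores the NAMES of the variables, not their values.
if (requireNamespace("git2r", quietly = TRUE)) {
  credentials <- git2r::cred_env("GITHUB_USERNAME", "GITHUB_PAT")
}
```

The resulting credentials object could then be passed to pull() or push() through the credentials argument in place of cred_user_pass().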

4.1 If necessary, merge the changes.

If the Master branch is ahead of your local branch, then you will need to merge the updated Master Branch into your local Repo. The pull() function attempts this merge automatically. Where it cannot, you will need to review the conflicting changes manually on your local Repo by following the steps below:

1. The message "fix conflicts and then commit the result" means that the changes from the Master branch have been merged with the Committed file on your local Repository, and that these merged differences need to be reviewed before they can be successfully pushed to the Master branch.

2. To resolve, this will require manual intervention:

2.1. Open the Review Changes window on the Git Panel (press the Commit button to open);

2.2. Either choose to accept the changes from the Master Branch by pressing the Staged CheckBox;

2.3. Or choose not to accept the changes from the Master Branch by NOT pressing the Staged CheckBox;

2.4. Press the Commit button to re-Commit the changes to your local Repo.

3. Once the discrepancies have been resolved, press the Push button to Push the changes to the Repository, or follow the next command to do so automatically.
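Before re-committing, it can help to confirm that no conflict markers were left behind in the file. Git marks conflicting regions with lines starting with "<<<<<<<" and ">>>>>>>". The following is a small sketch of such a check in base R; the helper name has_conflict_markers() is hypothetical.

```r
# Hypothetical helper: TRUE if the file still contains Git conflict markers.
has_conflict_markers <- function(path) {
  lines <- readLines(path, warn = FALSE)
  any(grepl("^(<<<<<<<|>>>>>>>)( |$)", lines))
}

# Demonstration on a temporary file that imitates an unresolved conflict.
demo_file <- tempfile(fileext = ".R")
writeLines(
  c("<<<<<<< HEAD", "x <- 1", "=======", "x <- 2", ">>>>>>> origin/master"),
  demo_file
)
print(has_conflict_markers(demo_file)) # TRUE

# After resolving the conflict by keeping one version, the check passes.
writeLines("x <- 2", demo_file)
print(has_conflict_markers(demo_file)) # FALSE
```

For the file used in this vignette, the check would be has_conflict_markers("GitHub.R").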

5. Push the Updates.

The push() function will take your committed changes, and push the new changes to the online Repository. Like with the pull() function, this will also require your credentials in order for the process to be successful. Be careful about providing your credentials to any script that will be housed on the Internet!

# Push to the Repo
push( object = getwd()
    , credentials = cred_user_pass( username = "YourUserName" # BE CAREFUL!!
                                  , password = "YourPassword" # BE CAREFUL!!
                                  ) # NEVER EVER PUSH YOUR CREDENTIALS TO ANY REPOSITORY!!!!
    )

Conclusion.

Git is an incredibly useful tool, and is an industry standard for Data Scientists. The RStudio IDE provides a very user-friendly way of interacting with GitHub, which is simple and easy to follow. However, it is also possible to interact with GitHub automatically by using scripts and the functions found in the git2r package. Wrapping this functionality in an UpdateGitHub() function is fairly straightforward, and can streamline your workflow. Always be cautious with your credentials, so that you never push usernames and passwords to GitHub repositories.

Post Script.

Provided below is a script for updating GitHub, which is a custom UpdateGitHub() function wrapped around the git2r functions.

First, save the first script below in the location: paste0(getwd(),"/","UpdateGitHub.R").

Then, execute the second script in the Console.

#### This process will be defined in a function ####
#### It is saved in the location: paste0(getwd(),"/","UpdateGitHub.R") ####
UpdateGitHub <- function(repo=getwd(), untracked=TRUE, stage=TRUE, commit=TRUE, pull=TRUE, push=TRUE) {

  #### Input: ####
  # - 'repo' must be an atomic string value which is a valid directory that contains the files for this repository. It will be validated by the rprojroot package.
  # - 'untracked' must be an atomic logical value which is used for determining whether or not to process the untracked files.
  # - 'stage' must be an atomic logical value which is used for determining whether or not to stage the files.
  # - 'commit' must be an atomic logical value which is used for determining whether or not to commit the files.
  # - 'pull' must be an atomic logical value which is used for determining whether or not to pull from the repo.
  # - 'push' must be an atomic logical value which is used for determining whether or not to push to the repo.

  #### Output: ####
  # - Will print updates from the different stages.

  #### Validate parameters: ####
  stopifnot(is.character(repo))
  stopifnot(dir.exists(repo))
  stopifnot(is.atomic(repo))
  stopifnot(is.atomic(untracked))
  stopifnot(is.atomic(stage))
  stopifnot(is.atomic(commit))
  stopifnot(is.atomic(pull))
  stopifnot(is.atomic(push))
  stopifnot(is.logical(untracked))
  stopifnot(is.logical(stage))
  stopifnot(is.logical(commit))
  stopifnot(is.logical(pull))
  stopifnot(is.logical(push))

  #### Loop through the required packages. If not installed, then install it. If not loaded, then load it. ####
  packages <- c("git2r","rprojroot")
  for (package in packages) {
    # Compare against the package names only (the row names), not every cell of the installed.packages() matrix.
    if (!package %in% rownames(installed.packages())) {
      install.packages( package
                      , quiet = TRUE
                      , verbose = FALSE
                      , dependencies = TRUE
                      )
    }
    if (!package %in% .packages()) {
      suppressPackageStartupMessages(
        suppressWarnings(
          suppressMessages(
            library( package
                   , character.only = TRUE
                   , quietly = TRUE
                   , warn.conflicts = FALSE
                   , verbose = FALSE
                   )
          )
        )
      )
    }
  }

  #### Check the Project Root directory. This is to ensure that the entire repo is captured. ####
  if (getwd() == find_rstudio_root_file()) {
    repo <- getwd()
  } else {
    repo <- find_rstudio_root_file()
  }

  #### Check if there is anything to do. ####
  if (is.null(unlist(status()))) {
    return (writeLines(paste0("There is nothing to do.")))
  }

  #### Set credentials. This will require input in the Console. ####
  # NOTE: GitHub no longer accepts account passwords for Git operations over HTTPS, so enter a Personal Access Token at the password prompt.
  username <- readline(prompt = "Please enter your GitHub Username: ") #ALWAYS BE CAREFUL ABOUT STORING YOUR CREDENTIALS ON GITHUB!!
  password <- readline(prompt = "Please enter your GitHub Password or Personal Access Token: ") #ALWAYS BE CAREFUL ABOUT STORING YOUR CREDENTIALS ON GITHUB!!
  credentials <- cred_user_pass(username = username, password = password)

  #### NOTE: values returned from the status() command are as follows: ####
  # 1. "untracked" means new files which have not yet been added to GitHub.
  # 2. "unstaged" means existing files which have been modified but not yet ready to be committed to GitHub.
  # 3. "staged" means files that are staged and ready to be committed.

  #### Check the Untracked items. The intention is to just push these straight through. So this will Add, Commit, then Push them. ####
  if (untracked == TRUE) {
    num <- length(unlist(status()["untracked"]))
    if (num > 0) {
      writeLines(paste0("There are ", num, " Untracked items to be processed."))
      for (i in seq_len(num)) {
        writeLines(paste0(" ", i, ": ", unlist(status()["untracked"])[i]))
      }
      add(repo, unlist(status()["untracked"]))
      writeLines(paste0("Items have been Staged."))
      commit(message = paste(Sys.time(), "Initial commit", sep = " - "))
      writeLines(paste0("Items have been Committed."))
      push(credentials = credentials)
      writeLines(paste0("Items have been Pushed."))
    }
  }

  #### Process the Unstaged items. Add them. ####
  if (stage == TRUE) {
    num <- length(unlist(status()["unstaged"]))
    if (num > 0) {
      writeLines(paste0("There are ", num, " Tracked items to be processed."))
      for (i in seq_len(num)) {
        writeLines(paste0(" ", i, ": ", unlist(status()["unstaged"])[i]))
      }
    }
    if (!is.null(unlist(status()["unstaged"]))) {
      add(repo, unlist(status()["unstaged"]))
      num2 <- length(unlist(status()["unstaged"]))
      if (num2 == 0) {
        writeLines(paste0("Items have been Staged."))
      } else if (num == num2) {
        stop ("Something went wrong with the Staging.")
      }
    }
  }

  #### Process the Staged items. Commit them. ####
  if (commit == TRUE) {
    if (!is.null(unlist(status()["staged"]))) {
      # Count the staged items before committing, so that the check below does not rely on a counter from another step.
      num_staged <- length(unlist(status()["staged"]))
      commit(message = paste(Sys.time(), "Update", sep = " - ")) # Generic message, including timestamp.
      num2 <- length(unlist(status()["staged"]))
      if (num2 == 0) {
        writeLines(paste0("Items have been Committed."))
      } else if (num_staged == num2) {
        stop ("Something went wrong with Committing.")
      }
    }
  }

  #### Do the Pull step. ####
  # tryCatch is utilised because the error message when executing pull() or push() is not very helpful:
  # "too many redirects or authentication replays". The main issue is usually that the credentials are incorrect or missing.
  if (pull == TRUE) {
    # The result is stored under its own name, so that the 'pull' argument is not overwritten.
    pull_result <- tryCatch(
      expr = {
        pull(credentials = credentials)
      },
      error = function (err) {
        message (paste0("Error when Pulling from GitHub. Try checking your credentials and try again.","\n","Message thrown: "))
        stop (err)
      },
      warning = function (war) {
        message ("There was a Warning when Pulling from GitHub.")
        return (war)
      }
    )
    # isTRUE() also copes with the case where a warning object was returned instead of a pull result.
    if (isTRUE(pull_result[["up_to_date"]])) {
      writeLines(paste0("There are no discrepancies with the Master branch."))
    } else {
      stop ("Something went wrong with pulling the repo. Please manually check, merge the code, validate discrepancies, then re-try.")
    }
  }

  #### Process the Committed items. Push them. ####
  # The push no longer depends on the 'num' counter, which is undefined when both 'untracked' and 'stage' are FALSE.
  if (push == TRUE) {
    tryCatch(
      expr = {
        push(credentials = credentials)
      },
      error = function(err) {
        message (paste0("Error when Pushing to GitHub. Try checking your credentials and try again.","\n","Message thrown: "))
        stop (err)
      },
      warning = function (war) {
        message ("There was a Warning when Pushing to GitHub.")
        return (war)
      }
    )
    num2 <- length(unlist(status()))
    if (num2 == 0) {
      writeLines(paste0("Items have been Pushed."))
    }
  }

  return(writeLines(paste0("Successfully updated.")))

}
# Run the function.
UpdateGitHub()

Run this last step in the Console.

# Run this script in the Console
source(paste0(getwd(),"/","UpdateGitHub.R"))
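
The parameter checks in UpdateGitHub() accept any logical vector, including NA or a vector longer than one, which would then fail later inside an if() condition. As a hedged suggestion, a stricter check could look like the sketch below; the helper name is_flag() is hypothetical and is not part of the function above.

```r
# Hypothetical helper: TRUE only for a single, non-missing logical value.
is_flag <- function(x) {
  is.logical(x) && length(x) == 1L && !is.na(x)
}

print(is_flag(TRUE))           # TRUE
print(is_flag(NA))             # FALSE
print(is_flag(c(TRUE, FALSE))) # FALSE
print(is_flag("TRUE"))         # FALSE
```

Inside the function, stopifnot(is_flag(push)) would then replace the pair of is.atomic() and is.logical() checks for that argument.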

Also published on RPubs: https://rpubs.com/chrimaho/GitHubAutomation

I’m a keen Data Scientist and Business Leader, interested in Innovation, Digitisation, Best Practice & Personal Development. Check me out: chrimaho.com

Chris Mahoney