Version control systems (VCS) manage changes in scripts, documents, binary files…
From the git book
Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later
git is a VCS, one of many (svn, mercurial…)
From the git book
Git thinks of its data more like a series of snapshots of a miniature filesystem. With Git, every time you commit, or save the state of your project, Git basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. To be efficient, if files have not changed, Git doesn’t store the file again, just a link to the previous identical file it has already stored. Git thinks about its data more like a stream of snapshots.
Everything I’ll explain in this course is covered in more depth, if in a slightly more hardcore way, in the Git Book.
If you have any problem or doubt using git, always check the book first!
Chances are that you already have git installed on your system; just type git --version in your terminal to check (git -v also works on recent versions).
If it is not installed:
On Windows, follow the instructions at https://git-scm.com/download/win
This will install three utilities (Git GUI, Git Bash and Git CMD) and will also allow using git inside RStudio or VSCode.
It is always a good idea to keep the data, scripts and documents for the same task organized. Folders, RStudio projects, git projects, or a mix of them, help us with this and make some healthy decisions mandatory.
A git project is nothing more than a folder that is monitored for changes.
So the first thing to do is create a folder to work in throughout this course.
In my case I’m going to create a folder called project_iris.
Working in pairs
At some point in the course we will be working in pairs, so it will come in handy to have a unique project name, e.g. project_iris_INITIALS.
git project
Once we are in our folder, we start the git project with the git init command:
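As a minimal sketch, initializing a project could look like the following. The temporary folder created with mktemp is only an assumption to keep the example self-contained; in the course you would run git init inside your own project_iris folder.

```shell
set -e

# Hypothetical throwaway location, used only to keep the example self-contained
workdir="$(mktemp -d)"
mkdir "$workdir/project_iris"
cd "$workdir/project_iris"

# Turn the folder into a git project
git init

# git keeps everything it records in a hidden .git folder
ls -a
```

After this, any git command run inside the folder applies to this project.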
git project
There are other ways…
Start a project with RStudio with the version control option checked
Start a project with VSCode with the git option checked
Clone an existing repository (more on this later…)
git workflow
We have our fresh git project just initialized, so let’s check the state of the repository with git status:
$ git status
On branch main
No commits yet
nothing to commit (create/copy files and use "git add" to track)
Basically, that’s it. We are now using git to track changes, so let’s make some changes:
Create an empty file called main_script.R
Open the file and modify with the following:
We have made changes, so let’s see if git is tracking them correctly:
$ git status
On branch main
No commits yet
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        main_script.R
nothing added to commit but untracked files present (use "git add" to track)
Let’s see how files are tracked:
We can start tracking files with git add (and check again with git status)
$ git add main_script.R
$ git status
On branch main
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   main_script.R
For now we know that a new file has been created and staged (it is tracked), but no history of changes (commits) has started. For this, we need to commit the changes with git commit:
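The add and commit steps can be sketched as follows. The throwaway folder, the placeholder identity and the first line of the script (iris_data <- iris) are assumptions made for illustration; the commit message is the one that appears in the course log.

```shell
set -e

# Hypothetical throwaway project, so the example does not touch real work
repo="$(mktemp -d)/project_iris"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity

# Assumed first version of the script
echo 'iris_data <- iris' > main_script.R

# Stage the file, then record the snapshot with a message
git add main_script.R
git commit -m "created main_script.R file, loading data"

# The working tree is clean again
git status
```

The -m flag supplies the commit message directly; without it, git opens a text editor to ask for one.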
We can check the status again:
Now we are going to modify our main script, to load some libraries and save it.
/data/iris_project/main_script.R
library(dplyr)
library(ggplot2)
iris_data <- iris
Check the new status.
$ git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   main_script.R
no changes added to commit (use "git add" and/or "git commit -a")
We can see the differences the changes have made with git diff
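A small sketch of what git diff reports, assuming a throwaway project with one commit and the same modification to main_script.R described above (the folder and identity are placeholders):

```shell
set -e

# Hypothetical throwaway project with a first commit
repo="$(mktemp -d)/project_iris"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"

# Modify the script: load two libraries before the data
printf 'library(dplyr)\nlibrary(ggplot2)\niris_data <- iris\n' > main_script.R

# Lines starting with + were added, lines starting with - were removed
git --no-pager diff
```

Without arguments, git diff compares the working files with what is staged, so it only shows changes that have not been added yet.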
Remember:
So we need to add and commit the new changes. For modified files we can do this with a shortcut: instead of git add FILE followed by git commit -m "MESSAGE", we can do it in one step with git commit -a -m "MESSAGE":
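The shortcut could look like this in practice. This is a sketch on a throwaway project with placeholder identity; the commit messages are borrowed from the course log.

```shell
set -e

# Hypothetical throwaway project with a first commit
repo="$(mktemp -d)/project_iris"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"

# Modify the already tracked file
printf 'library(dplyr)\nlibrary(ggplot2)\niris_data <- iris\n' > main_script.R

# Stage every modified tracked file and commit in one step
git commit -a -m "Added needed libraries to main_script.R"

git status
```

Note that -a only picks up files git is already tracking; brand new files still need an explicit git add.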
We can see the timeline of changes at any time with git log:
$ git log
commit 8687941e8cdf7819cf3c6c84b5c8a62ee7ab487e (HEAD -> main)
Author: MalditoBarbudo <victorgrandagarcia@gmail.com>
Date: Wed Jan 31 14:54:08 2024 +0100
Added needed libraries to main_script.R`
commit f484f687fb22990b1f3355b963814b0628748448
Author: MalditoBarbudo <victorgrandagarcia@gmail.com>
Date: Wed Jan 31 14:14:56 2024 +0100
created main_script.R file, loading data
We can simplify the log to see more clearly:
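One way to get a compact log is with the --oneline, --decorate and --graph flags. The sketch below builds a throwaway two-commit project (placeholder folder and identity) just to have something to show.

```shell
set -e

# Hypothetical throwaway project with two commits
repo="$(mktemp -d)/project_iris"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"
printf 'library(dplyr)\nlibrary(ggplot2)\niris_data <- iris\n' > main_script.R
git commit -q -a -m "Added needed libraries to main_script.R"

# One line per commit: short hash, branch labels and message
git --no-pager log --oneline --decorate --graph
```

--oneline shortens each commit to its abbreviated hash and message, --decorate adds branch and tag names, and --graph draws the branching structure on the left.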
git projects
Let’s get a git project from the EMF GitHub to be able to work with it. For this we will use git clone:
$ git clone https://github.com/emf-creaf/meteospain.git
Cloning into 'meteospain'...
remote: Enumerating objects: 2823, done.
remote: Counting objects: 100% (625/625), done.
remote: Compressing objects: 100% (341/341), done.
remote: Total 2823 (delta 394), reused 488 (delta 275), pack-reused 2198
Receiving objects: 100% (2823/2823), 23.68 MiB | 19.81 MiB/s, done.
Resolving deltas: 100% (1829/1829), done.
Tip
The git clone command must be run in the parent folder of the final destination of the git project (/data in my case).
As we cloned the latest code available on GitHub, the status is clean:
But the log is now a little more crowded than before:
$ git log --oneline --decorate --graph
* 63972c1 (HEAD -> main, origin/main, origin/HEAD) updated README
* 16e583f (tag: v0.1.4) updated cran-comments
* 920d094 updated version and news
* a9bdad1 (origin/devel) fixed global variables call
* 81e2448 Changes in RIA vignette
* aab768f changed style in util
* 184d687 correct version number in description and news, added Rubén F Casal to contributors
* 77a92e8 Merge pull request #21 from rubenfcasal/ria_coord
|\
| * 53051a8 Fix bug in RIA coordinates (#20)
|/
* e93dab4 updated readme with correct badge
....
The first step is to create an account on the GitHub page.
Tip
If possible, use the same email address as the one used to configure git; this simplifies things.
You have probably created a password to access your GitHub account but, for security reasons, that password cannot be used when synchronizing your git projects with GitHub. For that we need to create a Personal Access Token (PAT) to use as our credentials.
To create a PAT, we need to go to the GitHub tokens page and follow the instructions under Generate new token.
TL;DR
In summary, select an expiration period and the scopes needed (usually repo, user and workflow), and click Generate token at the bottom.
Important
Copy the generated token to the clipboard, and don’t close the browser window where it is shown. The PAT will not be accessible again from the GitHub page, so we must make sure we don’t lose it accidentally.
We are going to create a new folder in our project, called raw_data, and copy the iris.csv file from the course materials into it.
Also, we are going to create an empty text file called .gitignore
Passive-aggressive warning
.gitignore has a . at the beginning of the file name. This is very important; it is not a typo.
Please, add the . when creating the file!
.gitignore (continued)
After what we have done in the interlude, if we check the status of our project, we will see two new untracked elements:
Before anything else, we are going to modify the .gitignore file to add the following:
/data/iris_project/.gitignore
raw_data
and check the status again
.gitignore and sensitive data
Use .gitignore to keep git from tracking files and folders in your git project that are not useful to track, are sensitive (passwords, environment variable files…) or that you just don’t want to be publicly available when using GitHub.
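The effect of .gitignore can be checked with a short sketch. The throwaway folder and the one-line stand-in for iris.csv are assumptions; git check-ignore is a standard git command that reports which rule, if any, hides a path.

```shell
set -e

# Hypothetical throwaway project
repo="$(mktemp -d)/iris_project"
mkdir -p "$repo"
cd "$repo"
git init -q

# Stand-in for the course data file (placeholder contents)
mkdir raw_data
echo 'Sepal.Length,Species' > raw_data/iris.csv

# Tell git to ignore the whole raw_data folder
echo 'raw_data' > .gitignore

# Only .gitignore is listed as untracked; raw_data is hidden from git
git status --short

# Show which rule is responsible for ignoring the file
git check-ignore -v raw_data/iris.csv
```

Keep in mind that .gitignore only affects files that are not tracked yet; a file that was already committed stays tracked until it is removed from the index with git rm --cached.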
git and GitHub workflow
If you remember, the git workflow was like this:
But now, we need to add GitHub to the workflow:
So basically we need to push to GitHub (git push) and, on occasion, we’ll need to pull from GitHub (git pull).
First we need to create a GitHub repository (project) to sync with our git project. So we go there and create it.
We log in to GitHub and click the big green New button. There we type the name of the repository (the same as on our computer, iris_project) and a description, and click the Create repository button.
A window will appear with instructions for pushing from our computer for the first time; the bit we need is:
…or push an existing repository from the command line
git remote add origin https://github.com/MalditoBarbudo/iris_project.git
git branch -M main
git push -u origin main
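-u (short for --set-upstream) tells git to link our local main branch to the main branch on GitHub, so later pushes and pulls know where to go. It is only needed the first time we push a branch to GitHub.
Tip
This first git push will prompt us for our user name and our PAT. If everything is correctly configured, the credentials will be saved after this first time.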
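To try the first push without a GitHub account, here is a sketch that uses a local bare repository as a stand-in for GitHub. That stand-in, the temporary paths and the identity are all assumptions; with the real GitHub URL the commands are the same, plus the credentials prompt.

```shell
set -e
base="$(mktemp -d)"

# A local bare repository plays the role of the GitHub repository (assumption)
git init -q --bare "$base/remote.git"

# Local project with one commit on a branch called main
mkdir "$base/iris_project"
cd "$base/iris_project"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the first branch main
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"

# Register the remote under the name origin and push for the first time
git remote add origin "$base/remote.git"
git push -u origin main

# The first line shows the link created by -u: main...origin/main
git status -sb
```

Once the link exists, a plain git push or git pull on main is enough.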
As we already have our iris data in the raw_data folder, we can modify our main_script.R file:
But this time, we add a new step, for pushing to GitHub:
$ git push origin main
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 16 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 344 bytes | 344.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/MalditoBarbudo/iris_project.git
d5b1b3a..59b52c3 main -> main
We can go to our GitHub repository to confirm it is updated.
Sometimes we mess things up. It is human nature. But with git we can always go back in time (commits) and restore the state of our files to a specific commit.
This can be done with git reset. There are two flavours of reset:
git reset COMMIT_HASH_NUMBER will reset to the specified commit, but will keep all changes in our workspace.
git reset --hard COMMIT_HASH_NUMBER will reset to the specified commit, deleting all changes made since then.
Warning
Choose wisely when to use reset.
Changes discarded by git reset --hard are very hard to recover, and uncommitted ones are lost for good!
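The difference between the two flavours can be seen in a small sketch. It runs on a throwaway project with three commits (placeholder identity and file contents), and uses HEAD~1, meaning "one commit before the current one", instead of a commit hash purely for convenience.

```shell
set -e

# Hypothetical throwaway project with three commits
repo="$(mktemp -d)/iris_project"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "first version"
echo '# subsetting data' >> main_script.R
git commit -q -a -m "second version"
echo '# more subsetting' >> main_script.R
git commit -q -a -m "third version"

# Flavour 1: drop the last commit but keep its edits in the file
git reset HEAD~1
git status --short      # main_script.R is listed as modified

# Flavour 2: drop one more commit and throw away every pending edit
git reset --hard HEAD~1
cat main_script.R       # only the first version is left
```

After the first reset the work of the third commit is still in the file, ready to be committed again; after the second one it is gone from the working folder.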
First we look at the log to know the commit hash number to go back to:
$ git log --oneline --decorate --graph
* 59b52c3 (HEAD -> main, origin/main) added more subsetting
* d5b1b3a Added readr and started to subset the data in main script
* 6ce91f9 Added .gitignore file
* 8687941 Added needed libraries to main_script.R`
* f484f68 created main_script.R file, loading data
d5b1b3a is the target in my case (your hash will be different)
We go back the hard way:
$ git reset --hard d5b1b3a
HEAD is now at d5b1b3a Added readr and started to subset the data in main script
And the log:
And, of course, we need to push the reset to GitHub
Oops! Something is not working:
To https://github.com/MalditoBarbudo/iris_project.git
! [rejected] main -> main (non-fast-forward)
error: failed to push some refs to 'https://github.com/MalditoBarbudo/iris_project.git'
hint: Updates were rejected because the tip of your current branch is behind
hint: its remote counterpart. If you want to integrate the remote changes,
hint: use 'git pull' before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
We need to force the push to update the GitHub repository:
$ git push --force origin main
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/MalditoBarbudo/iris_project.git
+ 59b52c3...d5b1b3a main -> main (forced update)
And… done!
If you remember, at the beginning of the course we said that git works with snapshots (commits), storing only the relevant changes. So, right now, our git history looks like this:
Working with snapshots means that we can branch out at any point (commit) to start a different path.
For example, we have an idea for a very cool analysis, so we branch our code:
But, at the same time, we need to prepare the poster for a conference we are attending next week.
No time for finishing the cool analysis, but we have the boring normal one, so we continue with that one:
As we said, working with snapshots allows us to branch out, but we are also able to branch in, as we have all the information needed to merge the branches. This way we can have all the basic and the cool analysis stuff together:
To work with branches, we will use git checkout. This command allows us to create and change branches at any time.
The first thing is to create a new branch, called cool_analysis:
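Creating the branch and moving to it can be sketched like this, on a throwaway project with a placeholder identity. The -b flag of git checkout creates the branch before switching to it.

```shell
set -e

# Hypothetical throwaway project with one commit
repo="$(mktemp -d)/iris_project"
mkdir -p "$repo"
cd "$repo"
git init -q
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"

# Create the branch and switch to it in one step
git checkout -b cool_analysis

# List the branches; the current one is marked with *
git branch
```

Without -b, git checkout BRANCH_NAME simply switches to a branch that already exists.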
We can check the status and/or the log to see we are in a new branch:
$ git status
On branch cool_analysis
nothing to commit, working tree clean
$ git log --oneline --decorate --graph --all
* d5b1b3a (HEAD -> cool_analysis, origin/main, main) Added readr and started to subset the data in main script
* 6ce91f9 Added .gitignore file
* 8687941 Added needed libraries to main_script.R`
* f484f68 created main_script.R file, loading data
Now we make changes in our new branch, adding some code to main_script.R:
And we commit and push our new changes:
$ git commit -a -m 'started cool analysis with setosa'
$ git push origin cool_analysis
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 16 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 415 bytes | 415.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote:
remote: Create a pull request for 'cool_analysis' on GitHub by visiting:
remote: https://github.com/MalditoBarbudo/iris_project/pull/new/cool_analysis
remote:
To https://github.com/MalditoBarbudo/iris_project.git
 * [new branch]      cool_analysis -> cool_analysis
Let’s check the log now:
$ git log --oneline --decorate --graph --all
* 717d397 (HEAD -> cool_analysis, origin/cool_analysis) started cool analysis with setosa
* d5b1b3a (origin/main, main) Added readr and started to subset the data in main script
* 6ce91f9 Added .gitignore file
* 8687941 Added needed libraries to main_script.R`
* f484f68 created main_script.R file, loading data
We update main_script.R with:
/data/iris_project/main_script.R
library(dplyr)
library(ggplot2)
library(readr)
# reading data
iris_data <- read_csv('raw_data/iris.csv')
# subsetting data
setosa_subset <- iris_data |>
  filter(Species == "setosa")
# cool analysis
model_setosa <- lm(Sepal.Length ~ Petal.Length, data = setosa_subset)
# plotting
setosa_plot <- ggplot(setosa_subset, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point(aes(size = Petal.Width)) +
  stat_smooth(method = "lm") +
  theme_bw()
We have our cool analysis finished. But we need to prepare our poster with the main branch, so we change to it:
We prepare our poster. For that, in the main branch, we create a new file called poster.Rmd, add it to git, commit the changes and push to main at GitHub:
Let’s have a look at our log:
$ git log --oneline --decorate --graph --all
* 2da75fe (HEAD -> main, origin/main) finished poster for international meeting
| * 5ad9254 (origin/cool_analysis, cool_analysis) added plotting
| * 4f152cf added plotting of setosa
| * 717d397 started cool analysis with setosa
|/
* d5b1b3a Added readr and started to subset the data in main script
* 6ce91f9 Added .gitignore file
* 8687941 Added needed libraries to main_script.R`
* f484f68 created main_script.R file, loading data
Now that we are back from our conference (with the best poster award, of course!), we can merge both branches, as we want to add the cool analysis to the main branch and continue from there. For that we will use git merge.
First, we make sure we are in the main branch, and then we merge:
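The whole branch-and-merge cycle can be replayed in a sketch. The throwaway project, the placeholder identity and the file contents are assumptions; the branch name and commit messages follow the course log. The --no-edit flag accepts the default merge message instead of opening an editor.

```shell
set -e

# Hypothetical throwaway project with a first commit on main
repo="$(mktemp -d)/iris_project"
mkdir -p "$repo"
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # name the first branch main
git config user.name "Course Student"        # placeholder identity
git config user.email "student@example.com"  # placeholder identity
echo 'iris_data <- iris' > main_script.R
git add main_script.R
git commit -q -m "created main_script.R file, loading data"

# Work on the cool analysis in its own branch
git checkout -q -b cool_analysis
echo '# cool analysis' >> main_script.R
git commit -q -a -m "started cool analysis with setosa"

# Meanwhile, the poster is prepared in main
git checkout -q main
echo 'poster draft' > poster.Rmd
git add poster.Rmd
git commit -q -m "finished poster for international meeting"

# Being in main, bring the cool analysis in
git merge --no-edit cool_analysis

git --no-pager log --oneline --decorate --graph --all
```

Here the two branches touched different files, so git can merge them on its own; when both change the same lines, git stops and asks us to resolve the conflict first.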
And voilà, we have everything together nicely; we can check with the log:
$ git log --oneline --decorate --all --graph
* 677910e (HEAD -> main) Merge branch 'cool_analysis'
|\
| * 5ad9254 (origin/cool_analysis, cool_analysis) added plotting
| * 4f152cf added plotting of setosa
| * 717d397 started cool analysis with setosa
* | 2da75fe (origin/main) finished poster for international meeting
|/
* d5b1b3a Added readr and started to subset the data in main script
* 6ce91f9 Added .gitignore file
* 8687941 Added needed libraries to main_script.R`
* f484f68 created main_script.R file, loading data
Write descriptive commit messages; they will help you and other people find where to look more quickly if necessary
Find your own commit tempo. Big commits altering a lot of things in a lot of places are not OK, but commits for every individual small change are not OK either. The key is in the middle.
Don’t forget to push
Let’s take a look at how to use git collaboratively!
In this part of the course we’ll work in pairs, OWNER and COLLABORATOR, in the same repository.
The OWNER will create a new local git repository called iris_git_course.
The OWNER will create a file called iris.R with the following contents:
The OWNER will add and commit the changes.
The OWNER will create a GitHub repository and synchronize it with their local repo.
The OWNER of the repository on GitHub must give any collaborator permission to pull from and push to the iris_git_course repository. For that, we go to the Settings tab on the repository page.
There, on the Collaborators page, we click Add people and look for the GitHub username of our COLLABORATOR.
An invitation is sent to the COLLABORATOR’s email and, once it is accepted, we can configure the COLLABORATOR’s access.
COLLABORATOR work
The COLLABORATOR will clone the repository locally (on their computer).
The COLLABORATOR will modify iris.R to add the following at the end of the script:
The COLLABORATOR will commit and push the changes to the repository
Look at the situation now: the OWNER has work on GitHub that is ahead of their local repository. We need to synchronize both. To get the latest changes from a repository on GitHub we use git pull.
Now, the OWNER has the latest changes.
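The round trip between OWNER and COLLABORATOR can be simulated on a single computer. In this sketch a local bare repository stands in for GitHub, and the folder names, identities and the contents of iris.R are all placeholders chosen for illustration.

```shell
set -e
base="$(mktemp -d)"

# A local bare repository plays the role of GitHub (assumption)
git init -q --bare "$base/iris_git_course.git"
git --git-dir="$base/iris_git_course.git" symbolic-ref HEAD refs/heads/main

# OWNER: create the project and push it
mkdir "$base/owner"
cd "$base/owner"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Owner"                 # placeholder identity
git config user.email "owner@example.com"    # placeholder identity
echo 'iris_setosa <- subset(iris, Species == "setosa")' > iris.R  # placeholder contents
git add iris.R
git commit -q -m "created iris.R"
git remote add origin "$base/iris_git_course.git"
git push -q -u origin main

# COLLABORATOR: clone, change, commit and push
git clone -q "$base/iris_git_course.git" "$base/collaborator"
cd "$base/collaborator"
git config user.name "Collaborator"                # placeholder identity
git config user.email "collaborator@example.com"   # placeholder identity
echo 'summary(iris_setosa)' >> iris.R              # placeholder change
git commit -q -a -m "added summary"
git push -q origin main

# OWNER: bring the latest changes into the local repository
cd "$base/owner"
git pull
```

After the pull, both local repositories point to the same commit.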
OWNER work
The OWNER will modify iris.R to add a plot step:
library(ggplot2)
ggplot(iris_setosa, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm")
The OWNER will commit and push the changes
COLLABORATOR work
The last step is for the COLLABORATOR to pull the latest changes, closing the circle:
Tip
If you are collaborating with someone in the same git repository, it is strongly recommended to always start with a git pull, as that way we ensure we have the latest changes
Thanks a lot for your patience with git and me :)
05-07 Feb 2024. Version control with git and GitHub