How to Use R for Data Science (2024)

Content of this chapter

  1. Why R?
  2. How to learn R
  3. Learning resources
  4. What are R and RStudio?
  5. How to use R and RStudio without installation
  6. Installing R and RStudio
  7. What is a function in R?
  8. What are objects in R?
  9. What are R packages?
  10. Base R and the tidyverse universe
  11. Are there some guidelines for working with R?

1.1 Why R?

R is an open-source programming language that allows you to analyse and manipulate data, create state-of-the-art graphics, and much more. It supports large data sets, reads almost any type of data, and runs on multiple platforms (Windows, Mac, Linux) and CPU architectures (x86_64, arm64). R makes it easier to automate tasks, organize projects, ensure reproducibility, and find and fix errors, and anyone can contribute packages to improve its functionality. Moreover, the following points are worth emphasizing:

  • R is an artist! Check out:
  • R is employment insurance! Programming is a core skill in research, economics, and business. If you can write code, you have plenty of opportunities to earn a decent salary. R is one of the most widely used programming languages in the world today. It is used in almost every industry, such as finance, banking, medicine, or manufacturing. In finance and banking, for example, R is used for portfolio management and risk analytics. Even if you need to learn a new programming language later, knowing R makes it much easier to pick up another one.
  • R uses the computer and computers are great! Doing statistics on a computer is faster, easier, and more powerful than doing it by hand. Computers are an extension of your brain and can do repetitive tasks better and faster without making logical errors. The only reason to do statistical calculations with pencil and paper is for learning purposes.
  • Excel is limited! Using spreadsheet software like Microsoft Excel for research can be problematic. It’s easy to lose track of operations, which makes the process difficult to oversee and document. Command-line programs may not be as easy to learn, but they offer a more straightforward approach that allows the results to be replicated easily.
  • R is open source! Proprietary software is expensive, and support can only be provided by the copyright owner, which means that once the software expires you can’t do anything about it. Moreover, security issues cannot be checked because the source code is not available, and possibilities for customization are limited. R is yours and everybody can contribute to its success.
  • R is big! When you download and install R, you get some basic packages whose functions already allow you to do a lot. Beyond that, you can write your own packages or install user-written packages that extend your possibilities. With more than 20,000 packages on the CRAN repository and many more available on GitHub and other platforms, R’s extensive library supports a wide variety of data science tasks. Its widespread use and open-source availability have cemented R as a standard tool in data science and ensured that there are multiple approaches to most data handling processes. These can be easily adopted.

R has weaknesses

For newcomers to programming, progress tends to be slow at the beginning. One reason is that R tools are spread across many packages, which can overwhelm beginners. There is no centralized support, and the members of the helpful and active online community have different backgrounds. It can be difficult for beginners to find the right solution because there are often many different ways to tackle the same problem. Moreover, R can be slower than languages like Python, MATLAB, C/C++ or Java.

1.2 How to learn R

There are many different approaches to learning R. Which one suits you depends on your preferences, needs, goals, prerequisites, and limitations. It is up to you to find a suitable way to achieve your learning goals. While I hope you find my notes helpful, I also provide a list of other resources worth considering in Section 1.3. To start with, I recommend my swirl courses, which provide an interactive learning environment, see Chapter 3.

Get your hands dirty!

Learning a programming language can, like learning a foreign language, be daunting and frustrating. However, if you put in the effort and are not afraid to make mistakes, you can learn it. You don’t have to be a nerd. Having a guide next to you can help and speed up your progress significantly. The key is taking action and getting involved: write code. Try to copy the code that you read here and elsewhere. Explore what the code does on your machine. Don’t be afraid to make errors. Your PC will not explode. In these notes, most of the code is written so that you can effortlessly copy it and reproduce the output on your PC. Take advantage of this opportunity and go for it! Hands-on practice is far more enjoyable than merely reading through the material.

Here are some comments that may help you to learn efficiently:

  • Computers need clear and precise instructions to work: They can’t handle mistakes or unclear directions. Even small errors like a missing comma or an unclosed bracket can cause your code to not work. Computers do exactly what you tell them, no more and no less.
  • Copy, paste, and tweak: While learning code from scratch is sometimes essential, you can speed up your work by modifying code that already exists. I call this the “copy, paste, and tweak” approach. While this is not the only way to learn code, it gets a job done quickly, and it is fun (see Figure 1.1).
  • Have a purpose when coding: Rather than learning to code for its own sake, it is more fun and you’ll probably learn faster when you have a goal in mind. Try to analyze data that you are interested in. Another good exercise is replicating a research paper.
  • Practice is key: The best way to improve your coding skills is through lots of practice. Consequently, these notes give you plenty of exercises.
  • Use ChatGPT: Using supporting tools is not forbidden. ChatGPT can help you to understand code and brainstorm solutions. However, it’s important to know that ChatGPT might suggest complex methods when shorter and more elegant solutions are available. Absolute beginners might find ChatGPT’s solutions overwhelming and have difficulty tweaking the proposed sketch of a solution. So, use it thoughtfully.

1.3 Learning resources

Thousands of freely available books and resources exist. bookdown.org and the Big Book of R are two vast collections of links to R books that support this claim.

In RStudio, the panel at the bottom right has a tab called Help. There you find links, manuals, and references that offer plenty of free resources for learning R, including education.rstudio.com and Links for Getting Help with R. At the top right of RStudio you find a tab called Tutorial. Here you can install the learnr package, which offers some nice interactive tutorials.

Since you may feel overwhelmed by the number of resources, I would like to highlight some books:


  1. Timbers et al. (2022): Data Science: A First Introduction is a free and up-to-date book that comes with exercises and worksheets, which are available in the UBC-DSCI GitHub repository.
  2. Wickham & Grolemund (2023): R for Data Science: Import, Tidy, Transform, Visualize, and Model Data is the most popular source to learn R. It focuses on introducing the tidyverse package and is freely available online.
  3. Irizarry (2022): Introduction to Data Science: Data Analysis and Prediction Algorithms With R is a complete, up-to-date, and applied introduction.
  4. Venables et al. (2022): An Introduction to R: Notes on R: A Programming Environment for Data Analysis and Graphics is a manual from the R Core Development Team that shows how to use R without having to install and load additional packages.
  5. Neth (2023): Data Science for Psychologists is a comprehensive introduction to R and data science for non-experts in both programming and data science. It uses a variety of data types and includes many examples and exercises.
  6. Kabacoff (2024): Modern Data Visualization with R teaches how to create graphs from scratch and provides many examples that you can copy, paste, and tweak.

Some other sources that are worth mentioning are these:

1.4 What are R and RStudio?

Throughout this book, I will assume that you are using R via RStudio. First-time users often confuse the two. At its simplest, R is like a car’s engine while RStudio is like a car’s dashboard, as illustrated in Figure 1.2.

More precisely, R is a programming language that runs computations, while RStudio is an integrated development environment (IDE) that provides an interface by adding many convenient features and tools. Just as having access to a speedometer, rearview mirrors, and a navigation system makes driving much easier, using RStudio’s interface makes using R much easier as well.

Much as we don’t drive a car by interacting directly with the engine but rather by interacting with elements on the car’s dashboard, we won’t be using R directly but rather we will use RStudio’s interface. After you install R and RStudio on your computer, you’ll have two new programs (also called applications) you can open. We’ll always work in RStudio and not in the R application. Figure 1.3 shows which icon you should click on your computer.

After you open RStudio, you should see something similar to Figure 1.4, with three or four panels dividing the screen.

  1. The Environment panel, where a list of the data you have imported and created can be found.
  2. The Files, Plots and Help panel, where you can see a list of available files, will be able to view graphs that you produce, and can find help documents for different parts of R.
  3. The Console panel, used for running code. This is where we’ll start with the first few examples.
  4. The Script panel, used for writing code. This is where you’ll spend most of your time working.

The Console panel will contain R’s startup message, which shows information about which version of R you’re running. My startup message at the time of writing was as follows:

 R version 4.3.3 (2024-02-29) -- "Angel Food Cake"
 Copyright (C) 2024 The R Foundation for Statistical Computing
 Platform: x86_64-pc-linux-gnu (64-bit)

 R is free software and comes with ABSOLUTELY NO WARRANTY.
 You are welcome to redistribute it under certain conditions.
 Type 'license()' or 'licence()' for distribution details.

   Natural language support but running in an English locale

 R is a collaborative project with many contributors.
 Type 'contributors()' for more information and
 'citation()' on how to cite R or R packages in publications.

 Type 'demo()' for some demos, 'help()' for on-line help, or
 'help.start()' for an HTML browser interface to help.
 Type 'q()' to quit R.

If you don’t have panel number 4, open it by opening an existing R script or creating a new one. You can create a new one by pressing Ctrl+Shift+N (alternatively, you can use the menu: File → New File → R Script).

You can resize the panels as you like, either by clicking and dragging their borders or by using the minimise/maximise buttons in the upper right corner of each panel. Pressing Ctrl++ or Ctrl+- makes the font larger or smaller.

1.5 How to use R and RStudio without installation

If you don’t want to install R on your PC, don’t have admin rights to do so, or want to run R on a tablet (iPad or Chromebook) or even a smartphone, you can use RStudio online at https://posit.cloud. Posit Cloud (formerly RStudio Cloud) is a cloud-based solution that allows anyone to use RStudio through a web browser. It is free for individuals, with some restrictions and limited capacities.

1.6 Installing R and RStudio

To use R on your personal computer (Windows, Mac, Linux), you first need to download and install both R and RStudio (Desktop version). It is important that you install R first and then install RStudio.

  1. Do this first: Download and install R by going to https://cloud.r-project.org/.
    • If you are a Windows user: Click on “Download R for Windows”, then click on “base”, then click on the Download link.
    • If you are a macOS user: Click on “Download R for macOS”, then under “Latest release:” click on R-X.X.X.pkg, where R-X.X.X is the version number. For example, the latest version of R as of March 29, 2024 was R-4.3.3.
    • If you are a Linux user: Click on “Download R for Linux” and choose your distribution for more information on installing R for your setup.
  2. Do this second: Download and install RStudio at https://www.rstudio.com/products/rstudio/download/.
    • Scroll down to “Installers for Supported Platforms” near the bottom of the page.
    • Click on the download link corresponding to your computer’s operating system.

1.7 What is a function in R?

R is a functional programming language. If you want R to do something, you need to use a function. Or, in the words of Chambers (2017, p. 4):

“Everything that happens is a function call.”

For example, when you want to exit R, you do it with the function q():

> q()
Save workspace image? [y/n/c]: 

If you want to specify what exactly you want R to do for you, you need to set the arguments of a function. For example, if you don’t want to be asked interactively what to do with your workspace (this is the place where all your objects are stored, see Section 1.8), you can say so with an argument of the q() function:

> q(save = "no")
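Naming an argument works the same way for any function. The following is a minimal sketch that uses round() and seq() from base R; these two functions and the numbers are chosen here only for illustration and do not come from the text above.

```r
# Call a function without optional arguments: round() then uses digits = 0
rounded_default <- round(3.14159)

# Set the optional argument explicitly by name
rounded_two <- round(3.14159, digits = 2)

# Several named arguments in one call
steps <- seq(from = 0, to = 1, by = 0.25)

print(rounded_default)
print(rounded_two)
print(steps)
```

Writing the argument names out is optional when the order is clear, but it tends to make code easier to read.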

To learn more about a function, you can access its documentation by typing a question mark followed by the function name into the Console:

?q

Unfortunately, the documentation can sometimes be a bit confusing for beginners in applied contexts. However, the documentation for all functions is structured similarly, typically featuring several key sections:

  • Description: A brief overview of what the function does.
  • Usage: How to use the function, including the function name and its arguments.
  • Arguments: Detailed descriptions of each argument the function accepts, including what types of values are expected.
  • Details: Additional details about the function’s behavior and any important notes.
  • Examples: Practical examples demonstrating how to use the function in various contexts.

Understanding these sections can significantly enhance your ability to navigate and utilize R.

An excerpt of the R Documentation for the function q() is shown in Figure 1.5. Here, we observe that the function has three arguments that you can manipulate. If you do not specify any of these arguments explicitly, R sets the three arguments to the defaults shown.
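The arguments and defaults of a function can also be inspected from code, without opening the help page. This is a small sketch using the base R functions formals() and args(); the variable names q_args and q_save_default are made up for this example.

```r
# Names of the arguments that q() accepts (q() itself is not called here)
q_args <- names(formals(q))
print(q_args)

# Default value of the save argument
q_save_default <- formals(q)$save
print(q_save_default)

# args() prints the full signature, including all defaults
args(q)
```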

1.8 What are objects in R?

R is an object-oriented programming language. That means,

“everything that exists in R is an object” (Chambers, 2017, p. 4).

Objects are the fundamental units that are used to store information. Objects can be of various types, including vectors, matrices, data frames, lists, and functions. Moreover, you can store empirical results, tables, figures, and much more in the form of objects. All objects live in the workspace, which is shown in the Environment panel.

In R, you can show the content of the workspace with ls(). The function rm() removes objects, and rm(list=ls()) clears all objects from the workspace.
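As a brief illustration of creating, listing, and removing objects, consider the sketch below. The object names (heights, people, square) and their values are invented for this example.

```r
# Create a few objects of different types
heights <- c(1.62, 1.75, 1.80)                    # numeric vector
people  <- data.frame(id = 1:3, height = heights) # data frame
square  <- function(v) v^2                        # functions are objects too

# List the objects in the current workspace
print(ls())

# Inspect what kind of object each one is
print(class(heights))
print(class(people))
print(class(square))

# Remove a single object and check that it is gone
rm(heights)
print(exists("heights"))
```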

1.9 What are R packages?

A package is a collection of functions, data sets and other R objects that are all grouped together under a common name. More than 20,000 packages are available at the official repository (CRAN, see footnote 1) and many more are publicly available through the internet.

However, before we get started, there’s a critical distinction that you need to understand, which is the difference between having a package installed on your computer, and having a package loaded in R. When you install R on your computer only a small number of packages come bundled with the basic R installation. The installed packages are on your computer. The critical thing to remember is that just because something is on your computer doesn’t mean R can use it. In order for R to be able to use one of your installed packages, that package must also be loaded. Generally, when you open up R, only a few of these packages (about 7 or 8) are actually loaded.

Package management

  1. A package must be installed before it can be loaded.
  2. A package must be loaded before it can be used.

We only need to install a package once on our computer. However, to use the package, we need to load it every time we start a new R session, for example after restarting RStudio.
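The difference between installed and loaded packages can be checked directly in the console. The sketch below relies on the base R functions installed.packages() and search(); the variable names are chosen for this example, and the exact lists depend on your own installation.

```r
# Packages that are installed on this computer
installed <- rownames(installed.packages())

# Packages that are currently loaded and attached in this session
attached <- search()
print(attached)

# stats ships with R: it is installed and loaded by default
print("stats" %in% installed)
print("package:stats" %in% attached)

# Installed packages that are not loaded in this session
not_loaded <- setdiff(installed, sub("^package:", "", attached))
print(head(not_loaded))
```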

1.9.1 Package installation

To install an R package you can use the GUI of RStudio or the command line. In RStudio, click on the Packages tab, then on the Install button, then search for the package and click Install. An alternative way to install a package is by typing

install.packages("package_name")

in the console pane of RStudio and pressing Return/Enter on your keyboard. Note you must include the quotation marks around the name of the package.

If you want to update a previously installed package to a newer version, you need to re-install it by repeating the earlier steps, or you can use update.packages(). To uninstall packages you can use remove.packages().

The installation of packages can take some time. However, if your CPU has many cores, you can speed up the process considerably with the argument Ncpus, for example update.packages(ask = FALSE, Ncpus = 4L). This argument sets the number of parallel processes R may use when it installs several packages from source. So, if you have a CPU with many cores, you can increase that number. A tutorial on how to set the number of cores used by R permanently can be found here.
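The following sketch shows one possible way to find out which packages still need to be installed and to set Ncpus for the current session. The helper missing_packages() and the vector wanted are hypothetical names introduced for this example, and leaving one core free is just one reasonable choice. The install call is commented out because it needs internet access.

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

wanted <- c("stats", "utils", "ggplot2")
to_install <- missing_packages(wanted)
print(to_install)

# Use all but one core for installations in this session
# (detectCores() can return NA, so fall back to a single process)
n_cores <- parallel::detectCores()
if (is.na(n_cores)) n_cores <- 1L
options(Ncpus = max(1L, n_cores - 1L))
print(getOption("Ncpus"))

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

install.packages() reads the Ncpus option by default, so the argument does not have to be repeated in every call.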

1.9.2 Package loading

Recall that after you’ve installed a package, you need to load it. We do this by using the library() command. For example, to load the ggplot2 package, run the following code in the console pane. What do we mean by “run the following code”? Either type or copy-and-paste the following code into the console pane and then hit the Enter key.

library("ggplot2")

If after running the earlier code, a blinking cursor returns next to the > “prompt” sign, it means you were successful and the ggplot2 package is now loaded and ready to use. If, however, you get a red “error message” that reads

Error in library(ggplot2) : there is no package called ‘ggplot2’

it means that you didn’t successfully install it. If you get this error message, go back to Section 1.9.1 on R package installation and make sure to install the ggplot2 package before proceeding.

One very common mistake new R users make when wanting to use particular packages is they forget to load them first by using the library() command we just saw. Remember: you have to load each package you want to use every time you start RStudio. If you don’t first load a package, but attempt to use one of its features, you’ll see an error message similar to:

Error: could not find function

R is informing you that you are attempting to use a function from a package that has not yet been loaded. Forgetting to load packages is a common mistake made by new users, and it can be a bit frustrating to get used to at first. However, with practice, it will become second nature for you. Unloading packages can be done with detach(package:ggplot2, unload=TRUE).
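To see what this error looks like without breaking your session, the sketch below catches it with tryCatch(). The function name not_a_real_function is deliberately made up. The second part shows the pkg::function notation, which is not covered in the text above: it calls a function from an installed package without loading the package first.

```r
# Calling a function that R cannot find raises an error;
# tryCatch() captures the message instead of stopping the script
msg <- tryCatch(
  not_a_real_function(1),  # hypothetical name, no such function exists
  error = function(e) conditionMessage(e)
)
print(msg)

# pkg::function uses a function from an installed package without library()
middle <- stats::median(c(3, 1, 2))
print(middle)
```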

1.9.3 Simplified package management with p_load

There is a convenient way to handle both package installation and loading without the need for manual checks. The p_load() function does this seamlessly by verifying if a package is installed; if not, it installs the package.

For instance, instead of the traditional approach:

pkgs <- c("arrow", "babynames", "curl", "duckdb", "gapminder",
          "ggrepel", "ggridges", "ggthemes", "hexbin", "janitor", "Lahman",
          "leaflet", "maps", "nycflights13", "openxlsx", "palmerpenguins",
          "repurrrsive", "tidymodels", "writexl")
install.packages(pkgs)
# library() loads only one package per call, so loop over the names
for (pkg in pkgs) library(pkg, character.only = TRUE)

You can streamline the process as follows:

# Install and load the required packages using p_load
if (!require(pacman)) install.packages("pacman")
pacman::p_load(
  arrow, babynames, curl, duckdb, gapminder,
  ggrepel, ggridges, ggthemes, hexbin, janitor, Lahman,
  leaflet, maps, nycflights13, openxlsx, palmerpenguins,
  repurrrsive, tidymodels, writexl
)

The line

if (!require(pacman)) install.packages("pacman")

ensures the installation of the pacman package, which is necessary for using the p_load function.

Before you load packages in a script, I recommend unloading all other packages with

pacman::p_unload(all)

to avoid conflicts of functions (see Section 4.5).

1.10 Base R and the tidyverse universe

Upon successfully installing R, you gain access to functions that are part of Base R. This includes standard packages automatically installed and loaded with each R session, such as stats, utils, and graphics, which provide a broad spectrum of functions for statistical analysis and graphics (see Venables et al., 2022). However, the syntax in Base R can become complex and less intuitive for users. Consequently, many individuals, including Hadley Wickham, the Chief Data Scientist at Posit (formerly RStudio), and his team, have developed an alternative suite of packages known as the tidyverse. These packages share a common philosophy and syntax, emphasizing readability and ease of use. We will heavily utilize the tidyverse in the following sections.

The R package tidyverse (see Figure 1.6) is a comprehensive collection of R packages, including popular packages such as ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats, which together offer extensive capabilities for data modeling, transformation, and visualization.

How to do data science with the tidyverse is the subject of multiple books and tutorials. In particular, the popular book R for Data Science by Wickham & Grolemund (2023) is all about the tidyverse universe. Thus, I highly recommend reading the sections Workflow: basics, Data transformation, and Data tidying. Additionally, explore www.tidyverse.org for more resources, and consider completing the tidyverse module in my swirl package, swirl-it, as detailed in Chapter 3.

To install and load tidyverse run the following lines of code:

install.packages("tidyverse")
library("tidyverse")
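To make the difference in syntax between Base R and the tidyverse more concrete, here is a small comparison on the mtcars data set that ships with R. The task (mean miles per gallon by number of cylinders, for cars with more than 20 mpg) is invented for this illustration. The tidyverse part only runs if the dplyr package is installed.

```r
# Base R: filter rows with [ ], then aggregate by group
efficient_cars <- mtcars[mtcars$mpg > 20, ]
base_result <- aggregate(mpg ~ cyl, data = efficient_cars, FUN = mean)
print(base_result)

# tidyverse (dplyr): the same steps written as a pipeline of verbs
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  tidy_result <- mtcars %>%
    filter(mpg > 20) %>%
    group_by(cyl) %>%
    summarise(mpg = mean(mpg))
  print(tidy_result)
}
```

Both versions compute the same numbers; which one reads better is partly a matter of taste and habit.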

1.11 Are there some guidelines for working with R?

To avoid running into issues with R, it’s important to be aware of some conventions, rules, and best practices that apply to the language. While it can be tedious to go through all of the do’s and don’ts in detail, following them can make your life with R much easier. Trust me, the benefits of adhering to these guidelines will become clear over time. Here is a non-exhaustive list:

  1. Do remember that the R programming language is case sensitive.
  2. Do start names of objects such as vectors, numbers, variables, and data frames with a letter, not a number.
  3. Do avoid using dots in names of objects.
  4. Do avoid using certain keywords in naming objects, such as if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA.
  5. Do use forward slash / instead of backslash \ for navigating the file system (see Appendix A).
  6. Do not use whitespace in the names of files, directories, or objects.
  7. Do not ignore warnings or errors unless you know what they mean.
  8. Do define objects to represent hard-coded values instead of using them directly in code.
  9. Do remember to (install and) load packages that contain functions you want to use.
  10. Do remember to set your working directory. (Tip: Use RStudio projects, see Section A.4)
  11. Do remember to comment your code.
  12. Do use <- instead of = for assignment.
  13. Do clear the environment at the beginning of a script: rm(list = ls())

An exhaustive style guide on how to write code can be found here.
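A few of the guidelines above can be tried out directly. The sketch below is only an illustration: the object names and the value of vat_rate are made up, and make.names() is used here as one convenient way to test whether a name is syntactically valid.

```r
# Guidelines 8 and 12: store hard-coded values in named objects, assign with <-
vat_rate    <- 0.19   # hypothetical value, for illustration only
net_price   <- 100
gross_price <- net_price * (1 + vat_rate)
print(gross_price)

# Guideline 1: R is case sensitive, so these are two different names
print(exists("gross_price"))
print(exists("Gross_Price"))

# Guidelines 2, 4 and 6: make.names() leaves a name unchanged only if it is valid
candidates <- c("sales_2024", "2024_sales", "my data", "if")
is_valid <- make.names(candidates) == candidates
print(data.frame(name = candidates, valid = is_valid))
```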

Figure 1.1: Play around with code
Figure 1.2: Analogy of difference between R and RStudio
Figure 1.3: Icons of R versus RStudio on your computer
Figure 1.4: RStudio interface to R
Figure 1.5: The R Documentation of q()
Figure 1.6: The tidyverse universe

  1. CRAN is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for R, see: https://cran.r-project.org.

Author: Ray Christiansen
