2. Descriptive Statistics: Frequency Data (Counting)

2.4 RStudio Lesson 1: Getting Started with RStudio

Osama Bataineh

[Please note: the RStudio Lessons of this book are currently a work in progress. Please refer to the SPSS Lessons instead.]

R is a free, open-source programming language widely used for statistical data analysis. It is user friendly and runs on a variety of platforms, including macOS, Windows, and Linux. R can be downloaded from https://www.r-project.org/. In most cases, R is used together with RStudio, an additional interface that makes R more user friendly and adds many convenient features. A desktop version of RStudio can be downloaded (for free) at https://rstudio.com/products/rstudio/download/. More details are in the Front Matter of this book, in the section “Statistical Software Used in this Book”.

When you open RStudio, you will see this:

RStudio screenshot © the R Foundation.

Here you can see four windows. In the upper left window, you write your R commands. The output is shown in the console in the lower left window. You can also type commands directly in the console and get the output on the lines that follow. The upper right window describes the dataset you are working on, such as the number of variables and the number of observations. Finally, the lower right window currently lists the R packages. We will talk more about packages a little later. The lower right window also has other purposes. For example, if you write commands to produce graphs, the graphs appear here. It also shows the results of the help command when you seek help with anything in R.

Now let’s briefly discuss how R works. R is a programming language. In other statistical packages such as Stata, SPSS, EViews, and Tableau, you choose the right menu options for the results you want. In R, rather than picking options and clicking on them, you write commands to get the desired output. To aid this process, R offers different packages, each built for a particular kind of analysis. Some of these packages are built in and already installed. However, even an installed package has to be loaded in every work session before you can use it. Other R packages are not installed but are available online. If you think one of these packages serves your purpose, you have to install it and then load it before using it. The next time you use a newly installed package, you only have to load it at the start of the work session.
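The install-once, load-every-session pattern described above can be sketched as follows. This is only an illustration: the helper name is_installed is hypothetical (it is not part of R), and the install.packages line is left commented out because it needs an internet connection.

```r
# Hypothetical helper: TRUE if a package is already installed on this computer.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R itself, so it is installed on every machine.
is_installed("stats")

# An add-on package such as psych is installed once per computer...
if (!is_installed("psych")) {
  # install.packages("psych")   # uncomment to download it (needs internet)
  message("psych is not installed yet")
}

# ...whereas loading has to be repeated in every new work session.
library(stats)
```

Installing is therefore a one-time step per computer, while the library call belongs at the top of every script that uses the package.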

Now let’s get started working with our dataset. First, download the dataset “HyperactiveChildren.sav” from the textbook Data Sets. Then open RStudio and go to File > Import Dataset > From SPSS.

RStudio screenshot © the R Foundation.

After that, click the Browse button in the pop-up window that appears and select the dataset from the directory where you saved it on your computer. Then click Import to load the dataset into R. You can also do the same thing by running the commands shown in the Code Preview section yourself.
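For readers who prefer to type the import themselves, the commands in Code Preview typically resemble the sketch below. It assumes that the haven package (used for reading SPSS files) is installed and that the file sits in the current working directory. The path is a placeholder, so adjust it to wherever you saved the file.

```r
# Placeholder path: change it to the folder where you saved the dataset.
sav_path <- "HyperactiveChildren.sav"

if (file.exists(sav_path) && requireNamespace("haven", quietly = TRUE)) {
  # read_sav() reads an SPSS .sav file into a data frame.
  Hyperactive_Children <- haven::read_sav(sav_path)
  # Quick check: number of rows and columns that were imported.
  print(dim(Hyperactive_Children))
} else {
  message("File or haven package not found; check the path and install haven.")
}
```

The guard around read_sav only prevents an error when the file or the package is missing; once both are in place, the single read_sav line is all that is needed.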

RStudio screenshot © the R Foundation.

After inserting the data, you will see this in RStudio.

RStudio screenshot © the R Foundation.

Now we will write our commands in a new window in the upper left pane and save them in a folder on the computer so that we do not have to retype them later. Such a saved file of commands is known as an R Script. To open a new R Script, click the arrow located just below and between File and Edit, and then select the first option, R Script.

RStudio screenshot © the R Foundation.

After you select R Script, a new script opens in a separate window named Untitled1. Save this script with a suitable name in your desired directory by clicking the old-fashioned floppy disk icon. I have saved the script with the name Lesson 1\_RScript on my computer.

RStudio screenshot © the R Foundation.

Before starting to work, let’s get to know a couple of things. Some of them are essential for anyone who wants to work in R, while others simply make our work easier and more efficient. You can run commands much more quickly from the keyboard by pressing Ctrl and Enter. Besides commands, you can also write notes in the R Script for future reference. To write anything other than a command, put a hash (\#) sign at the beginning of the line. Another thing you must know is that, after importing a dataset, you have to attach it in the current work session to work with it further. To attach the Hyperactive_Children dataset for the current work session, run the following command.

> attach(Hyperactive_Children)
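To see what attaching does, the sketch below uses a small stand-in data frame rather than the real file. The name demo_children is hypothetical, and its ages are rebuilt from the frequency table shown later in this lesson, so the order of the rows is an assumption. The lines starting with a hash sign are notes that R ignores.

```r
# Stand-in data (hypothetical name); ages rebuilt from the lesson's frequency table.
demo_children <- data.frame(Age = rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2)))

# After attaching, a column can be referred to by its bare name.
attach(demo_children)
mean_attached <- mean(Age)

# Detach when finished so the column names do not linger in the session.
detach(demo_children)

# Without attaching, the same column is reached with the dollar sign.
mean_dollar <- mean(demo_children$Age)

print(c(attached = mean_attached, dollar = mean_dollar))
```

Both routes reach the same column. The remaining commands in this lesson use the dollar-sign form, which works whether or not the dataset is attached.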

Now let’s start working with our dataset. Similar to other software packages, a variable has to be numeric to use it for statistical analysis. Thus any qualitative string variable needs to be transformed into a numeric variable if we want to conduct analysis with it. In our dataset, the variable Sex is a qualitative variable with two categories, male and female. Let’s create a new variable Sex\_new which takes the value 1 for male and 2 for female. We can do this with the help of the function ifelse. The command below checks Sex against the codes 1 and 2 and sets any other value to NA (missing).

> Hyperactive_Children$Sex_new <- ifelse(Hyperactive_Children$Sex == 1, 1, ifelse(Hyperactive_Children$Sex == 2, 2, NA))

A new variable Sex\_new has now been created. If we view the data by running the following command, we will see an additional column at the end named Sex\_new, which contains only the numeric values 1 and 2.

> View(Hyperactive_Children)

RStudio screenshot © the R Foundation.

One thing to note here is the use of the dollar sign ($). The dollar sign either retrieves the variable named after it from the dataset named before it or, when it appears on the left side of an assignment, adds a new variable with that name to the dataset.
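Both uses of the dollar sign, together with the nested ifelse recoding, can be tried on a small made-up data frame. The data frame toy and its values are invented purely for illustration and are not taken from the textbook dataset. The stray code 3 is included to show what happens to unexpected values.

```r
# Made-up toy data for illustration only (not the textbook dataset).
toy <- data.frame(Sex = c(1, 2, 2, 1, 3))

# Reading: toy$Sex pulls the existing column out of the data frame.
toy$Sex

# Creating: assigning to toy$Sex_new adds a new column.
# Codes other than 1 or 2 (here the stray 3) become NA.
toy$Sex_new <- ifelse(toy$Sex == 1, 1, ifelse(toy$Sex == 2, 2, NA))

print(toy)
```

The inner ifelse is only evaluated as the "otherwise" branch of the outer one, which is why a value that is neither 1 nor 2 falls through to NA.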

To get the frequency distribution of a variable in R in the form shown in many statistics textbooks, you have to write code for each column separately. For example, to get the frequencies, cumulative frequencies, relative frequencies, and so on, you have to write separate commands, unless you are a programming wizard who can create an R package that produces such a table. First, to get the frequency of each category of a variable, we can use the function table.

> Age.freq <- table(Hyperactive_Children$Age)

> Age.freq 

6 7 8 9 10 11 12
2 1 2 3  3  2  2

> cbind(Age.freq)

   Age.freq
6         2
7         1
8         2
9         3
10        3
11        2
12        2

Here, for better presentation, we have used another function, cbind, which displays the values of the variable and the number of observations for each value as a column. Then, to get the cumulative frequencies, we can use the function cumsum.

> Age.cumfreq <- cumsum(table(Hyperactive_Children$Age))
> cbind(Age.freq, Age.cumfreq)

   Age.freq Age.cumfreq
6         2           2
7         1           3
8         2           5
9         3           8
10        3          11
11        2          13
12        2          15
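The same approach extends to relative and cumulative relative frequencies. The sketch below uses the base R function prop.table for the relative frequencies. So that it runs without the data file, it rebuilds the ages from the frequency table above. The vector name ages and the column names are choices made for this illustration.

```r
# Ages rebuilt from the frequency table above (stand-in for Hyperactive_Children$Age).
ages <- rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2))

Age.freq       <- table(ages)            # frequencies
Age.cumfreq    <- cumsum(Age.freq)       # cumulative frequencies
Age.relfreq    <- prop.table(Age.freq)   # relative frequencies (proportions)
Age.cumrelfreq <- cumsum(Age.relfreq)    # cumulative relative frequencies

# Combine all four columns into one table, rounding the proportions.
freq_table <- cbind(freq       = Age.freq,
                    cumfreq    = Age.cumfreq,
                    relfreq    = round(Age.relfreq, 3),
                    cumrelfreq = round(Age.cumrelfreq, 3))
print(freq_table)
```

With the real data, replacing ages by Hyperactive_Children$Age should give the same table, provided the imported column matches the frequencies shown above.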

To produce a pie chart, R has a specific function, pie. However, since we need a pie chart of the frequencies, we have to pass table(Hyperactive_Children$Age) to pie. The output will be shown in the lower right window.

> pie(table(Hyperactive_Children$Age))

RStudio screenshot © the R Foundation.
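The default chart labels each slice with the age only. As an optional refinement, the sketch below adds percentages to the labels and a title through the labels and main arguments of pie. The ages are rebuilt from the frequency table earlier in this lesson, and the label wording and title are choices made for this illustration.

```r
# Ages rebuilt from the lesson's frequency table (stand-in for the real column).
ages <- rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2))
freq <- table(ages)

# Percentage of children at each age, rounded to whole numbers.
pct <- as.vector(round(100 * prop.table(freq)))

# Build one label per slice, e.g. age followed by its percentage.
slice_labels <- paste0(names(freq), " yrs (", pct, "%)")

pie(freq, labels = slice_labels, main = "Age of children")
```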

Now let’s have a look at the descriptive statistics of the variable Age. To calculate descriptive statistics, we need to install a specific package named psych. After installing and loading this package, we use the function describe, which is part of this package. If any warning messages appear, please ignore them.

> install.packages('psych')
> library(psych)

> describe(Hyperactive_Children$Age)

vars   n  mean    sd  median  trimmed   mad  min  max  range   skew  kurtosis   se
   1  15   9.2  1.93       9     9.23  1.48    6   12      6  -0.21     -1.18  0.5
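Several of these numbers can be cross-checked with base R functions alone, which also shows where the standard error (se) comes from. The sketch rebuilds the ages from the frequency table earlier in this lesson. It is a hand check under that assumption, not a replacement for describe.

```r
# Ages rebuilt from the lesson's frequency table (stand-in for the real column).
ages <- rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2))

n        <- length(ages)
age_mean <- mean(ages)
age_sd   <- sd(ages)              # sample standard deviation (divides by n - 1)
age_se   <- age_sd / sqrt(n)      # standard error of the mean

summary_stats <- c(n      = n,
                   mean   = age_mean,
                   sd     = age_sd,
                   median = median(ages),
                   min    = min(ages),
                   max    = max(ages),
                   range  = max(ages) - min(ages),
                   se     = age_se)
print(round(summary_stats, 2))
```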

Similarly, to get the confidence interval, you need another package named Rmisc. While installing this package, you have to set dependencies to TRUE. After installing and loading this package, we use one of its functions, CI, to get the confidence interval. Also remember that, in addition to passing Hyperactive_Children$Age to the function, you have to specify the confidence level. Ignore any warning messages here too.

> install.packages('Rmisc', dependencies = TRUE)
> library(Rmisc)

> CI(Hyperactive_Children$Age, ci = 0.95)
    upper      mean     lower
10.271372  9.200000  8.128628
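The interval can also be computed by hand, which makes the role of the confidence level explicit. The sketch below assumes the usual t-based formula for a mean (mean plus or minus a t critical value times the standard error) and rebuilds the ages from the frequency table earlier in this lesson. The variable names are chosen for this illustration.

```r
# Ages rebuilt from the lesson's frequency table (stand-in for the real column).
ages <- rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2))

n          <- length(ages)
age_mean   <- mean(ages)
age_se     <- sd(ages) / sqrt(n)          # standard error of the mean
conf_level <- 0.95

# Two-sided critical value from the t distribution with n - 1 degrees of freedom.
t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)

ci <- c(upper = age_mean + t_crit * age_se,
        mean  = age_mean,
        lower = age_mean - t_crit * age_se)
print(round(ci, 6))
```

Changing conf_level to 0.99 widens the interval, because a larger t critical value is needed to capture the mean with higher confidence.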

Finally, to produce the histogram, stem-and-leaf display, and boxplot, R has functions named hist, stem, and boxplot, respectively.

> hist(Hyperactive_Children$Age)

RStudio screenshot © the R Foundation.

> stem(Hyperactive_Children$Age)

The decimal point is at the |

 6 | 000
 8 | 00000
10 | 00000
12 | 00

> boxplot(Hyperactive_Children$Age)

RStudio screenshot © the R Foundation.
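The numbers behind these plots can also be inspected directly. In the sketch below, the breaks argument of hist places one bar on each whole-year age, and boxplot.stats returns the five values that the boxplot draws. The ages are rebuilt from the frequency table earlier in this lesson, and the break points and axis labels are choices made for this illustration.

```r
# Ages rebuilt from the lesson's frequency table (stand-in for the real column).
ages <- rep(6:12, times = c(2, 1, 2, 3, 3, 2, 2))

# One bin per age: breaks at 5.5, 6.5, ..., 12.5.
h <- hist(ages, breaks = seq(5.5, 12.5, by = 1),
          main = "Histogram of Age", xlab = "Age (years)")
print(h$counts)       # bar heights, one per age

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
box_numbers <- boxplot.stats(ages)$stats
print(box_numbers)
```

With one bin per age, the bar heights repeat the frequency table, which is a quick way to confirm that the histogram and the table describe the same data.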