1.1 What is R?

R is a freely available “computational language and environment for data analysis and graphics.” R is indispensable for anyone who uses and interprets data. As medical, public health, and research epidemiologists, we use R in the following ways:

- Full-function calculator
- Extensible statistical package
- High-quality graphics tool
- Multi-use programming language

We use R to explore, analyze, and understand epidemiological data. We analyze data straight out of tables provided in reports or articles, as well as conventional data sets. The data might be a large, individual-level data set imported from another source (e.g., cancer registry); an imported matrix of group-level data (e.g., population estimates or projections); or some data extracted from a journal article we are reviewing.
The ability to quantitatively express, graphically explore, and describe epidemiologic data and processes enables one to exercise and strengthen one’s epidemiologic intuition. In fact, we use only a very small fraction of what R offers. For those who develop an interest or have a need, R also has many of the statistical modeling tools used by epidemiologists and statisticians, including logistic and Poisson regression, and Cox proportional hazards models. However, for many of these routine statistical models, almost any package will suffice (SAS, Stata, SPSS, etc.).
The real advantage of R is the ability to easily manipulate, explore, and graphically display data. Repetitive analytic tasks can be automated or streamlined with the creation of simple functions (programs that execute specific tasks). The initial learning curve is steep, but in the long run one is able to conduct analyses that would otherwise require a tremendous amount of programming and time. Some may find R challenging to learn if they are not familiar with statistical programming. R was created by statistical programmers and is more often used by analysts comfortable with matrix algebra and programming. However, even for those unfamiliar with matrix algebra, there are many analyses one can accomplish in R without any advanced mathematics that would be difficult in other programs.
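As a minimal sketch of what such a "simple function" might look like, the example below wraps a risk ratio calculation so it can be reused on any 2 x 2 table. The function name risk_ratio and the counts are hypothetical and chosen only for illustration; they do not come from the text.

```r
# Hypothetical helper: risk ratio from a 2 x 2 table of counts.
# Rows = exposure (exposed, unexposed); columns = outcome (case, noncase).
risk_ratio <- function(tab) {
  risk <- tab[, 1] / rowSums(tab)   # risk of disease in each exposure group
  unname(risk[1] / risk[2])         # exposed risk divided by unexposed risk
}

# Made-up counts, for illustration only
tab <- matrix(c(30, 70,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(exposure = c("exposed", "unexposed"),
                              outcome  = c("case", "noncase")))

risk_ratio(tab)   # (30/100) / (10/100)
```

Once defined, the same function can be applied to every table in a report without retyping the arithmetic.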
The ability to easily manipulate data in R will allow one to conduct good descriptive epidemiology, life table methods, graphical displays, and exploration of epidemiologic concepts. R allows one to work with data in any way they come.

1.3 Who should learn R?

Anyone who uses a calculator or spreadsheet, or analyzes numerical data at least weekly, should seriously consider learning and using R. This includes epidemiologists, statisticians, physician researchers, engineers, health economists, health systems analysts, business analysts, and faculty and students of mathematics and science courses, to name just a few. We jokingly tell our staff analysts that once they learn R they will never use a spreadsheet program again (well, almost never!).
1.4 Why should I learn R?

To implement numerical methods we need a computational tool. On one end of the spectrum are calculators and spreadsheets for simple calculations, and on the other end of the spectrum are specialized computer programs for such things as statistical and mathematical modeling. However, many numerical problems are not easily handled by these approaches. Calculators, and even spreadsheets, are too inefficient and cumbersome for numerical calculations whose scope and scale change frequently.
Statistical packages are usually tailored for the statistical analysis of data sets and often lack an intuitive, extensible, open source programming language for tackling new problems efficiently. R can do the simplest and the most complex analysis efficiently and effectively. When we learn and use R regularly, we will save significant amounts of time and money. It’s powerful and it’s free! It’s a complete environment for data analysis and graphics. Its straightforward programming language facilitates the development of functions to extend and improve the efficiency of our analyses.

1.5 Where can I get R?
R is available for many computer platforms, including Mac OS, Linux, Microsoft Windows, and others. R comes as source code or a binary file. Source code needs to be compiled into an executable program for our computer. Those not familiar with compiling source code (and that’s most of us) just install the binary program. We assume most readers will be using R in the Mac OS or MS Windows environment. Listed here are useful R links:

- R Project home page
- R download page
- Free tutorials
- R Wikibook
- R Journal

To install R for Windows or Mac OS, do the following:

1. Go to the R Project home page;
2. From the left menu list, click on the “CRAN” (Comprehensive R Archive Network) link;
3. Select a nearby geographic site;
4. Select the appropriate operating system;
5. Click on the “base” link;
6. For Windows, save R-X.X.X-win.exe to the computer; for Mac OS, save the R-X.X.X-pkg installer package;
7. Run the installation program and accept the default installation options.

1.6.2 Does R have epidemiology programs?

The default installation of R does not have packages that specifically implement epidemiologic applications; however, many of the statistical tools that epidemiologists use are readily available, including statistical models such as unconditional logistic regression, conditional logistic regression, Poisson regression, Cox proportional hazards regression, and much more.
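To illustrate that these models are available without extra packages, here is a minimal sketch of an unconditional logistic regression fitted with glm from the default installation. It assumes the esoph data set that ships with R (grouped case-control counts); the choice of data set and model formula is ours, for illustration only.

```r
# esoph ships with R (datasets package): grouped case-control counts
# by age group, alcohol consumption, and tobacco consumption.
fit <- glm(cbind(ncases, ncontrols) ~ alcgp,
           family = binomial, data = esoph)

# Coefficient table (estimates are on the log-odds scale)
summary(fit)$coefficients

# Poisson regression uses the same interface with family = poisson
```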
Table lists selected R packages with biostatistical or mathematical modeling methods applied to epidemiologic problems. The focus of this book is learning how to use R without relying on specific packages. Learning the R basics covered in this book will help one take full advantage of these and other R packages, some of which address advanced topics such as network modeling of epidemics.

1.6.3 How should I use these notes?

The best way to learn R is to use it! Use it as your calculator! Use it as your spreadsheet! Finally, read these notes sitting at a computer and use R interactively (this works best sitting in a cafe that brews great coffee and plays good music). Although we initially encourage you to use R interactively by typing expressions at the console, as a general rule it is much better to type your code as an R script. Save your code with a convenient file name such as job01.R.
RStudio comes with a text editor for creating and editing R scripts. Our focus will be learning how to use RStudio to edit and run R scripts. The code in your text editor can be run in the following ways:

- Highlight and run selected expressions in RStudio;
- Copy and paste the code directly into the R console;
- Run the file in batch mode from the R console using the source function (e.g., source('job01.R')).
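The batch-mode option can be tried without creating a file by hand. The sketch below writes a two-line script to a temporary folder (a stand-in for job01.R in your own working directory) and then runs it with source. The script contents are made up for illustration.

```r
# Write a tiny script to a temporary file (stand-in for job01.R)
script <- file.path(tempdir(), "job01.R")
writeLines(c("x <- 5 + 4",
             "y <- sqrt(x)"),
           script)

# Run the whole file in batch mode; the objects it creates
# (x and y) are then available in the workspace
source(script)
y
```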
Table 1.2: Selected math operators

| Operator | Description | Try these examples |
|---|---|---|
| + | addition | 5+4 |
| - | subtraction | 5-4 |
| * | multiplication | 5*4 |
| / | division | 5/4 |
| ^ | exponentiation | 5^4 |
| - | unary minus (change current sign) | -5 |
| abs | absolute value | abs(-23) |
| exp | exponentiation (\(e\) to a power) | exp(8) |
| log | logarithm (default is natural log) | log(exp(8)) |
| sqrt | square root | sqrt(64) |
| %/% | integer divide | 10%/%3 |
| %% | modulus | 10%%3 |
| %*% | matrix multiplication | xx |

1.7.2.1 Types of evaluable expressions

Every expression that is entered at the R console is evaluated by R and returns a value. A literal is the simplest expression that can be evaluated (number, character string, or logical value). Mathematical operations involve numeric literals. For example, R evaluates the expression 4*4 and returns the value 16.
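A few of the less familiar operators from Table 1.2 can be tried as follows. The matrix m is a hypothetical 2 x 2 example of our own, used only to show matrix multiplication.

```r
4 * 4            # a mathematical operation on numeric literals
10 %/% 3         # integer divide: how many whole 3s fit into 10
10 %% 3          # modulus: the remainder after integer division
log(exp(8))      # the natural log undoes exp

# Hypothetical 2 x 2 matrix (filled column by column: 1, 2, 3, 4)
m <- matrix(1:4, nrow = 2)
m %*% m          # matrix multiplication (rows times columns)
m * m            # by contrast, * multiplies element by element
```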
The exception to this is when an evaluable expression is assigned an object name: x.

objects() # equivalent
## [1] 'aa' 'ages' 'bb' 'epi.packs' 'price' 'quantity'
## [7] 'subtotal' 'x' 'xx'

Data objects can be saved between sessions. When we quit R, we will be prompted with “Save workspace image?” You can also use save.image at the console prompt. The workspace image is saved in a file called .RData. Use getwd to display the file path to the .RData file. Table 1.4 has more useful R functions.
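The workspace functions mentioned here can be tried in a short session. This is a sketch under our own assumptions: the object yy and the file name demo.RData are hypothetical, and the workspace is saved to a temporary folder rather than to the default .RData so that nothing in your working directory is overwritten.

```r
yy <- 1:5              # hypothetical object created for this illustration
"yy" %in% ls()         # is yy listed among the workspace objects?
rm(yy)                 # remove it
"yy" %in% ls()         # check again

getwd()                # the working directory, where .RData is saved by default

# Save the workspace image to a temporary file instead of .RData
ws_file <- file.path(tempdir(), "demo.RData")
save.image(file = ws_file)
file.exists(ws_file)
```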
Table 1.4: Useful R functions

| Function | Description | Try these examples |
|---|---|---|
| q | Quit R | q() |
| ls, objects | List objects | ls(); objects() #equivalent |
| rm | Remove object(s) | yy |

1.7.5 Is there anything else that I need?

RStudio has everything you will need to use R productively.
Some analysts will choose to use R with a text editor rather than RStudio. Like RStudio, a good text editor makes programming and data processing easier and more efficient. If you are considering a text editor, the features we look for are the following:

- Toggle between wrapped and unwrapped text
- Block cutting and pasting (also called column editing)
- Easy macro programming
- Search and replace using regular expressions
- Ability to import large datasets for editing

When we are programming, we want our text to wrap so we can read all of our code. When we import a data set that is wider than the screen, we do not want the data set to wrap: we want it to appear in its tabular format. Column editing allows us to cut and paste columns of text at will.
A macro is just a way for the text editor to learn a set of keystrokes (including search and replace) that can be executed as needed. Searching using regular expressions means searching for text based on relative attributes.
For example, suppose you want to find all words that begin with “b”, end with “g”, and have any number of letters in between, but not “r” or “f”. Regular expression searching makes this a trivial task. These are powerful features; once we use them regularly, we will wonder how we ever got along without them. If we do not want to install a text editing program, then we can use the default text editor that comes with our computer operating system (gedit in Ubuntu Linux, TextEdit in Mac OS, Notepad in Windows).
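The regular-expression search described above (words beginning with “b”, ending with “g”, with no “r” or “f” in between) can also be tried in R itself. This is only a sketch: the word list is made up, and the pattern is one of several ways to express the rule.

```r
# Hypothetical word list for illustration
words <- c("big", "bring", "bag", "buffing", "boing", "dog")

# ^b            the word starts with b
# [a-eg-qs-z]*  any number of lowercase letters other than f and r
# g$            the word ends with g
pattern <- "^b[a-eg-qs-z]*g$"

hits <- words[grepl(pattern, words, perl = TRUE)]
hits
```

Text editors that support regular expressions accept very similar patterns in their search dialogs.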
However, it is much better to install a text editor that works with R. My favorite text editor is the free and open source GNU Emacs. GNU Emacs can be extended with the “Emacs Speaks Statistics” (ESS) package. For more information on Emacs and ESS pre-installed for Windows and Mac OS, visit.

1.7.6 What’s ahead?

To the novice user, R may seem complicated and difficult to learn. In fact, for its immense power and versatility, R is easier to learn and deploy than other statistical software (e.g., SAS, Stata, SPSS). This is because R was built from the ground up to be an efficient and intuitive programming language and environment. If you understand the logic and structure of R, then learning proceeds quickly. Just like a spoken language, once you know its rules of grammar, syntax, and pronunciation, and can write legible sentences, you can figure out how to communicate almost anything. Before we get into the “trees” (next chapter), we want to describe the “forest”: the logic and structure of working with R objects and epidemiologic data.

1.7.6.1 Working with R objects

For our purposes, there are only five types of data objects in R and five types of actions we take on these objects (Table 1.5). No more, no less.
You will learn to create, name, index (subset), replace components of, and operate on these data objects using a systematic, comprehensive approach. As you learn about each new data object type, it will reinforce and extend what you learned previously.

Table 1.5: Types of actions taken on R data objects

| Action | Vector | Matrix | Array | List | Data frame |
|---|---|---|---|---|---|
| Creating | Table | Table | Table | Table | Table |
| Naming | Table | Table | Table | Table | Table |
| Indexing | Table | Table | Table | Table | Table |
| Replacing | Table | Table | Table | Table | Table |
| Operating on | Table | Table | Table | Table | Table |

A vector is a collection of elements (often numbers).
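The five actions can be previewed on a small vector. This is a minimal sketch; the values and the element names are hypothetical and serve only to show one example of each action.

```r
# Creating: collect elements with c()
ages <- c(34, 56, 45, 23)

# Naming: attach a label to each element (hypothetical labels)
names(ages) <- c("ann", "bob", "cat", "dan")

# Indexing (subsetting): by name, or by a logical condition
ages["bob"]
ages[ages > 40]

# Replacing: assign a new value to an indexed element
ages["dan"] <- 24

# Operating on: apply a function to the whole vector
mean(ages)
```

The same five actions carry over, with small changes in syntax, to matrices, arrays, lists, and data frames.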
curve(log(x/(1 - x)), 0, 1)

What kind of generalizations can you make about the log(odds) as a transformation of risk?

Table 1.7: Estimated per-act risk (transmission probability) for acquisition of HIV, by exposure route to an infected source. Source: CDC

| Exposure route | Risk per 10,000 exposures |
|---|---|
| Blood transfusion (BT) | 9,000 |
| Needle-sharing injection-drug use (IDU) | 67 |
| Receptive anal intercourse (RAI) | 50 |
| Percutaneous needle stick (PNS) | 30 |
| Receptive penile-vaginal intercourse (RPVI) | 10 |
| Insertive anal intercourse (IAI) | 6.5 |
| Insertive penile-vaginal intercourse (IPVI) | 5 |
| Receptive oral intercourse on penis (ROI) | 1 |
| Insertive oral intercourse with penis (IOI) | 0.5 |

Use the data in Table 1.7. Assume one is HIV-negative. If the probability of infection per act is \(p\), then the probability of not getting infected per act is \((1-p)\). The probability of not getting infected after 2 consecutive acts is \((1-p)^2\), and after 3 consecutive acts is \((1-p)^3\).
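The first few steps of this reasoning can be checked numerically. The sketch below uses one row of Table 1.7 (receptive anal intercourse, 50 per 10,000) as the per-act risk; the choice of row is ours, and the code stops at 3 acts.

```r
p <- 50 / 10000        # per-act risk for RAI, from Table 1.7

1 - p                  # probability of not getting infected after 1 act
(1 - p)^2              # after 2 consecutive acts
(1 - p)^3              # after 3 consecutive acts

# Multiplying (1 - p) by itself act after act gives the same sequence
cumprod(rep(1 - p, 3))
```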
Therefore, the probability of not getting infected after \(n\) consecutive acts is \((1-p)^n\), and the probability of getting infected after \(n\) consecutive acts is \(1-(1-p)^n\). For each non-blood transfusion transmission probability (per act risk) in Table 1.7, calculate the cumulative risk of being infected after one year (365 days) if one carries out the same act once daily for one year with an HIV-infected partner. Do these cumulative risks make intuitive sense? Why or why not?

Recommendations for mostly free and open source software (FOSS).

The .R extension, although not necessary, is useful when searching for R command files. Additionally, this file extension is recognized by RStudio and many text editors.

To improve readability, a period (.) or underscore (_) symbol can be used in your object name.

In some operating systems, file names that begin with a period (.) are hidden files and are not displayed by default. You may need to change the viewing option to see the file.

The sixth type of R object is a function. Functions can create, manipulate, operate on, and store data; however, we will use functions primarily to execute a series of R “commands” and not as primary data objects.