In my first post, I introduced you to the two main ideas I have for this blog. One of them is to dig up some old R scripts on my laptop and go through them, reflecting on what was difficult for me back then and how I tried to overcome my difficulties. If you want to know more about why I am doing this, feel free to check out the old post, because for once, I want to try not to ramble too much. So let’s do some aRcheology. Haha.
(I’ll stop with the capitalizing-the-R thing right now.)
I started learning R back at university in an introductory lecture on “survey research” (German: Umfrageforschung) in my second year, in the winter term 2012 - 2013. The lecture was part of a specialization on survey research which was marketed to us as the way to go if you were interested in doing the quantitative stuff. Which I definitely was after the two introductory lectures on methods and statistics. So I decided to join and spent my third semester basically reading and absorbing all of the great Groves et al.’s Survey Methodology1. And trying to learn R in the accompanying R class.
Today’s R script consists of notes / first steps in R from the very first session of this R class2.
Disclaimer: Future Frie here. Almost at the end of writing this post, it dawned on me that I have most likely not written any of the code or comments in this script. I think all of it was provided to us by our professor Peter Selb.3 So I can’t actually reflect on my own code in this case but I’ll try and think about what was difficult for me. I’ll also include some commentary on how I personally do this today and some resources / packages that I deem useful that have since been developed. Finally, it goes without saying that my massive lack of understanding back then is in no way reflective of the teaching skills of our lecturer. I just needed a lot more time than the time available in one semester in order to find access to coding in general and R in particular.
(Important note: If you’re a beginner looking to learn new stuff: This was in 2012, so while some things are still very good to know (the basics haven’t changed after all), much has happened since then. The R for Data Science book is a good place to learn about all the new, exciting R packages that make it (imho) soooo much easier to start writing R code.)
The script: 1. Einfuhrung in R (R Code) (1).r
Two fun facts before we delve into the very few bits of code in there:
- as you might have realized by now, the file is named “1. Einfuhrung in R (R Code) (1).r”. I was a Windows user back then and my folder and file names all contained spaces and capitalized letters because I did not even consider doing it differently. Why would I? Today, I’d tell my younger beginner self to check out this excellent (and funny) slide deck by Jenny Bryan on how to name files4.
- I tried for at least one hour to include the code straight from the R file. However, good old encoding stopped me and I was not in the mental place for encoding debugging. So copy + paste it is. To avoid long scrolling, I’ll go through it step by step. Also, please ignore the encoding issues. Again, not the mental place for it today.
OK, let’s buckle up and get started!
What is R and working in R
    # ?BUNG ZUR VORLESUNG STICHPROBENTHEORIE
    # I. EINF?HRUNG IN R
    # WAS IST R?
    # - Softwareumgebung zur numerischen und grafischen Datenanalyse
    # - Statistische Programmiersprache basierend auf S
    # - Open Source
    # - l?uft auf allen g?ngigen Plattformen
    # - tolle Grafiken
    # - derzeit etwa 4'000 nutzergeschriebene Programme, Peer Review
    # EDITOR
    # - GUI mittelm??ig, daher besser mit externem Editor
    # - f?r Windows z.B. Tinn-R (http://www.sciviews.org/Tinn-R/)
    # - oder Notepad++ (http://notepad-plus-plus.org/), Batch-Programm NPPtoR (http://sourceforge.net/projects/npptor/)
    # - f?r Mac z.B. TextWrangler (http://www.barebones.com/products/TextWrangler/), Applescript und Syntaxhighliting (http://macsci.jelmerborst.nl/files/textwrangler_and_r.php)
    # ALLGEMEINES ZUR EINGABE
    # - R starten trivial, wenn man mit RGUI arbeitet; mit Tinn-R Men?punkt 'R' -> 'Start/close' -> 'Rterm(start)' bzw. 'Rgui(start)'
    # - Eingabe direkt ?ber Befehlsmodus oder ?ber Skript
    # - Zur Entscheidung, ob 'Rterm' oder 'Rgui' unter Tinn-R siehe http://sourceforge.net/projects/tinn-r/forums/forum/481900/topic/3699375
    # - R ist bereit, wenn Eingabeaufforderung > auf der Konsole erscheint
    # - zum Stoppen von R q() eingeben
    # - Eingabe ist unvollst?ndig, wenn + erscheint
    # - R ist schreibungsabh?ngig (gro?/klein)
    # - Dezimalpunkte, nicht Kommata
    # - Abruf bereits verwendeter Befehle durch Pfeiltasten (hoch/runter)
    # - Eingabe von Pfadnamen mit doppelten Backslashes \\ oder einfache Slashes / (aber nicht mit einfachen Backslashes)
    # - # offensichtlich zum Auskommentieren
The file starts with a whole bunch of comments / notes. Sadly, they are in German but to summarize:
- “What is R”: we were told that R is “a software environment for numerical and graphical data analysis”. Which is still true of course but there is also much more possible today (see shiny, plumber, rmarkdown, …). Also, it was emphasized that R is open source (probably in contrast to SPSS / Stata) and that R has “great plots” (yes!).
- “Editor”: This was relevant because RStudio was not widespread enough yet to make an appearance at a German university. So we were told to use a text editor of our choice together with the R console. There was / is even an editor specifically developed for this: Tinn-R. Nowadays, I use RStudio for 99% of the time I code in R but I’ve also seen posts on using VSCode or Atom for R.
- “Some general comments on command entry”: For example, that R uses . and not , as a decimal separator, that it is case sensitive, how to write path names (\\ or /), etc. I probably immediately forgot those helpful tips and wrote 1001 bugs. Nowadays, I am much better at avoiding those mistakes, mostly thanks to two things: first, experience from writing a lot of R code, but most importantly, RStudio’s autocompletion and error highlighting features.
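The command entry tips from the notes can be tried out in a few lines. This is only a small sketch and not part of the original script; the object names and the Windows-style path are made up for illustration.

```r
# Decimal point, not decimal comma: a comma separates arguments instead.
pi_ish <- 3.14
two_values <- c(3, 14)   # two numbers (3 and 14), not the single number 3.14

# R is case sensitive: these are two different objects.
my_value <- 1
My_Value <- 2

# Path names: forward slashes or doubled backslashes inside a string.
# (Hypothetical path, nothing is read from disk here.)
path_forward <- "C:/data/survey.csv"
path_backward <- "C:\\data\\survey.csv"

# The doubled backslash is stored as ONE character, so both strings
# have the same length and convert into each other.
same_length <- nchar(path_forward) == nchar(path_backward)
converted <- gsub("\\", "/", path_backward, fixed = TRUE)
```

A single backslash would start an escape sequence, which is why "C:\data" does not work as a path string.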
Finding help in R
    # HILFE
    # F?r generelle Hilfe in HTML-Format:
    help.start()
    # Demos, z.B. Grafiken
    demo(graphics)
    # Beispiele, etwa Stichprobenziehung
    example(sample)
    # Hilfe zu einer spezifischen Funktion, z.B.
    help(sample)
    # oder einfach
    ?sample
    # Suche nach Schlagw?rtern
    help.search("sampling")
    # oder einfach
    ??sampling
    # N?tzliche Referenzkarte mit den wichtigsten Funktionen unter http://cran.r-project.org/doc/contrib/Short-refcard.pdf
    # Mailing list:
    # https://stat.ethz.ch/mailman/listinfo/r-help
    # Eine gute Einf?hrung in R (100 S.) gibt z.B. cran.r-project.org/doc/contrib/usingR.pdf
    # ... oder wesentlich k?rzer und auf deutsch: Andreas Handl (n.d.) Eine kleine Einf?hrung in R (http://www.wiwi.uni-bielefeld.de/~frohn/Mitarbeiter/Handl/stagrund.html)
    # Aus beiden Dokumenten habe ich mich f?r die heutige Sitzung gro?z?gig bedient
The next part was all about finding help in R. Interestingly, I can’t remember ever having used help.start, demo or example. So just for the record and out of curiosity: what do they do?
- help.start() opens up the start page of the R help. Wow, I didn’t even remember this existed. Well, unless you want to read “The R Language Definition” (I once tried…) or something similar, you probably won’t miss anything.
- demo enters into a sort of interactive mode with text and graphs where you can learn more about a topic. Kind of like a slideshow. Interesting and probably quite useful before GitHub READMEs, vignettes and package websites became more common. My only quarrel with it: I was not able to quit it and had to go through all “slides”. Thanks but no. I occasionally use vim and that’s enough for having to learn how to quit something.5 😉
- example prints the examples for a given function to the console. In principle not uninteresting, but it apparently only finds functions from packages that are currently loaded (base R functions always are).
Of all the tools presented, I only use ? and, very rarely, ?? today. Other ways I use to find out stuff include GitHub READMEs, ebooks, Stack Overflow and many more. I’ll probably write a separate blog post on this topic because this post is quite long already.
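The help tools above all open a viewer or print to the console. For looking things up from within code, base R also offers apropos() and formals(). Neither is mentioned in the original script, so take this as an optional sketch rather than part of the class material.

```r
# All object names on the search path that start with "sample".
# apropos() takes a regular expression and returns a character vector.
hits <- apropos("^sample")

# The argument names of sample(), without opening its help page.
sample_args <- names(formals(sample))

# ?sample and help(sample) would open the full help page instead.
```

This is roughly the console equivalent of what autocompletion does in an IDE.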
Vectors and factors
    # R ALS TASCHENRECHNER
    # R l?sst sich als m?chtiger Taschenrechner verwenden
    # Mathematische Funktionen wie in Stata oder jeder anderen Software, z.B. log(), exp(), sqrt(), abs() etc.
    # DATENSTRUKTUREN
    # VEKTOREN
    # - ein V. ist eine Zusammenfassung von Objekten zu einer Folge von Komponenten
    # - wird mit Funktion c() generiert
    c(20,35,51,43)
    # - Speichern des V. in einer Variablen mittels Zuweisungsoperator <-
    age <- c(20,35,51,43)
    age
    # - bei Zeichenketten "" verwenden
    sex <- c("m","w","w","m")
    sex
    # - als Faktor
    sex <- factor(sex)
At last, some “real” R code. We learned how to create a vector, which is defined in the notes as a “collection of objects that form a sequence of components”. Today, I understand this definition but back then - as hard as it is to imagine for me now - I think I was massively confused by the concept and the creation of vectors.
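What makes vectors less abstract is seeing that operations apply to every element at once and that all elements share one type. Here is a minimal sketch using the ages from the script; the names d_age and mixed are only for illustration.

```r
age <- c(20, 35, 51, 43)

# Arithmetic is vectorised: mean(age) is subtracted from each element.
d_age <- age - mean(age)

# A vector holds ONE type. Mixing types silently coerces everything
# to the most general one, here character.
mixed <- c(1, "a", TRUE)
mixed_class <- class(mixed)
```

The coercion rule is also the reason why a forgotten pair of quotes can turn a whole vector into something unexpected.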
We go on to assign our vector to a variable called age. Then, character vectors. Oh boy, I used to forget the "" a lot. I think it only got better once RStudio introduced auto-completion for quotes.
Finally, we convert our character vector to a factor. Factors, my personal nemesis. I have read the part on factors in Advanced R probably 3 times and I still can’t work with them. I know there is a need for them in modelling but for data cleaning, they’re just the worst and I have tripped so many times. Today, I just avoid factors at all costs: If there is a variable that could be stored as a factor, I’d rather use two variables instead of dealing with levels, labels and all this. If I had to work with them again, I’d probably use the forcats package and hope that Hadley Wickham again found a way to make a very complex thing much easier.
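To make the “levels and labels” trouble concrete, here is a small base R sketch of what a factor stores and of one classic way to trip over it. The year_f example is made up and not from the original script.

```r
sex <- c("m", "w", "w", "m")
sex_f <- factor(sex)

# A factor is an integer vector of codes plus a lookup table of levels.
sex_levels <- levels(sex_f)      # "m" "w"
sex_codes <- as.integer(sex_f)   # 1 2 2 1

# Classic trap: numbers that ended up stored as a factor.
year_f <- factor(c("2012", "2013", "2012"))
wrong <- as.numeric(year_f)                 # the level codes, not the years
right <- as.numeric(as.character(year_f))   # the actual years
```

Going through as.character() first is the usual workaround in base R.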
Data entry, matrices and data frames
    # alternative Eingabe ?ber ein Spreadsheet
    data.entry(age)
    # mathematische Operationen auf Vektoren
    d.age <- age - mean(age)
    d.age
    # Sortieren mit Funktion sort()
    sort(age)
    sort(age, decreasing=TRUE)
    # MATRIZEN
    # Eingabe per Voreinstellung spaltenweise
    age.family <- matrix(c(45,54,42,50,15,20), nrow=2, ncol=3)
    age.family
    # Einfache Operationen, z.B. Summenbildung ?ber Zeilen und Spalten
    rowSums(age.family)
    colSums(age.family)
    # Dasselbe mit apply
    ?apply
    apply(age.family,1,sum)
    apply(age.family,2,sum)
    # DATENTABELLEN
    sexage <- data.frame(age=c(20,35,51,43),sex=c("m","w","w","m"))
    sexage
Brace yourself for data.entry. Excuse me? What is this?
Well,… if I think about it, it was probably something I’d have found pretty exciting back then. After all, the only way I had interacted with data before was via Excel, so a familiar thing was probably highly welcome. Still, I would never use it today because of reproducibility issues. Well, let’s just continue.
We see a matrix command - something I almost never use because why would I use matrices if I can have a data frame (I know, I know… high-dimensional data, maybe performance reasons)? Maybe that was the story of the lecturer as well because we end up creating a nice little data frame.
Other things to note:
- dots in object names: age.family. This was quite common back then. Today, I almost exclusively use underscores, i.e. it would be age_family, which is what the tidyverse style guide recommends. But I am a huge fan of the concept “do what works for you as long as you do it consistently”. Ps: there is even a package called snakecase that helps you with converting your variable names to all sorts of naming “conventions”.
- apply: I am pretty sure I did not understand this at all. I had to use several plyr functions for my BA thesis over two years later and it was still a struggle. In general, functional programming took quite a while for me to warm up to but I’m a huge fan now and often find myself using both the base R lapply and the functions from the purrr package. The purrr tutorial of Jenny Bryan is an awesome way to start out with it.
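For anyone as puzzled by apply as I was: the second argument says over which dimension the function runs (1 = rows, 2 = columns). Below is a minimal base R sketch with the matrix from the script, written with underscores; the vapply part is an addition for illustration and not in the original notes.

```r
# matrix() fills column by column, so the rows are (45, 42, 15) and (54, 50, 20).
age_family <- matrix(c(45, 54, 42, 50, 15, 20), nrow = 2, ncol = 3)

row_totals <- apply(age_family, 1, sum)   # one sum per row
col_totals <- apply(age_family, 2, sum)   # one sum per column

# For data frames, lapply()/vapply() run a function over each column.
sexage <- data.frame(age = c(20, 35, 51, 43),
                     sex = c("m", "w", "w", "m"),
                     stringsAsFactors = FALSE)
col_classes <- vapply(sexage, class, character(1))
```

vapply is the stricter sibling of lapply: it checks that every result has the declared type and length.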
    # INDIZIERUNG
    # in Vektoren
    age[2] # nur das 2. Element
    age[-2] # nicht das 2. Element
    age[1:3] # Elemente 1 bis 3
    age[c(1,2,3)] # dasselbe
    age[age<30] # Elemente, f?r die Bedingung erf?llt
    which(age<30) # Position der Elemente, f?r die Bedingung erf?llt
    # in Matrizen
    age.family[2,3]
    age.family[1,]
    age.family[,2]
    # in Datentabellen
    sexage[]
    sexage[]
    sexage$age
Ahh, indexing. Another topic I could not wrap my head around for quite some time. I just memorized the dataframe$variablename pattern and whenever I had to use the square brackets, I just tried different combinations until I got the element(s) I was looking for. Only several years later - maybe in 2015?! - after working through the “Data structures” and “Subsetting” chapters of the Advanced R book, I really understood why we use $ for data frames (spoiler: data frames are just lists) and when to use single brackets [ ] and when to use double brackets [[ ]]. Today, I mostly use dplyr for data frames and purrr for lists, so I rarely have to use “old school” indexing anymore. But I still think it was a major step for me to finally understand the basic underlying R data structures and the various ways to subset them.
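The difference between the bracket types is easiest to see side by side. Here is a minimal sketch on the sexage data frame from the script; the result names are made up for illustration.

```r
sexage <- data.frame(age = c(20, 35, 51, 43),
                     sex = c("m", "w", "w", "m"),
                     stringsAsFactors = FALSE)

# A data frame is a list of equally long columns.
df_is_list <- is.list(sexage)

one_col_df <- sexage["age"]     # single brackets keep the container: a data frame
age_vec <- sexage[["age"]]      # double brackets pull out the element: a vector
same_vec <- sexage$age          # $ is shorthand for [[ ]] with a name

# With two indices, the first selects rows and the second columns.
row_two <- sexage[2, ]
young_rows <- sexage[sexage$age < 30, ]
```

Note the trailing comma in the last two lines: without it, a single index selects columns, not rows.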
Attach, detach, reading and writing data
    # Funktionen attach() und detach()
    attach(sexage)
    age
    detach(sexage)
    # DATEN EINLESEN UND EXPORTIEREN
    # EINLESEN VON 'INTERNEN' DATEN, die in einem der eigebundenen Pakete enthalten sind
    # hier: Wahlkreisdaten zu den Bundestagswahlen 2002 und 2005 aus dem Paket samplingbook
    data(election)
    head(election)
    # IMPORT VON 'EXTERNEN' DATEN
    # z.B. aus Stata mit der Funktion read.dta() (gibt es auch f?r viele andere Formate, z.B. read.table() (die flexibelste Funktion), read.csv(), read.spss(), read.dbf() etc.)
    ?read.dta
    # z.B. df.name <- read.dta("location/filename.dta")
    # das K?rzel 'df' ist v?llig arbritr?r; es steht f?r einen dataframe, das R-Pendant zu einem (rektangul?ren) Datensatz
    # hier unbedingt auf die Windows-un?bliche Slash-Konvention achten (s.o.)!
    # COPY-PASTING
    # aus der Zwischenablage mit
    read.table("clipboard")
    # DATEN SICHERN
    save.image("c:/.../blabla.RData")
    # Sp?terer Zugriff durch load() bzw. data() (s.o.)
    # DATENEXPORT
    # z.B. nach Stata
    ?write.dta
attach and detach. IIRC, it saved you typing the dataframe$ part of the aforementioned dataframe$variablename pattern. In a time where you had to write subsetting statements like dataframe[dataframe$variable != dataframe$variable2, ] - without autocomplete -, this was kind of cool. Let’s try it quickly, shall we?
    df <- dplyr::data_frame(x = c(1, 2, 3), y = c(4, 5, 6))
    ## Warning: `data_frame()` is deprecated, use `tibble()`.
    ## This warning is displayed once per session.
    attach(df)
    x
    ## [1] 1 2 3
Yep, that checks out. That being said, I wouldn’t use it today because I often have several data frames and lists loaded at once and under those circumstances, I don’t think messing with the environments and the search path is a clever idea. But back then, when we mostly had only one data frame loaded, it definitely made sense.
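Why messing with the search path can bite: an attached data frame sits behind the global environment, so any existing variable with the same name wins. The following is a small base R sketch of that masking, with with() as a scoped alternative; it uses a plain data.frame and made-up values rather than anything from the original script.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
x <- c(10, 20, 30)   # an unrelated variable that happens to share a name

attach(df)           # at top level, attach() reports that x is masked
x_seen <- x          # the earlier x wins, not df$x
y_seen <- y          # y is only found in the attached data frame
detach(df)

# with() evaluates an expression inside the data frame,
# without touching the search path.
x_scoped <- with(df, x)
```

With several data frames attached at once, it gets hard to tell which x a line of code actually refers to.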
Reading in data from a Stata file with read.dta was next on the agenda. As a beginner with average computer skills, I struggled quite a bit with file paths. Over time, this got better (especially after switching to Linux) but I am still very grateful that RStudio projects have eliminated that problem for 99% of cases - unless you’re like me and like to reorganize your files in the project directory all. the. time. Then even RStudio projects can’t save you from constantly breaking your code. Ps: RStudio projects also eliminated the need for the setwd command, which is a pain if you collaborate with others on R scripts. Btw, did you know that using setwd can lead to someone setting your computer on fire? 😉
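Independent of RStudio projects, base R can build paths without hard-coded separators and without setwd. Here is a minimal sketch that writes and re-reads a CSV; tempdir() stands in for a project folder and the file name is made up.

```r
# tempdir() is only a stand-in for a real project directory.
data_dir <- tempdir()
csv_path <- file.path(data_dir, "sexage.csv")   # hypothetical file name

sexage <- data.frame(age = c(20, 35, 51, 43),
                     sex = c("m", "w", "w", "m"),
                     stringsAsFactors = FALSE)

write.csv(sexage, csv_path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character,
# whatever the default of the installed R version is.
sexage_back <- read.csv(csv_path, stringsAsFactors = FALSE)
```

file.path() joins the pieces with forward slashes, which R accepts on Windows as well.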
As for reading data, I mainly use the tidyverse packages for those tasks as they have sensible defaults (no factors!!): readr for csv and other text formats and any package without the horrible rJava dependency for Excel files, e.g. [openxlsx](https://github.com/awalker89/openxlsx). For SPSS and Stata files, there’s the haven package.
The condition 3==4 was ok for me, but the rest most certainly was not, as it required some understanding of indexing. Which I did not have. Again, learning the basics about R data structures and indexing (see above) massively helped me with getting better at writing good conditions.
    # BEDINGUNGEN
    # Spezifikation durch folgende Operatoren
    # == gleich
    # != ungleich
    # < kleiner
    # > gr??er
    # >= gr??er gleich
    # Verkn?pfung von Bedingungen durch
    # & und
    # | oder
    3==4
    sex=="m"
    age[sex=="m"]
    # Funktionen any() und all()
    any(age<30)
    # Mittels dieser Bedingungen k?nnen wir Vektoren teilen
    age.m <- age[sex=="m"]
    mean(age.m)
    age.w <- age[sex=="w"]
    mean(age.w)
    # Weitere M?glichkeiten zur Auswahl von Teilmengen durch Funktionen split() und subset()
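The step from 3==4 to age[sex=="m"] gets easier once the condition is stored as its own logical vector. Here is a minimal sketch with the vectors from the script; the intermediate names are made up for illustration.

```r
age <- c(20, 35, 51, 43)
sex <- c("m", "w", "w", "m")

# A condition on a vector yields one TRUE/FALSE per element ...
is_m <- sex == "m"
# ... and a logical vector used as an index keeps the TRUE positions.
age_m <- age[is_m]
mean_age_m <- mean(age_m)

# Conditions are combined element by element with & (and) and | (or).
young_or_w <- age < 30 | sex == "w"
old_and_w <- age > 40 & sex == "w"

# any() and all() collapse a logical vector into a single value.
any_young <- any(age < 30)
all_young <- all(age < 30)

# split() divides a vector into groups in one go.
by_sex <- split(age, sex)
```

split() returns a named list, so by_sex$m and by_sex$w replace the separate age.m and age.w objects.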
Plotting and installing packages
    # EINFACHE GRAFIKEN
    # Grundfunktion plot(x,y)
    plot(age.family[,1], age.family[,2])
    # anderer Punktetyp
    plot(age.family[,1], age.family[,2])
    # Linie
    plot(age.family[,1], age.family[,2], type="l")
    # mit Beschriftung
    plot(age.family[,1], age.family[,2], type="l", main="Alter", xlab="Vater", ylab="Mutter")
    # mit Legende
    plot(age.family[,1], age.family[,2], type="l", main="Alter", xlab="Vater", ylab="Mutter")
    legend(45,50, c("blablabla"))
    # PAKETE LADEN
    # Vorinstallation von R enth?lt relativ wenige Pakete / Funktionen
    # Diese m?ssen bei Bedarf ?ber einen Server installiert und f?r jede Sitzung geladen werden (v.a. um Arbeitsspeicher zu sparen)
    # ?berblick ?ber Pakete auf http://cran.r-project.org/web/packages/
    # Oder R-spezifische Suchmaschine http://www.rseek.org
    # Am besten men?gesteuert vorgehen: Pakete -> Installiere Paket(e) -> CRAN mirror ausw?hlen (z.B. Switzerland (Z?rich)) -> Packages ausw?hlen (z.B. 'foreign' zum Importieren von Daten)
    # oder aber per Kommando:
    chooseCRANmirror()
    install.packages("foreign")
    # Pakete entfernen:
    # remove.packages("foreign")
    # Installierte Pakete stehen nicht automatisch zur Verf?gung; diese m?ssen f?r jede R-Sitzung eingebunden werden!
    # Ben?tigte Pakete einbinden, z.B.
    library(foreign)
    # PAKETE ZUM THEMA STICHPROBENTHEORIE
    install.packages(c("sampling", "survey", "sampfling", "pps", "samplingbook")) # c() ist eine generische Funktion; kombiniert alle Elemente in einen Vektor
    library(sampling) # Funktionen zur Ziehung und Kalibrierung von Stichproben
    library(survey) # Vielf?ltige Funktionen zur Analyse von Surveydaten
    library(samplingbook) # Funktionen zum Buch von Kauermann und K?chenhoff (2011)
    # Viele Pakete sind im Journal of Statistical Software besprochen
Finally, we learned some basic plotting and how to install and load R packages. Interestingly, there is a recommendation to use the menu to install packages instead of using install.packages. Again, probably something younger me was thankful for as it was closer to what I knew back then (menus, doing everything with the mouse). Current me is a much more keyboard-centric person and I always just type install.packages in the console. Yes, only in the console, not in the code. This is because I personally don’t like it when I execute an R script and it just starts installing stuff on my computer. I’d rather get several “there is no package called x” errors than have something installed on my machine without me explicitly doing so.
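A script can also check for its packages up front and only report what is missing, without installing anything. This is a minimal sketch of one possible way to do that; the second package name is a made-up placeholder that is assumed not to exist.

```r
# "stats" ships with R; "notARealPackage123" is a hypothetical missing package.
needed <- c("stats", "notARealPackage123")

# requireNamespace() returns TRUE/FALSE and never installs anything.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  # Only tell the user; the decision to install stays with them.
  message("Please install first: ", paste(missing_pkgs, collapse = ", "))
}
```

Replacing message() with stop() would make the script fail early instead of halfway through.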
Well, that is the end of the first blog post of the “My old R scripts” series. Writing this, I realized two things:
- Learning but also teaching R back then was very, very different. Not necessarily harder, but definitely different. Many things have happened since then: tidyverse, RStudio, shiny, RMarkdown and so many more.
- Starting out with R was very hard for me. I had not been a “nerd” before: I had started studying political science because I was a truly political person and because I was interested in studying politics, not because I wanted to write complicated code. This is why me and R had a difficult start: I hardly understood anything. In fact, I continued to use Stata instead of R for 1-2 more years after this initial class before I gave R another chance and grew to love it. But more about that in a future post.
I hope this was in some way interesting to you. Next time we’ll look at some R code that I actually wrote myself. Until then: keep coding and remember to be compassionate with your younger R self. They were doing their best. ❤️
1. One of the few books I kept after my studies because it grew close to my heart.
2. It is actually not the oldest R script - based on the modification date - I have on my laptop but I think it is the oldest one I have touched. The other ones are just solutions to R assignments (for another class) that somehow ended up on my computer… :D
3. I contacted him and he said that they had adapted a lot of the code from Kauermann & Küchenhoff 2011 and from a tutorial held by Monia Mahling at LMU. Let’s hope that is ok for attribution purposes. If you recognize any of this code as your own, please contact me fr1e at pm dot me.
4. I still don’t follow all her recommendations though…
5. For all not acquainted with this joke: vim is a text editor that you can use from the command line / terminal. It is not very intuitive at the beginning (or ever, really) and one of the hardest things as a beginner is to find out how to actually quit the editor.