My old R scripts: 1. Einfuhrung in R (R Code) (1).r

I have a look at the oldest R script on my laptop and then realize halfway through that I did not write it myself. Still, major #throwback vibes and a lot of wtf moments.

In my first post, I introduced you to the two main ideas I have for this blog. One of them is to dig up some old R scripts on my laptop and go through them, reflecting on what was difficult for me back then and how I tried to overcome those difficulties. If you want to know more about why I am doing this, feel free to check out the old post. Because for once, I want to try not to ramble too much. So let’s do some aRcheology. Haha.

EaRly days

(I’ll stop capitalizing the R right now.)

I started learning R at university in an introductory lecture on “survey research” (German: Umfrageforschung) in my second year, in the winter term 2012-2013. The lecture was part of a specialization in survey research, which was marketed to us as the way to go if you were interested in doing the quantitative stuff. Which I definitely was after the two introductory lectures on methods and statistics. So I decided to join and spent my third semester basically reading and absorbing all of the great Survey Methodology by Groves et al.¹ And learning, or rather trying to learn, R in the accompanying R class.

Today’s R script contains notes / first steps in R from the very first session of this R class.²

Disclaimer: Future Frie here. Almost at the end of writing this post, it dawned on me that I have most likely not written any of the code or comments in this script. I think all of it was provided to us by our professor Peter Selb.³ So I can’t actually reflect on my own code in this case but I’ll try and think about what was difficult for me. I’ll also include some commentary on how I personally do this today and some useful resources / packages that have been developed since. Finally, it goes without saying that my massive lack of understanding back then is in no way reflective of the teaching skills of our lecturer. I just needed a lot more time than one semester to find my way into coding in general and R in particular.

(Important note: If you’re a beginner looking to learn new stuff: This was in 2012, so while some things are still very good to know (the basics haven’t changed after all), much has happened since then. The R for Data Science book is a good place to learn about all the new, exciting R packages that make it (imho) soooo much easier to start writing R code.)

The script: 1. Einfuhrung in R (R Code) (1).r

Two fun facts before we delve into the very few bits of code in there:

OK, let’s buckle up and get started!

What is R and working in R

# EXERCISE FOR THE LECTURE ON SAMPLING THEORY

# I. INTRODUCTION TO R


# WHAT IS R?
# - software environment for numerical and graphical data analysis
# - statistical programming language based on S
# - open source
# - runs on all common platforms
# - great graphics
# - currently about 4'000 user-written programs, peer review

# EDITOR
# - the GUI is mediocre, so better use an external editor
# - for Windows e.g. Tinn-R (http://www.sciviews.org/Tinn-R/)
# - or Notepad++ (http://notepad-plus-plus.org/), batch program NPPtoR (http://sourceforge.net/projects/npptor/)
# - for Mac e.g. TextWrangler (http://www.barebones.com/products/TextWrangler/), AppleScript and syntax highlighting (http://macsci.jelmerborst.nl/files/textwrangler_and_r.php)


# GENERAL NOTES ON INPUT
# - starting R is trivial when working with RGUI; in Tinn-R use the menu item 'R' -> 'Start/close' -> 'Rterm(start)' or 'Rgui(start)'
# - input either directly in command mode or via a script
# - for choosing between 'Rterm' and 'Rgui' under Tinn-R see http://sourceforge.net/projects/tinn-r/forums/forum/481900/topic/3699375
# - R is ready when the prompt > appears on the console
# - to quit R, type q()
# - the input is incomplete when + appears
# - R is case-sensitive (upper/lower case)
# - decimal points, not commas
# - recall previously used commands with the arrow keys (up/down)
# - enter path names with double backslashes \\ or single slashes / (but not with single backslashes)
# - # obviously for commenting out

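The file starts with a whole bunch of comments / notes. The originals were in German, but to summarize: they cover what R is, why an external editor beats the built-in GUI, and the basic rules for typing commands (the > prompt, the + continuation sign, case sensitivity, decimal points instead of commas, slashes in paths and # for comments).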
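A few of the input rules from these notes, as a minimal sketch. The object names and the Windows-style path are made up for illustration and the file does not need to exist:

```r
# R is case-sensitive: x and X are two different objects
x <- 1.5   # decimal point, not a decimal comma
X <- 2

# Inside a string, a backslash has to be doubled; a forward slash works as is
path_backslashes <- "C:\\data\\survey.csv"   # hypothetical path
path_slashes     <- "C:/data/survey.csv"

# Both spellings describe the same location once the separators are unified
unified <- gsub("\\", "/", path_backslashes, fixed = TRUE)
print(unified == path_slashes)
```

I left out the > and + prompts on purpose because they are printed by the console and are not something you type.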

Finding help in R

# HELP
# For general help in HTML format:
help.start()
# Demos, e.g. graphics
demo(graphics)
# Examples, e.g. drawing a sample
example(sample)
# Help on a specific function, e.g.
help(sample)
# or simply
?sample
# Search by keyword
help.search("sampling")
# or simply
??sampling
# Useful reference card with the most important functions at http://cran.r-project.org/doc/contrib/Short-refcard.pdf
# Mailing list:
# https://stat.ethz.ch/mailman/listinfo/r-help
# A good introduction to R (100 pp.) is e.g. cran.r-project.org/doc/contrib/usingR.pdf
# ... or much shorter and in German: Andreas Handl (n.d.) Eine kleine Einführung in R (http://www.wiwi.uni-bielefeld.de/~frohn/Mitarbeiter/Handl/stagrund.html)
# I borrowed generously from both documents for today's session

The next part was all about finding help in R. Interestingly, I can’t remember having ever used help.start, demo or example. So just for the record and out of curiosity: what do they do?
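For the record, here is a small sketch of what these three do, written so that it also runs outside an interactive session. The variable names are mine; the functions are the base R ones from the script:

```r
# help.start() opens the HTML help index in a browser,
# so it only makes sense in an interactive session
if (interactive()) help.start()

# demo() without a topic lists the demos a package ships;
# demo(graphics) would then run the one called "graphics"
available_demos <- demo(package = "graphics")
demo_names <- available_demos$results[, "Item"]
print(demo_names)

# example(sample) runs the "Examples" section of ?sample;
# give.lines = TRUE returns that code as text instead of running it
sample_example <- example(sample, give.lines = TRUE)
print(head(sample_example))
```

So demo shows off what a package can do, while example runs the code at the bottom of a help page.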

Of all the tools presented, I only use ? and very rarely ?? today. Other ways I use to find out stuff include GitHub READMEs, ebooks, Stack Overflow and many more. I’ll probably write a separate blog post on this topic because this post is quite long already.

Vectors and factors

# R AS A CALCULATOR
# R can be used as a powerful calculator
# Mathematical functions as in Stata or any other software, e.g. log(), exp(), sqrt(), abs() etc.


# DATA STRUCTURES
# VECTORS
# - a vector combines objects into a sequence of components
# - it is created with the function c()
c(20,35,51,43)
# - store the vector in a variable using the assignment operator <-
age <- c(20,35,51,43)
age
# - use "" for character strings
sex <- c("m","w","w","m")
sex
# - as a factor
sex <- factor(sex)

At last, some “real” R code. We learned how to create a vector which is defined in the notes as a “collection of objects that form a sequence of components”. Today, I understand this definition but back then - as hard as it is to imagine for me now - I think I was massively confused by the concept and the creation of vectors.

We go on to assign our vector to a variable called age. Then, character vectors. Oh boy, I used to forget the "" a lot. I think it only got better once RStudio introduced auto-completion for "".
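What might have helped me back then is seeing that every element of a vector has the same type. A minimal sketch (the mixed vector is my own example, not from the script):

```r
age <- c(20, 35, 51, 43)
print(class(age))            # a numeric vector

# Forgetting or adding quotes changes the type of the whole vector:
# c() silently converts everything to character here
mixed <- c(20, "m")
print(class(mixed))

# Arithmetic works on all elements at once
d_age <- age - mean(age)
print(d_age)
```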

Finally, we convert our character vector to a factor. Factors, my personal nemesis. I have read the part on factors in Advanced R probably 3 times and I still can’t work with them. I know there is a need for them in modelling but for data cleaning, they’re just the worst and I have tripped so many times. Today, I just avoid factors at all costs: If there is a variable that could be stored as a factor, I’d rather use two variables instead of dealing with levels, labels and all this. If I had to work with them again, I’d probably use the forcats package and hope that Hadley Wickham again found a way to make a very complex thing much easier.
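To show the kind of thing I keep tripping over, here is a small base R sketch. The year example is made up and stands in for any numbers that end up stored as a factor:

```r
sex <- factor(c("m", "w", "w", "m"))
print(levels(sex))        # the distinct labels, sorted
print(as.integer(sex))    # the integer codes stored underneath

# The classic trap: numbers stored as a factor
year <- factor(c("2012", "2013", "2012"))
codes <- as.numeric(year)                 # the codes, not the years
years <- as.numeric(as.character(year))   # convert via character to get the years
print(codes)
print(years)
```

The forcats package wraps many of these operations in friendlier functions, but I kept this in base R so that it runs without extra installs.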

Data entry, matrices and data frames

# alternative input via a spreadsheet
data.entry(age)
# mathematical operations on vectors
d.age <- age - mean(age)
d.age
# sorting with the function sort()
sort(age)
sort(age, decreasing=TRUE)

# MATRICES
# filled column by column by default
age.family <- matrix(c(45,54,42,50,15,20), nrow=2, ncol=3)
age.family
# simple operations, e.g. sums over rows and columns
rowSums(age.family)
colSums(age.family)
# the same with apply
?apply
apply(age.family,1,sum)
apply(age.family,2,sum)

# DATA FRAMES
sexage <- data.frame(age=c(20,35,51,43),sex=c("m","w","w","m"))
sexage

Brace yourself for data.entry… Ready?

[GIF: data entry, via GIPHY]

Excuse me? What is this?

Well,… if I think about it, it was probably something I’d have found pretty exciting back then. After all, the only way I had interacted with data before was via Excel, so a familiar thing was probably highly welcome. Still, I would never use it today because of reproducibility issues. Well, let’s just continue.

We see a matrix command - something I almost never use because why would I use matrices if I can have a data frame (I know, I know… high-dimensional data, maybe performance reasons)? Maybe the lecturer felt the same way, because we end up creating a nice little data frame.
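A quick sketch of the two matrix points from the script: values are filled column by column unless you ask otherwise, and apply with margin 1 or 2 gives the same result as rowSums and colSums. The byrow variant is my addition for contrast:

```r
values <- c(45, 54, 42, 50, 15, 20)

# Default: the first two values form column 1, the next two column 2, ...
age.family <- matrix(values, nrow = 2, ncol = 3)
print(age.family)

# With byrow = TRUE the first three values form row 1 instead
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# Margin 1 means "over rows", margin 2 means "over columns"
row_totals <- apply(age.family, 1, sum)
col_totals <- apply(age.family, 2, sum)
print(row_totals)
print(col_totals)
```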

Other things to note:

Indexing

# INDEXING
# in vectors
age[2]         # only the 2nd element
age[-2]        # everything but the 2nd element
age[1:3]       # elements 1 to 3
age[c(1,2,3)]  # the same
age[age<30]    # elements for which the condition holds
which(age<30)  # positions of the elements for which the condition holds
# in matrices
age.family[2,3]
age.family[1,]
age.family[,2]
# in data frames
sexage[[1]]
sexage[[2]]
sexage$age

Ahh, indexing. Another topic I could not wrap my head around for quite some time. I just memorized the dataframe$variablename pattern and whenever I had to use the square brackets, I just tried different combinations until I got the element(s) I was looking for. Only several years later - maybe in 2015?! - after working through the “Data structures” and “Subsetting” chapters of the Advanced R book, I really understood why we use $ for data frames (spoiler: data frames are just lists) and when to use one [] and when to use [[]]. Today, I mostly use dplyr for data frames and purrr for lists, so I rarely have to use “old school” indexing anymore. But I still think it was a major step for me to finally understand the basic underlying R data structures and the various ways to subset them.
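The difference that finally clicked for me, as a minimal sketch using the data frame from the script (the variable names on the left are mine):

```r
sexage <- data.frame(age = c(20, 35, 51, 43), sex = c("m", "w", "w", "m"))

# A data frame is a list of equally long columns
print(is.list(sexage))

# Single brackets keep the container: the result is still a data frame
one_bracket <- sexage["age"]
print(class(one_bracket))

# Double brackets and $ pull out the content: the result is a plain vector
two_brackets <- sexage[["age"]]
dollar <- sexage$age
print(identical(two_brackets, dollar))
```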

Attach, detach, reading and writing data

# Functions attach() and detach()
attach(sexage)
age
detach(sexage)


# READING IN AND EXPORTING DATA

# READING 'INTERNAL' DATA that are contained in one of the loaded packages
# here: constituency data on the Bundestag elections 2002 and 2005 from the package samplingbook
data(election)
head(election)

# IMPORTING 'EXTERNAL' DATA
# e.g. from Stata with the function read.dta() (similar functions exist for many other formats, e.g. read.table() (the most flexible function), read.csv(), read.spss(), read.dbf() etc.)
?read.dta
# e.g. df.name <- read.dta("location/filename.dta")
# the abbreviation 'df' is completely arbitrary; it stands for a data frame, R's counterpart to a (rectangular) data set
# be sure to follow the slash convention here, which is unusual for Windows (see above)!

# COPY-PASTING
# from the clipboard with read.table("clipboard")

# SAVING DATA
save.image("c:/.../blabla.RData")
# later access via load() or data() (see above)

# DATA EXPORT
# e.g. to Stata
?write.dta

Ahh, attach and detach. IIRC, it saved you from typing the dataframe$ part of the aforementioned dataframe$variablename pattern. At a time when you had to write subsetting statements like dataframe[dataframe$variable != dataframe$variable2, ] - without autocomplete -, this was kind of cool. Let’s try it quickly, shall we?

# tibble() replaces the deprecated dplyr::data_frame()
df <- tibble::tibble(x = c(1, 2, 3), y = c(4, 5, 6))
attach(df)
print(x)
#> [1] 1 2 3
detach(df)

Yep, that checks out. That being said, I wouldn’t use it today because I often have several data frames and lists loaded at once and under those circumstances, I don’t think messing with the environments and the search path is a clever idea. But back then, when we mostly had only one data frame loaded at a time, it definitely made sense.
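To make the search path part a bit more concrete, here is a sketch in base R. It also shows with(), which is one way (my suggestion, not something from the script) to get the same typing shortcut without changing the search path:

```r
sexage <- data.frame(age = c(20, 35, 51, 43), sex = c("m", "w", "w", "m"))

# attach() puts the data frame on the search path until detach() is called
attach(sexage)
attached_now <- "sexage" %in% search()
detach(sexage)
attached_later <- "sexage" %in% search()
print(c(attached_now, attached_later))

# with() looks up the columns only while evaluating this one expression
mean_age <- with(sexage, mean(age))
print(mean_age)
```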

Reading in data from a Stata file with read.dta was next on the agenda. As a beginner with average computer skills, I struggled quite a bit with file paths. Over time, this got better (especially after switching to Linux) but I am still very grateful that RStudio projects have eliminated that problem for 99% of cases - unless you’re like me and like to reorganize your files in the project directory all. the. time. Then even RStudio projects can’t save you from constantly breaking your code. PS: RStudio projects also eliminated the need for the setwd command, which is a pain if you collaborate with others on R scripts. Btw, did you know that using setwd can lead to someone setting your computer on fire? 😉

As for reading data, I mainly use the tidyverse packages for those tasks as they have sensible defaults (no factors!!): readr for CSV and other text formats and any package without the horrible rJava dependency for Excel files, e.g. readxl or openxlsx. For SPSS and Stata files, there’s the haven package.
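A minimal round trip to illustrate the “no factors” point. I am using base R here only so that the sketch runs without extra packages; the file name is made up and the file lives in a temporary directory:

```r
# Write a tiny CSV to a temporary file so the example is self-contained
csv_path <- file.path(tempdir(), "sexage_example.csv")   # hypothetical file name
write.csv(
  data.frame(age = c(20, 35, 51, 43), sex = c("m", "w", "w", "m")),
  csv_path,
  row.names = FALSE
)

# stringsAsFactors = FALSE keeps text columns as character vectors
# (the default since R 4.0.0, spelled out here for older versions)
sexage <- read.csv(csv_path, stringsAsFactors = FALSE)
str(sexage)

unlink(csv_path)   # clean up the temporary file
```

With readr, the reading step would be readr::read_csv(csv_path), which never creates factors unless you ask for them.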

Conditions

Conditions. Probably 3==4 was ok for me but the rest most certainly not as it required some understanding of indexing. Which I had not (see above). Again, learning the basics about R data structures and indexing (see above) massively helped me with getting better at writing good conditions.

# CONDITIONS
# specified with the following operators
# ==  equal
# !=  not equal
# <    less than
# >    greater than
# >=  greater than or equal
# combining conditions with
# &    and
# |    or
3==4
sex=="m"
age[sex=="m"]
# functions any() and all()
any(age<30)

# using these conditions we can split vectors
age.m <- age[sex=="m"]
mean(age.m)
age.w <- age[sex=="w"]
mean(age.w)
# further options for selecting subsets: the functions split() and subset()
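The script only mentions split() and subset() in its last comment, so here is a small sketch of how they relate to the conditions above. It reuses the age and sex vectors from the script; the names on the left are mine:

```r
age <- c(20, 35, 51, 43)
sex <- c("m", "w", "w", "m")

# A condition returns one TRUE/FALSE per element
is_m <- sex == "m"
print(is_m)

# Conditions combine element by element with & (and) and | (or)
young_men <- age[is_m & age < 30]
print(young_men)

# split() cuts a vector into groups; sapply() then applies mean() to each group
mean_by_sex <- sapply(split(age, sex), mean)
print(mean_by_sex)

# subset() filters the rows of a data frame with the same kind of condition
sexage <- data.frame(age = age, sex = sex)
older_women <- subset(sexage, sex == "w" & age > 40)
print(older_women)
```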

Plotting and installing packages

# SIMPLE PLOTS
# basic function plot(x,y)
plot(age.family[,1], age.family[,2])
# different point type (pch=19, filled circles, is only one possible choice)
plot(age.family[,1], age.family[,2], pch=19)
# line
plot(age.family[,1], age.family[,2], type="l")
# with labels
plot(age.family[,1], age.family[,2], type="l", main="Alter", xlab="Vater", ylab="Mutter")
# with a legend
plot(age.family[,1], age.family[,2], type="l", main="Alter", xlab="Vater", ylab="Mutter")
legend(45,50, c("blablabla"))


# LOADING PACKAGES
# the base installation of R contains relatively few packages / functions
# further ones have to be installed from a server when needed and loaded for every session (mainly to save memory)
# overview of packages at http://cran.r-project.org/web/packages/
# or the R-specific search engine http://www.rseek.org
# best done via the menu: Packages -> Install package(s) -> choose a CRAN mirror (e.g. Switzerland (Zürich)) -> choose packages (e.g. 'foreign' for importing data)
# or by command:
chooseCRANmirror()
install.packages("foreign")
# removing packages:
# remove.packages("foreign")
# installed packages are not available automatically; they have to be loaded in every R session!
# load the packages you need, e.g.
library(foreign)

# PACKAGES ON SAMPLING THEORY
install.packages(c("sampling", "survey", "sampfling", "pps", "samplingbook"))
# c() is a generic function; it combines all elements into a vector
library(sampling)    # functions for drawing and calibrating samples
library(survey)      # a wide range of functions for analysing survey data
library(samplingbook)  # functions accompanying the book by Kauermann and Küchenhoff (2011)
# many packages are discussed in the Journal of Statistical Software

Finally, we learned some basic plotting and how to install and load R packages. Interestingly, there is a recommendation to use the menu to install packages instead of using install.packages. Again, probably something younger me was thankful for as it was closer to what I knew back then (menus, doing everything with the mouse). Current me is a much more keyboard-centric person and I always just type install.packages in the console. Yes, only in the console, not in the code. This is because I personally don’t like it when I execute an R script and it just starts installing stuff on my computer. I’d rather get several “there is no package called x” errors than have something installed on my machine without me explicitly doing so.
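If you want a script to complain early instead of installing anything, one possible pattern looks like this. The helper missing_packages is a name I made up, and the package list is just a placeholder:

```r
# Return the names of packages that are not installed, without installing anything
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("stats", "utils")   # placeholder list; swap in your own packages
not_there <- missing_packages(needed)

# Stop with one clear message and leave the installing to the person running the script
if (length(not_there) > 0) {
  stop("Please install first: ", paste(not_there, collapse = ", "))
}
```

This way the decision to install stays with whoever runs the script, and they get all missing names at once instead of one error per library call.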

End

Well, that is the end of the first blog post of the “My old R scripts” series. Writing this, I realized two things:

  1. Learning but also teaching R back then was very, very different. Not necessarily harder, but definitely different. Many things have happened since then: tidyverse, RStudio, shiny, RMarkdown and so many more.
  2. Starting out with R was very hard for me. I had not been a “nerd” before: I had started studying political science because I was a truly political person and because I was interested in studying politics, not because I wanted to write complicated code. This is why R and I had a difficult start: I hardly understood anything. In fact, I continued to use Stata instead of R for 1-2 more years after this initial class before I gave R another chance and grew to love it. But more about that in a future post.

I hope this was in some way interesting to you. Next time we’ll look at some R code that I actually wrote myself. Until then: keep coding and remember to be compassionate with your younger R self. They were doing their best. ❤️


  1. One of the few books I kept after my studies because it grew close to my heart.

  2. It is actually not the oldest R script - based on the modification date - I have on my laptop but I think it is the oldest one I have touched. The other ones are just solutions to R assignments (for another class) that somehow ended up on my computer… :D

  3. I contacted him and he said that they had adapted a lot of the code from Kauermann & Küchenhoff 2011 and from a tutorial held by Monia Mahling at LMU. Let’s hope that is ok for attribution purposes. If you recognize any of this code as your own, please contact me at fr1e at pm dot me.

  4. I still don’t follow all her recommendations though…

  5. For all not acquainted with this joke: vim is a text editor that you can use from the command line / terminal. It is not very intuitive at the beginning (or ever, really) and one of the hardest things as a beginner is to find out how to actually quit the editor.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://gitlab.com/friep/blog, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".