Workshop #2: Collect live Twitter data

2 December 2016 @ 12-2pm

What's an API
What's a for-loop
Workshop Code

What's an API

An Application Programming Interface (API) is a tap that an internet service such as Facebook or Twitter (but also Transport for NSW) installs to facilitate access to its data on a large scale. To use an API, you usually need to register and receive a set of unique credentials. The credentials allow you to query the API within the limits established by the service. For example, Google Maps has an API to geolocate a string of text (let's say sydney opera house). The service is free but limited to 2500 requests per day; if you need a higher limit, you pay.

Ok, let me try out this great Google Maps API in R (OPTIONAL)

install.packages('ggmap')
library(ggmap)
geocode('sydney opera house')

from which you should get

lon lat
1 -95.69871 29.99138

How to get your credentials to access the Twitter API

Create a new Twitter account (even if you already have one, it is better to create a new account dedicated to accessing the Twitter APIs);
Log in to the Twitter Developer site (dev.twitter.com/apps) using your new Twitter account;
Create a new app;
Fill in the form;

Create your Twitter application;
On the page of the new application, click on the tab `Keys and Access tokens`;

Then click on `Create my access token`;

Done! Now you have all four credentials you need: `Consumer Key (API Key)`, `Consumer Secret (API Secret)`, `Access Token`, `Access Token Secret`.
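
As a minimal sketch of where these four credentials end up, the snippet below reads them from environment variables and passes them to `setup_twitter_oauth()` from the twitteR package. The environment variable names are hypothetical (choose your own), and the authentication step only runs if all four values are set and twitteR is installed.

```r
# Hypothetical environment variable names: choose your own and set them
# (for example in your .Renviron file) rather than pasting secrets into scripts.
credential_names <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                      "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Sys.getenv() returns an empty string for any variable that is not set
credentials <- Sys.getenv(credential_names)
credentials_ready <- all(nzchar(credentials))

if (credentials_ready && requireNamespace("twitteR", quietly = TRUE)) {
  # Authenticate this R session with the four credentials
  twitteR::setup_twitter_oauth(
    consumer_key    = credentials[["TWITTER_CONSUMER_KEY"]],
    consumer_secret = credentials[["TWITTER_CONSUMER_SECRET"]],
    access_token    = credentials[["TWITTER_ACCESS_TOKEN"]],
    access_secret   = credentials[["TWITTER_ACCESS_SECRET"]]
  )
} else {
  message("Credentials not set (or twitteR not installed): skipping authentication.")
}
```

Keeping the credentials out of the script itself means you can share your code without sharing your keys.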

What's a for-loop

A for-loop is such a key concept in programming that it even has its own Wikipedia article.

Simply, a for-loop repeats a task a number of times (possibly an infinite number of times). So let's say we have a task, which we can express as a list of elementary functions. Remember, a function is a predefined set of instructions that we execute by calling it with myFunction(); a function can optionally take inputs and return an output. So we are sitting at our desk, next to a pile of papers, and our task is to mark all the papers. Let's think about papers as a vector containing a series of 100 objects (the students' papers). We basically need two functions to do the marking:

readPaper()
markPaper()

and we need to repeat them 100 times.

A for-loop allows us to do exactly that. We take our papers (one by one) from the pile of papers and once marked we put them in a different pile (let's call it marked_papers). This is how you would do your marking in R with a for-loop (remember that the symbol <- assigns the output of a function to an object):


  marked_papers <- character()                       # empty vector for the marked pile

  for (paper in papers) {
    idea_of_paper <- readPaper(paper)                # read the current paper
    marked_paper <- markPaper(idea_of_paper)         # mark it
    marked_papers <- c(marked_papers, marked_paper)  # add it to the marked pile
  }

Let's try to understand this code. First, we already have the object papers but we need to create a new (empty) object to store our marked papers. You don't want to lose all your work! So, with the first line we create a new, empty character vector and we call it marked_papers.

The actual loop starts on the second line. We declare the start of the loop with for (...) { and we close it with }. Everything within { ... } will be repeated a number of times. Yes, but how many times? The instructions for our loop are (paper in papers), which reads: "for every single paper in papers run the following lines". In other words, the number of iterations of the for-loop will depend on the number of objects contained in papers.

There is something important to understand here. An object paper is created at the beginning of each iteration (containing a different paper every time). That is, each value of paper lasts only for one iteration: if you don't save it somewhere, it is overwritten at the next one. We are not interested in saving a copy of each paper (which is already contained in our papers) but only a copy of it once it has been marked (marked_paper). But again, at every iteration the line marked_paper <- markPaper(idea_of_paper) will replace any previously existing marked_paper with a new one. At the end of the for-loop, after 100 iterations, there will be only one marked_paper in memory: the last one. To store all the papers we have marked we need to combine (with the function c()) each one into our vector marked_papers, which will be of length 0 at the beginning of the first iteration and of length 100 at the end of the last.
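
To run the loop above yourself you need some papers and the two functions. The sketch below invents all three for illustration (three short strings as papers, a readPaper() that counts words, and a markPaper() that writes a label); none of these definitions come from the workshop code.

```r
# Made-up stand-ins: three "papers" instead of 100
papers <- c("APIs are taps", "Loops repeat tasks many times", "R is fun")

# Hypothetical readPaper(): our "idea" of a paper is its number of words
readPaper <- function(paper) {
  length(strsplit(paper, " ")[[1]])
}

# Hypothetical markPaper(): turn the idea into a written mark
markPaper <- function(idea_of_paper) {
  paste0("marked (", idea_of_paper, " words)")
}

marked_papers <- character()

for (paper in papers) {
  idea_of_paper <- readPaper(paper)
  marked_paper <- markPaper(idea_of_paper)
  marked_papers <- c(marked_papers, marked_paper)
}

length(marked_papers)
# [1] 3

marked_paper   # only the last marked paper is left in this object
# [1] "marked (3 words)"
```

After the loop, marked_papers holds one mark per paper, while marked_paper holds only the result of the final iteration.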

Workshop Code

The code is particularly dense this time. In addition to the for-loop (which is a controller or control-flow construct), we use new packages and new functions, and we introduce some logical operators. The code for the next workshop is here. Download the file and open it in RStudio. But I suggest you read this section first, to get an idea of the new packages and programming concepts introduced with the code.

Packages

twitteR This package allows you to get data directly from Twitter (API credentials needed)
tm This package is very popular for conducting text analysis in R (see Introduction to the tm Package: Text Mining in R)
dplyr This package makes it easy to manipulate data (see Introduction to dplyr)
ggplot2 This package is probably the most popular graphics package in R (see ggplot2 Elegant Graphics for Data Analysis [free access from opac.library.usyd.edu.au] and Introduction to R Graphics with ggplot2)
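
Before the workshop you may want to check which of these packages are already on your machine. The following is a small sketch of one way to do it, using a for-loop and the base R function requireNamespace(); the variable names are just illustrative, and the install command is left as a comment so that nothing is downloaded unless you choose to run it.

```r
packages <- c("twitteR", "tm", "dplyr", "ggplot2")
missing_packages <- character()

for (package in packages) {
  # requireNamespace() returns FALSE if the package cannot be found
  if (!requireNamespace(package, quietly = TRUE)) {
    missing_packages <- c(missing_packages, package)
  }
}

if (length(missing_packages) > 0) {
  message("Not installed yet: ", paste(missing_packages, collapse = ", "))
  # Uncomment the next line to install them:
  # install.packages(missing_packages)
} else {
  message("All four packages are installed.")
}
```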

Operators

The two most common classes of operators are relational operators and logical operators.

Relational operators are

x < y, which tests if x is less than y
x > y, which tests if x is more than y
x <= y, which tests if x is less than or equal to y
x >= y, which tests if x is more than or equal to y
x == y, which tests if x is equal to y (different from =, which doesn't test anything but assign the value on the right-side of the sign to the variable on the left-side)
x != y, which tests if x is different from y

The most common logical operators are

!, which indicates logical negation
&, which indicates logical AND
|, which indicates logical OR

Let's start now by assigning the value 5 to variable

variable = 5

Then let's run some tests using the relational operators


variable > 10 

# [1] FALSE 

variable == 5 

# [1] TRUE 

variable >= 5 

# [1] TRUE 

variable != 5 

# [1] FALSE

Finally let's combine them with the logical operators (remember, the value of variable is 5)


!(variable == 5) 

# [1] FALSE 

(variable == 5) & (variable+1 < 6)

# [1] FALSE

but


(variable == 5) | (variable+1 < 6) 

# [1] TRUE
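
The same operators also work on a whole vector at once, which is how they are typically used to filter data. The sketch below uses made-up retweet counts (the retweets vector is an assumption for illustration, not workshop data).

```r
# Made-up retweet counts for five tweets
retweets <- c(0, 3, 12, 7, 25)

# A relational operator tests every element of the vector
retweets > 5
# [1] FALSE FALSE  TRUE  TRUE  TRUE

# AND: keep tweets with more than 5 but fewer than 20 retweets
popular <- retweets[retweets > 5 & retweets < 20]
popular
# [1] 12  7

# OR: flag tweets with no retweets at all or with more than 20
is_extreme <- retweets == 0 | retweets > 20
is_extreme
# [1]  TRUE FALSE FALSE FALSE  TRUE
```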

Controllers

Controllers (more precisely control-flow constructs) are fundamental in every programming language because they allow the programmer to add conditions to the flow of the program. We have already described the use of one controller (for). The other popular controller is if (which we can combine with else).


  if (variable == 2) {
    doSomething()
  } else {
    doSomethingElse()
  }
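
The two controllers are often used together: a for-loop visits each element and an if/else decides what to do with it. Here is a minimal sketch, assuming a made-up vector of scores and a pass mark of 50 (both invented for illustration).

```r
# Made-up scores for four papers
scores <- c(45, 72, 50, 88)
results <- character()

for (score in scores) {
  if (score >= 50) {
    result <- "pass"
  } else {
    result <- "fail"
  }
  # Store the outcome of this iteration before it is overwritten
  results <- c(results, result)
}

results
# [1] "fail" "pass" "pass" "pass"
```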