The tutorial will work through practical examples showing you how to extract and visualize Twitter data using R.
In order to extract tweets, you will need a Twitter application and hence a Twitter account. If you don’t have a Twitter account, please sign up.
Use your Twitter login ID and password to sign in at Twitter Developers. Once signed in, click the button Create New App in the upper-right corner.
Fill in the required application details, including Name, Description, and Website. Note that the Name must be unique; if your chosen name has been taken, try another. Click the button Create your Twitter application in the lower-left corner.
Click the Keys and Access Tokens tab under your created Twitter application, then click the button Create my access token in the lower-left corner.
Keep the values of Consumer Key (API Key), Consumer Secret (API Secret), Access Token, and Access Token Secret handy for future use, and keep them secret: anyone who obtains these keys can effectively access your Twitter account.
For the purpose of this tutorial, we will need the following packages:
ROAuth: Provides an interface to the OAuth 1.0 specification, allowing users to authenticate via OAuth to the server of their choice.
twitteR: Provides an interface to the Twitter web API.
if (!require(twitteR)) {install.packages("twitteR")}
if (!require(ROAuth)) {install.packages("ROAuth")}
library(twitteR)
library(ROAuth)
Authorize the app to use your account, i.e., establish the handshake between Twitter and R.
# you must get the following information from the Twitter App you just created
my.consumer.key = "fH4IijcQUrwxEQ3mmb6G2gzUc"
my.consumer.secret = "FxkuV6ePyFaia2LmyxetoH50IxGrQcEYbwnLe3EjVDWsCdPrhJ"
my.access.token = "99989439-wG82y2hMmAmlJ1iIlQgNu0l65ZOKVuVscj4Idm9Xu"
my.access.token.secret = "QNehvUZOGNNZyqFYDCAP6tlWEHWIBbKIhiqPEKAM2SOoT"
my_oauth <- setup_twitter_oauth(consumer_key = my.consumer.key,
                                consumer_secret = my.consumer.secret,
                                access_token = my.access.token,
                                access_secret = my.access.token.secret)
[1] "Using direct authentication"
save(my_oauth, file = "my_oauth.Rdata")
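Because these keys grant access to your account, you may prefer to keep them out of the script entirely. One option is a sketch like the following, which reads the keys from environment variables (the variable names here are hypothetical) set in your .Renviron file:
# optional: read credentials from environment variables instead of
# hard-coding them in the script
my.consumer.key <- Sys.getenv("TWITTER_CONSUMER_KEY")
my.consumer.secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
my.access.token <- Sys.getenv("TWITTER_ACCESS_TOKEN")
my.access.token.secret <- Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")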
search.string <- "#HurricaneNate"
result.term <- searchTwitter(search.string, n = 100)
head(result.term)
[[1]]
[1] "mdorsey92: RT @jacobdickeywx: The eyewall approaches #HURRICANENATE #NATE #GULFPORT https://t.co/uFaobimvjY"
[[2]]
[1] "Mi_Puerto_Rico: RT @Readygov: Before heading to bed make sure you have different ways to receive emergency alerts throughout the night. #HurricaneNate http…"
[[3]]
[1] "SVH2: RT @FoxNews: .@AdamKlotzFNC on #HurricaneNate: \"Currently you're looking at 90 mph winds, that's a Category 1 storm.\" https://t.co/RnTJDFp7…"
[[4]]
[1] "YusefforPeace: RT @MeritLaw: Our volunteers at #AmericanBlackCross are preparing for #HurricaneNate. Visit https://t.co/jCrBVhYPN4 to pitch in! #ItsOnUs #…"
[[5]]
[1] "Torchbug: RT @realDonaldTrump: Our great team at @FEMA is prepared for #HurricaneNate. Everyone in LA, MS, AL, and FL please listen to your local aut…"
[[6]]
[1] "Butch2763: RT @realDonaldTrump: Our great team at @FEMA is prepared for #HurricaneNate. Everyone in LA, MS, AL, and FL please listen to your local aut…"
df.term <- twListToDF(result.term)
write.csv(df.term, "HurricaneNate.csv")
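searchTwitter also accepts optional arguments for narrowing a query, such as lang, since, until, and resultType. A minimal sketch (the dates are placeholders; the standard search API only reaches back about a week, so the range must be recent):
# restrict results to recent English-language tweets in a date window
result.recent <- searchTwitter(search.string, n = 100, lang = "en",
                               since = "2017-10-01", until = "2017-10-08",
                               resultType = "recent")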
result.latlon <- searchTwitter('nba', geocode='29.8174,-95.6814,20mi', n = 100)
head(result.latlon)
[[1]]
[1] "ELOSSports: Maurice Evans @1MoEvans shares his personal story, #ELOSSports, and #NBA. #Entrepreneurs #Athletes #sportstech… https://t.co/TbLAUKylGk"
[[2]]
[1] "Jabari316: I added a video to a @YouTube playlist https://t.co/m07ktuDRxH NBA 2K18 iOS My Career - WE LIT - BACK WITH THE PLAYMAKER"
[[3]]
[1] "jessielantz: RT @HoustonRockets: Throughout October and November, we will honor the Heroes of Harvey that YOU nominate & recognize them on-court: https:…"
[[4]]
[1] "SWHTown30: You called this man Gary Harris one of the worse defending guards in the NBA https://t.co/bwceHj9HXJ"
[[5]]
[1] "SWHTown30: Tatum is not ready for the NBA and will be hard to notice the talent right now https://t.co/PCjGjhDwje"
[[6]]
[1] "kerstindonota_9: RT @HoustonRockets: Squad is up 69-37 with 3:41 left in the first-half. \n\n\xed愼㸰戼㹤\xed戼㸳\u008a Gordon 14pts\n\xed愼㸰戼㹤\xed戼㸳\u008a Capela 12pts 9reb \n\xed愼㸰戼㹤\xed戼㸳\u008a CP3 11pts/7ast \n\nLive at ht…"
df.latlon <- twListToDF(result.latlon)
write.csv(df.latlon, "NBA.csv")
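The geocode argument takes the form 'latitude,longitude,radius', where the radius may be given in miles (mi) or kilometers (km). For example, a sketch using a 10 km radius (the coordinates are illustrative):
# search within 10 km of a latitude-longitude point
result.km <- searchTwitter('nba', geocode = '29.7604,-95.3698,10km', n = 100)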
You can use TwitteR to identify what is currently “trending” on Twitter in a specific location by using Yahoo’s Where On Earth ID, or woeid. You can look at all places around the world that have a woeid by entering the following R script:
availableTrendLocations()
You can also find the woeid for any places near a particular latitude-longitude coordinate pair. To find the woeid for New York City, you can enter the following R script:
closestTrendLocations(40.736881,-73.98887)
Let’s use the woeid for New York to collect data on what is trending in New York.
ny <- getTrends(2459115)
head(ny, n = 10)
write.csv(ny, "NYtrends.csv")
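If you already know the place name, you can also find its woeid by filtering the data frame returned by availableTrendLocations(), which has the columns name, country, and woeid. A minimal sketch:
# look up a woeid by place name instead of coordinates
locs <- availableTrendLocations()
subset(locs, name == "New York")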
To take a closer look at a Twitter user (including yourself!), use the getUser function. This will only work with users whose profiles are public, or if you are authenticated and have been granted access. You can also see things such as a user's followers, who they follow, retweets, and more. The getUser function returns a user object, which can then be polled for further information.
test_user <- getUser("binghamtonu")
test_user$id
[1] "23790666"
test_user$getDescription()
[1] "Official #BinghamtonU Twitter! Founded in 1946, Binghamton University is the premier public university in the Northeast."
test_user$getFollowersCount()
[1] 29537
test_user$getFriends(n=5)
$`3420478114`
[1] "BingPharmacy"
$`755044097874882560`
[1] "DrACR24"
$`2720523253`
[1] "allisonkhin"
$`3282859598`
[1] "TwitterVideo"
$`3376435420`
[1] "AtlantaBlaze"
The userTimeline function will allow you to retrieve various timelines within the Twitter universe.
userTimeline(user = "realDonaldTrump", n = 5)
[[1]]
[1] "realDonaldTrump: Leaving the White House for the Great State of North Carolina. Big progress being made on many fronts!"
[[2]]
[1] "realDonaldTrump: Thanks for your support! https://t.co/iqUM1RfQso"
[[3]]
[1] "realDonaldTrump: ...hasn't worked, agreements violated before the ink was dry, makings fools of U.S. negotiators. Sorry, but only one thing will work!"
[[4]]
[1] "realDonaldTrump: Presidents and their administrations have been talking to North Korea for 25 years, agreements made and massive amounts of money paid......"
[[5]]
[1] "realDonaldTrump: Will be joining @GovMikeHuckabee tonight at 8pmE on @TBN. Enjoy! https://t.co/Y5hGPpYZfl"
In this part we will use R to visualize tweets as a word cloud to find out what people are tweeting about the NBA (#nba). A word cloud is a visual representation of the most relevant words: the more times a word appears in our sample of tweets, the bigger it is drawn.
if (!require(twitteR)) {install.packages("twitteR")}
if (!require(ROAuth)) {install.packages("ROAuth")}
library(twitteR)
library(ROAuth)
# you must get the following information from the Twitter App you just created
my.consumer.key = "fH4IijcQUrwxEQ3mmb6G2gzUc"
my.consumer.secret = "FxkuV6ePyFaia2LmyxetoH50IxGrQcEYbwnLe3EjVDWsCdPrhJ"
my.access.token = "99989439-wG82y2hMmAmlJ1iIlQgNu0l65ZOKVuVscj4Idm9Xu"
my.access.token.secret = "QNehvUZOGNNZyqFYDCAP6tlWEHWIBbKIhiqPEKAM2SOoT"
my_oauth <- setup_twitter_oauth(consumer_key = my.consumer.key,
                                consumer_secret = my.consumer.secret,
                                access_token = my.access.token,
                                access_secret = my.access.token.secret)
[1] "Using direct authentication"
tweets <- searchTwitter("#nba", n=1000, lang="en")
tweets.text <- sapply(tweets, function(x) x$getText())
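Many of these results are retweets. If you would rather build the word cloud from original tweets only, twitteR provides strip_retweets, which you can optionally apply before extracting the text; a sketch:
# optionally drop retweets so the cloud reflects original tweets only
tweets.noRT <- strip_retweets(tweets, strip_manual = TRUE, strip_mt = TRUE)
tweets.text <- sapply(tweets.noRT, function(x) x$getText())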
We have already been authenticated and successfully retrieved the text from the tweets using #nba. The first step in creating a word cloud is to clean up the text by converting it to lowercase and removing punctuation, usernames, links, and so on. We use the function gsub to replace unwanted text; gsub replaces all occurrences of a given pattern. Although other packages can perform these operations, we have chosen gsub for its simplicity and readability.
# Remove the retweet marker ("RT")
tweets.text <- gsub("\\bRT\\b", "", tweets.text)
# Remove @UserName mentions
tweets.text <- gsub("@\\w+", "", tweets.text)
# Remove punctuation
tweets.text <- gsub("[[:punct:]]", "", tweets.text)
# Remove links
tweets.text <- gsub("http\\w+", "", tweets.text)
# Collapse runs of spaces and tabs into a single space
tweets.text <- gsub("[ \t]{2,}", " ", tweets.text)
# Remove blank spaces at the beginning
tweets.text <- gsub("^ ", "", tweets.text)
# Remove blank spaces at the end
tweets.text <- gsub(" $", "", tweets.text)
# Convert all text to lowercase
tweets.text <- tolower(tweets.text)
In the next step we will use the text mining package tm to remove stop words. A stop word is a commonly used word such as “the”. Stop words should not be included in the analysis.
if(!require(tm)) {install.packages("tm")}
library(tm)
# create a corpus from the cleaned tweet text
tweets.text.corpus <- Corpus(VectorSource(tweets.text))
# clean up by removing stop words
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x) removeWords(x, stopwords()))
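The same call can strip domain-specific words that would otherwise dominate the cloud, such as the search term itself or "amp" (the leftover of "&amp;" after punctuation removal); a sketch:
# also remove the search term and other uninformative tokens
tweets.text.corpus <- tm_map(tweets.text.corpus, function(x) removeWords(x, c("nba", "amp")))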
Now we’ll generate the word cloud using the wordcloud package. For this example we plot at most 150 words that occur at least twice, with random colors and with the most frequent words placed toward the center.
if(!require(wordcloud)) {install.packages("wordcloud")}
library(wordcloud)
#generate wordcloud
wordcloud(tweets.text.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = TRUE, random.order = FALSE, max.words = 150)
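If you would like to save the cloud to a file instead of viewing it in the plot window, you can wrap the call in a graphics device; a minimal sketch (the file name is arbitrary):
# write the word cloud to a PNG file
png("nba_wordcloud.png", width = 800, height = 800)
wordcloud(tweets.text.corpus, min.freq = 2, scale = c(7, 0.5), colors = brewer.pal(8, "Dark2"), random.color = TRUE, random.order = FALSE, max.words = 150)
dev.off()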
Sentiment analyses classify communications as positive, negative, or neutral. Determining sentiment ranges from very simple classification methods to very complex algorithms. For ease and transparency, in this example we will classify the sentiment of a tweet based on the polarity of its individual words. Each word is given a score of +1 if positive, -1 if negative, and 0 if neutral, determined using the positive and negative lexicon lists compiled by Minqing Hu and Bing Liu for their work “Mining and Summarizing Customer Reviews”. The total polarity score of a tweet is the sum of the scores of its individual words; for example, a tweet containing one positive word and one negative word scores 0. Once you go to their page, click on Opinion Lexicon and then download the rar file.
# Install packages for sentiment analysis
if (!require(twitteR)) {install.packages("twitteR")}
if (!require(ROAuth)) {install.packages("ROAuth")}
if (!require(plyr)) {install.packages("plyr")}
if (!require(stringr)) {install.packages("stringr")}
if (!require(ggplot2)) {install.packages("ggplot2")}
library(twitteR)
library(ROAuth)
library(plyr)
library(stringr)
library(ggplot2)
# you must get the following information from the Twitter App you just created
my.consumer.key = "fH4IijcQUrwxEQ3mmb6G2gzUc"
my.consumer.secret = "FxkuV6ePyFaia2LmyxetoH50IxGrQcEYbwnLe3EjVDWsCdPrhJ"
my.access.token = "99989439-wG82y2hMmAmlJ1iIlQgNu0l65ZOKVuVscj4Idm9Xu"
my.access.token.secret = "QNehvUZOGNNZyqFYDCAP6tlWEHWIBbKIhiqPEKAM2SOoT"
my_oauth <- setup_twitter_oauth(consumer_key = my.consumer.key,
                                consumer_secret = my.consumer.secret,
                                access_token = my.access.token,
                                access_secret = my.access.token.secret)
[1] "Using direct authentication"
save(my_oauth, file = "my_oauth.Rdata")
neg = scan("negative-words.txt", what="character", comment.char=";")
Read 4783 items
pos = scan("positive-words.txt", what="character", comment.char=";")
Read 2006 items
score.sentiment = function(tweets, pos.words, neg.words)
{
  scores = laply(tweets, function(tweet, pos.words, neg.words) {
    tweet = gsub('https://', '', tweet)         # remove https://
    tweet = gsub('http://', '', tweet)          # remove http://
    tweet = gsub('[^[:graph:]]', ' ', tweet)    # replace non-graphic characters (like emoticons) with spaces
    tweet = gsub('[[:punct:]]', '', tweet)      # remove punctuation
    tweet = gsub('[[:cntrl:]]', '', tweet)      # remove control characters
    tweet = gsub('\\d+', '', tweet)             # remove numbers
    tweet = tolower(tweet)                      # make all letters lowercase
    word.list = str_split(tweet, '\\s+')        # split the tweet into a list of words
    words = unlist(word.list)                   # turn the list into a vector
    pos.matches = match(words, pos.words)       # positions of words found in the positive list
    neg.matches = match(words, neg.words)       # positions of words found in the negative list
    pos.matches = !is.na(pos.matches)           # convert matches to TRUE or FALSE
    neg.matches = !is.na(neg.matches)
    score = sum(pos.matches) - sum(neg.matches) # TRUE and FALSE are treated as 1 and 0, so they can be summed
    return(score)
  }, pos.words, neg.words)
  scores.df = data.frame(score = scores, text = tweets)
  return(scores.df)
}
tweets = searchTwitter('Trump', n = 2500)
Tweets.text = laply(tweets, function(t) t$getText()) # gets the text from the tweets
analysis = score.sentiment(Tweets.text, pos, neg)    # calls the sentiment function
table(analysis$score)
 -5  -4  -3  -2  -1   0   1   2   3   4   5
  1   2  55 121 406 597 826 295 181  14   2
hist(analysis$score)
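Since ggplot2 is already loaded, you can also draw a labeled histogram of the same scores; a minimal sketch (the title and axis labels are illustrative):
# plot the distribution of polarity scores with ggplot2
ggplot(analysis, aes(x = score)) +
  geom_histogram(binwidth = 1) +
  labs(title = "Sentiment of tweets mentioning Trump", x = "Polarity score", y = "Number of tweets")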