Wednesday, December 10, 2014

Sentiment Analysis on Twitter with R


In the previous post we explained how to install R on Ubuntu. R offers a wide variety of options to do lots of interesting and fun things, and this post shows you precisely how to do one of them.

1. How to get data from Twitter?

So the first thing to do is get some data from Twitter.

There are two primary ways to obtain data. In order of complexity, these are:

a) Using the R package "twitteR"
b) Using the R package "XML"


2. Using the R package "twitteR"

You don't have to download it from a website; you can install it directly from within R:


> install.packages('twitteR', dependencies=T)

You then have to select the CRAN mirror from which you want to download it and click OK.
R will now download and install the package. If you see some errors, maybe this article can help you.


Then we have to activate it for our current session with:

> library(twitteR)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson

> library(plyr)
Error in library(plyr) : there is no package called ‘plyr’

Try setting your repo to a different mirror like this:

> options(repos="http://streaming.stat.iastate.edu/CRAN")

or use any other mirror of your choice.
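If you are unsure which mirror URL to use, base R can list the official mirrors for you. This is a small sketch using only built-in functions:

```r
# read the mirror list that ships with R (no network access needed)
mirrors <- getCRANmirrors(local.only = TRUE)

# show the first few mirror names and URLs
head(mirrors[, c("Name", "URL")])

# pick one and make it the default repository for this session
options(repos = mirrors$URL[1])
```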

Then try loading plyr:

> install.packages("plyr")
> library("plyr") 


The RStudio mirror is another reliable choice:

> options(repos="http://cran.rstudio.com")


3. Twitter authentication

First we need to create an app at Twitter.


Go to https://apps.twitter.com and log in with your Twitter Account.

Once you have created your application, note the consumer key and consumer secret shown on its details page.

Continue to R and type in the following lines:

> reqURL <- "https://api.twitter.com/oauth/request_token"
> accessURL <- "https://api.twitter.com/oauth/access_token"
> authURL <- "https://api.twitter.com/oauth/authorize"
> consumerKey <- "yourconsumerkey"
> consumerSecret <- "yourconsumersecret"

You have to replace yourconsumerkey and yourconsumersecret with the values provided on your app page on Twitter, still open in your web browser.

> twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                               consumerSecret=consumerSecret,
                               requestURL=reqURL,
                               accessURL=accessURL,
                               authURL=authURL)
> download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
> twitCred$handshake(cainfo="cacert.pem")

You should see something like this:

To enable the connection, please direct your web browser to:
https://api.twitter.com/oauth/authorize?oauth_token=xxxxxxxxxxxxxxxxxxxxxx
When complete, record the PIN given to you and provide it here:

Open the URL in your browser, authorize the app, and enter the PIN when prompted. Then register the credentials for the session:

> registerTwitterOAuth(twitCred)

4. Processing tweets data via twitteR

Let's collect some tweets containing the term "C.I.A torture":

# collect tweets in English containing 'C.I.A torture'
> tweets = searchTwitter("C.I.A torture", n=200, lang="en", cainfo="cacert.pem")

To be able to analyze our tweets, we have to extract their text and save it into the variable tweets_content by typing:

> tweets_content = laply(tweets,function(t)t$getText())
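If you want to see what laply is doing here without live Twitter data, this minimal sketch (assuming plyr is installed) uses plain lists with a text field as stand-ins for the status objects returned by searchTwitter:

```r
library(plyr)

# fake "tweets": plain lists instead of twitteR status objects
fake_tweets <- list(list(text = "R is great"),
                    list(text = "sentiment analysis is fun"))

# laply applies the function to each element and returns a simple vector
tweets_content <- laply(fake_tweets, function(t) t$text)
tweets_content
# a character vector of length 2
```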

What we also need are our lists of positive and negative words. We can use the opinion lexicon compiled by Hu and Liu, which comes as the files positive-words.txt and negative-words.txt.

After downloading the word lists, we load them into variables by typing:

> neg= scan('/path/negative-words.txt', what='character', comment.char=';')
> pos= scan('/path/positive-words.txt', what='character', comment.char=';')
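As a quick sanity check that scan() really skips the comment lines (the word-list files start with lines beginning with ';'), here is a self-contained sketch using a temporary file:

```r
# write a tiny word list in the same format as the downloaded files
tmp <- tempfile(fileext = ".txt")
writeLines(c("; this comment line is skipped by scan()",
             "good", "great", "happy"), tmp)

pos_demo <- scan(tmp, what = 'character', comment.char = ';')
pos_demo
# "good" "great" "happy"
```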

We also need the stringr package:

> install.packages("stringr")
> library(stringr)

Now we have to insert a small algorithm written by Jeffrey Breen to score our words.

Just copy and paste the following lines and hit enter:

#function to calculate number of words in each category within a sentence
> score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
    require(plyr)
    require(stringr)
     
    # we got a vector of sentences. plyr will handle a list
    # or a vector as an "l" for us
    # we want a simple array ("a") of scores back, so we use 
    # "l" + "a" + "ply" = "laply":
    scores = laply(sentences, function(sentence, pos.words, neg.words) {
         
        # clean up sentences with R's regex-driven global substitute, gsub():
        sentence = gsub('[[:punct:]]', '', sentence)
        sentence = gsub('[[:cntrl:]]', '', sentence)
        sentence = gsub('\\d+', '', sentence)
        # and convert to lower case:
        sentence = tolower(sentence)

        # split into words. str_split is in the stringr package
        word.list = str_split(sentence, '\\s+')
        # sometimes a list() is one level of hierarchy too much
        words = unlist(word.list)

        # compare our words to the dictionaries of positive & negative terms
        pos.matches = match(words, pos.words)
        neg.matches = match(words, neg.words)
     
        # match() returns the position of the matched term or NA
        # we just want a TRUE/FALSE:
        pos.matches = !is.na(pos.matches)
        neg.matches = !is.na(neg.matches)

        # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
        score = sum(pos.matches) - sum(neg.matches)

        return(score)
    }, pos.words, neg.words, .progress=.progress )

    scores.df = data.frame(score=scores, text=sentences)
    return(scores.df)
}
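To see the core scoring idea in isolation, this base-R sketch applies the same cleanup-and-match steps to one made-up sentence against tiny made-up word lists (base strsplit stands in for stringr's str_split):

```r
pos.words <- c("good", "great")
neg.words <- c("bad", "awful")

sentence <- "What a GREAT day, nothing bad at all!"

# same cleanup steps as in score.sentiment()
sentence <- gsub('[[:punct:]]', '', sentence)
sentence <- gsub('[[:cntrl:]]', '', sentence)
sentence <- gsub('\\d+', '', sentence)
sentence <- tolower(sentence)

words <- unlist(strsplit(sentence, '\\s+'))

# one positive hit ("great") minus one negative hit ("bad")
score <- sum(!is.na(match(words, pos.words))) -
         sum(!is.na(match(words, neg.words)))
score
# 0
```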


> analysis = score.sentiment(tweets_content, pos, neg)

The scores can be read roughly as follows (a score of 0 is neutral):

Very negative (rating -5 or -4)
Negative (rating -3, -2, or -1)
Positive (rating 1, 2, or 3)
Very positive (rating 4 or 5)


You can get a table by typing:

>  table(analysis$score)

Or the mean by typing:

>  mean(analysis$score)

Or get a histogram with:

>  hist(analysis$score)
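The same summaries work on a made-up score vector, which shows the shape of the output without needing live tweets:

```r
scores <- c(-2, -1, 0, 0, 1, 1, 1, 2)

table(scores)  # how often each score occurs
mean(scores)   # 0.25: on balance slightly positive
```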



