Overview¶
With the amount of shit people feel the need to share with the internet, I am wondering if I can use Twitter to survey cases of the flu. To do this, I can use tweepy, a Python wrapper for the Twitter API, to scrape tweets and their locations based on keywords. However, I will need a way to determine whether a tweet is referencing an actual case of the flu or using the word in some other context. That’s where the machine learning comes in. In this notebook, I use a multinomial naive Bayes classifier to pinpoint cases of the flu self-reported over Twitter.

Training Data¶
Here, I’m going to stream all tweets containing the word “flu”, cut out some of the baggage attached to those tweets, then append them to a CSV that I’ll later go through and classify.
Once the tweets start rolling in, you can get an idea for how you want to do your classifications.
I ended up having three categories: “Accept”, “Reject”, and “Other”. The “Accept” category is for the tweets I want, the ones that admit to having the flu. I made the “Other” category because I noticed that swine and bird flu came up very frequently. If I had lumped all of those into “Reject”, the high frequency of rejected tweets mentioning bird or swine flu would have taught the classifier to accept basically any tweet that didn’t contain “bird” or “swine”.
The below script shows how I scraped/parsed the data.
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import json
import re
import time
import os
import sys

module_path = os.path.abspath(os.path.join('/Users/macuser/PycharmProjects/td2'))
if module_path not in sys.path:
    sys.path.append(module_path)
from myAccess import consumerKey, consumerSecret, accessToken, accessSecret

#Scraping/Parsing Tweets for Training
class Listener(StreamListener):
    def on_data(self, raw_data):
        try:
            jsonData = json.loads(raw_data)  #Convert tweet data to a json object
            text = jsonData['text']  #Parse out the tweet text from the json object
            if not jsonData['retweeted'] and 'RT @' not in text and jsonData['lang'] == 'en':  #Excludes retweets and non-English tweets
                text = re.sub(r"(http\S+|#\b)", "", text)  #Gets rid of links and the # in front of words
                text = " ".join(filter(lambda x: x[0] != '@', text.split()))  #Gets rid of @mentions
                text = text.encode('unicode-escape').decode(encoding='utf-8').replace('\\', ' ')  #Converts emojis to escape strings
                text = text.replace('u2026', '')  #Gets rid of ellipses
                text = text.replace('"', "'")  #Replaces " with ' so that it doesn't break the field when read from CSV
                print(text)
                with open('tdTraining.csv', 'a') as file:  #Write to CSV
                    file.write('\n' + '"' + text + '"' + ',')  #Adds quotes around the tweet to set the field
        except BaseException as e:
            print("Failed on_data,", str(e))
            time.sleep(0.1)

    def on_error(self, status_code):
        print(status_code)

#Access
auth = OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessSecret)

#Initiate Streaming
twitterStream = Stream(auth, Listener())
twitterStream.filter(track=["flu"])
Training the Classifier¶
I ended up classifying about 3,000 tweets over the course of about a week. Generally, your corpus should contain around 50,000 documents, probably more for tweets since they are so short. Nonetheless, I got sick of classifying and decided to move on.
First thing to do is load our tweets and their classifications. I’ll store them in a pandas dataframe.
import pandas as pd
trainingData = pd.read_csv('/Users/macuser/PycharmProjects/td2/tdTraining.csv', quotechar='"')
print(trainingData.head(10))
Now we need to build a pipeline. First, the text needs to be tokenized, which we can do with scikit-learn’s CountVectorizer. The gram size is the number of words that get chunked together into a token, e.g. for
“The sky is blue”
- unigrams: “The”, “sky”, “is”, “blue”
- bigrams: “The sky”, “sky is”, “is blue”
- trigrams: “The sky is”, “sky is blue”
Generally, a gram size of two is a good trade-off between accuracy and CPU cost, but play around with it nonetheless.
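If you want to see exactly which tokens a given ngram_range produces, you can pull the analyzer out of CountVectorizer. A quick sketch using the sentence above:
from sklearn.feature_extraction.text import CountVectorizer

#Peek at the tokens produced for ngram_range=(1, 2); note everything is lowercased
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyze("The sky is blue"))
#['the', 'sky', 'is', 'blue', 'the sky', 'sky is', 'is blue']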
Second, convert word counts to weighted frequencies with TfidfTransformer. This down-weights words that appear across many documents and normalizes for document length. It is not a huge deal for us since tweets are limited to 140 characters, but it is almost always done when building text classifiers, and it still improves accuracy by a few percentage points.
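Here’s a toy sketch of that step on a few made-up strings (not the real training data), just to show what TfidfTransformer does to the raw counts:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

#Toy corpus: "flu" shows up in every document, so tf-idf gives it less weight
#than the rarer words in each row
docs = ["I have the flu", "flu shot season", "bird flu outbreak"]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))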
Next is choosing the algorithm and, subsequently, its parameters. Most text classifiers use either the multinomial naive Bayes algorithm or a support vector classifier; others to experiment with include random forests, Bernoulli naive Bayes, and stochastic gradient descent.
Since I’m using the mNB classifier, I’ll talk about it some. The mNB classifier looks at each token independently and associates it with a class based on how frequently it appears in that class. This feeds into the conditional probability: the likelihood that a text belongs to a class given the overall association its tokens have with that class. The other thing the mNB classifier takes into account is the prior distribution of classes, which simply means the classifier is biased toward assigning texts to the classes that appeared more often in the training data.
Scikit-learn’s MultinomialNB will calculate priors automatically, but I found that my custom priors led to 2-3% higher accuracy. The other parameter, alpha, is a smoothing parameter that adds a small, equal probability of associating never-before-seen words with every class.
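If you want a starting point for custom priors, you can read them off the label frequencies in the training data. A small sketch, assuming the labels live in the 'cat' column loaded above; the alphabetical sort lines up with the class order MultinomialNB uses for class_prior:
#Class frequencies in the training labels, sorted alphabetically
#(Accept, Other, Reject) to match clf.classes_
priors = trainingData['cat'].value_counts(normalize=True).sort_index()
print(priors)  #rough starting point before hand-tuning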
Here’s the full pipeline for the classifier:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
text_clf = Pipeline([('vect', CountVectorizer(ngram_range=(1, 2))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(alpha=1, class_prior=[.27, .45, .28]))])
Now split the data into training and testing sets, then train/test the classifier.
from sklearn.model_selection import train_test_split  #sklearn.cross_validation is deprecated/removed in newer scikit-learn

featureTrain, featureTest, labelTrain, labelTest = train_test_split(
    trainingData['text'], trainingData['cat'], test_size=0.20)
fit = text_clf.fit(featureTrain, labelTrain)
Getting the results.
accuracy = text_clf.score(featureTest, labelTest)
predictions = text_clf.predict(featureTest)
crossTable = pd.crosstab(labelTest, predictions, rownames=['Actual:'], colnames=['Predicted:'], margins=True)
falsePos = sum(crossTable['Accept'].drop('All').drop('Accept')) / crossTable['Accept']['All']
falseNeg = (crossTable['Other']['Accept'] + crossTable['Reject']['Accept'])/\
(crossTable['Other']['All'] + crossTable['Reject']['All'])
print("Accuracy:",accuracy)
print(crossTable)
print("False Positives:",falsePos,'\n',"False Negatives:",falseNeg)
I like to throw all of this in a for loop to get an average accuracy over n runs, along with the confusion matrix.
accuracy = []
fP = []
fN = []
for i in range(100):
    featureTrain, featureTest, labelTrain, labelTest = train_test_split(
        trainingData['text'], trainingData['cat'], test_size=0.20)
    fit = text_clf.fit(featureTrain, labelTrain)
    test = text_clf.score(featureTest, labelTest)
    predictions = text_clf.predict(featureTest)
    crossTable = pd.crosstab(labelTest, predictions, rownames=['Actual:'], colnames=['Predicted:'], margins=True)
    falsePos = sum(crossTable['Accept'].drop('All').drop('Accept')) / crossTable['Accept']['All']
    falseNeg = (crossTable['Other']['Accept'] + crossTable['Reject']['Accept']) / \
               (crossTable['Other']['All'] + crossTable['Reject']['All'])
    accuracy.append(test)
    fP.append(falsePos)
    fN.append(falseNeg)
print("Accuracy:", sum(accuracy)/float(len(accuracy)))
print("False Positives:", sum(fP)/float(len(fP)))
print("False Negatives:", sum(fN)/float(len(fN)))
Roughly 80%. That’s OK considering our small sample size (one of the benefits of the mNB classifier is that it works well on small samples). You can use the false positives/negatives to tweak your priors: if you think there are too many false positives for a particular class, lower the prior for that class.
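For example, something like this (the numbers are illustrative, not tuned values):
#Hypothetical tweak: shrink the "Accept" prior a bit and re-check accuracy
text_clf.set_params(clf__class_prior=[.24, .46, .30])
text_clf.fit(featureTrain, labelTrain)
print(text_clf.score(featureTest, labelTest))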
Let’s test out the classifier.
text = "I have the flu" #Should be.......Accept
text2 = "My sister has the flu"#.........Reject
text3 = "I had the flu last month."#.....Reject
text4 = "Top 10 ways to treat the flu."#.Reject
text5 = "Dog flu outbreak."#.............Other
print(text_clf.predict([text,text2,text3,text4,text5]))
Looks like the classifier is working pretty well, but I did throw it some softball questions.
The next step is to save the classifier to a pickle object so I don’t have to retrain it.
import pickle
saveClassifier = open("tdMNB.pickle","wb")
pickle.dump(fit, saveClassifier)
saveClassifier.close()
Now that we have a trained classifier, we can start classifying some tweets.
Streaming Tweets¶
Now we are at the point where we can start filling out our database. The following script is similar to the “Training Data” script, only now we need to get location information, classify the tweet, and save to a database instead of a CSV. I’m using a module called geopy to find latitude and longitude coordinates from the location information provided with the tweet. If the location is valid and in the US, we then classify the tweet. If it returns an “Accept” classification, it is saved to a SQLite database with its coordinates, the datetime of the tweet, and the tweet ID (for cross-referencing and as a primary key).
Edit: I did this about 8 months ago and have no idea why I used a sql db over a csv, lol.
from tweepy import Stream, OAuthHandler
from tweepy.streaming import StreamListener
import os
import sys

module_path = os.path.abspath(os.path.join('/Users/macuser/PycharmProjects/td2'))
if module_path not in sys.path:
    sys.path.append(module_path)
import time, json, sqlite3, re, myAccess
from geopy.geocoders import Nominatim
from time import mktime
from datetime import datetime
import pickle

#ConnectingDB
conn = sqlite3.connect('twitterdemiologyDB.db')
c = conn.cursor()

#CreatingTable
c.execute('CREATE TABLE IF NOT EXISTS Main(id TEXT, date DATETIME, lat TEXT, lon TEXT)')

#AccessCodes
consumerKey = myAccess.consumerKey
consumerSecret = myAccess.consumerSecret
accessToken = myAccess.accessToken
accessSecret = myAccess.accessSecret

#Loading Classifier
classifierF = open("tdMNB.pickle", "rb")
classifier = pickle.load(classifierF)
classifierF.close()

#Scraping/ParsingTweets
class Listener(StreamListener):
    def on_data(self, raw_data):
        try:
            jsonData = json.loads(raw_data)
            #Converting date to datetime format:
            date = jsonData['created_at']
            date2 = str(date).split(' ')
            date3 = date2[1] + ' ' + date2[2] + ' ' + date2[3] + ' ' + date2[5]
            datetime_object = time.strptime(date3, '%b %d %H:%M:%S %Y')
            dt = datetime.fromtimestamp(mktime(datetime_object))
            #Parsing out ID, the tweet itself, and location:
            tweetID = jsonData['id_str']
            pretweet = jsonData['text']
            userInfo = jsonData['user']
            location = userInfo['location']
            if jsonData['lang'] == 'en' and location != 'Midwest' and location != 'Whole World' and location != 'Earth':
                #print(dt, pretweet, location)
                geolocator = Nominatim(user_agent="twitterdemiology")  #newer geopy versions want an explicit user_agent
                geolocation = geolocator.geocode(location)
                try:
                    #The 2-5 len range helps to remove inaccurate/unspecific locations
                    if "United States of America" in geolocation.address and 5 >= len(geolocation.address.split(",")) > 2:
                        lat = geolocation.latitude
                        lon = geolocation.longitude
                        print(geolocation.address, '\n', lat, lon)
                        if not jsonData['retweeted'] and 'RT @' not in pretweet:
                            tweet = re.sub(r"(http\S+|#\b)", "", pretweet)
                            tweet = " ".join(filter(lambda x: x[0] != '@', tweet.split()))
                            tweet = str(tweet.encode('unicode-escape')).replace('\\', ' ')
                            print(tweet)
                            classification = classifier.predict([tweet])[0]
                            if classification == 'Accept':
                                print("Tweet Accepted")
                                c.execute('INSERT INTO Main(id, date, lat, lon) VALUES(?,?,?,?)',
                                          (tweetID, dt, lat, lon))
                                conn.commit()
                            else:
                                print("Tweet Rejected")
                        else:
                            print("Is a Retweet")
                    else:
                        print("Location not in USA")
                except Exception as e:
                    print("Invalid location:", str(e))
        except BaseException as e:
            print("Failed on_data,", str(e))
            time.sleep(0.1)

    def on_error(self, status_code):
        print(status_code)

#Access
auth = OAuthHandler(consumerKey, consumerSecret)
auth.set_access_token(accessToken, accessSecret)

#InitiateStreaming
twitterStream = Stream(auth, Listener())
twitterStream.filter(track=['flu'])
Most of the tweets that come in are rejected due to invalid locations. Of the valid locations, about 30% are classified as “Accept” and enter the database. I get about 100 usable tweets per day (it’s currently summer; I’ll probably get a lot more come winter).
Mapping the Data¶
What I’m going to do here is plot the data onto a heatmap. This loop creates heatmaps one day at a time, each containing the past 7 days of data. I used the module gmplot to create the heatmaps.
import sqlite3
import gmplot
from datetime import timedelta
from datetime import datetime

#ConnectingToDB
conn = sqlite3.connect('twitterdemiologyDB.db')
c = conn.cursor()

#QueryingData
scanner = str('2017-06-22 10:59:25')
for i in range(10):  #Number of days that the data spans; I just put 10 because I don't want a
                     #bunch of html files in my dir (I already made them)
    tail = str(datetime.strptime(scanner, '%Y-%m-%d %H:%M:%S') - timedelta(hours=168))
    c.execute('SELECT lat, lon FROM Main WHERE date > "{x}" AND date < "{y}"'.format(x=tail, y=scanner))
    latArray = []
    lonArray = []
    for row in c.fetchall():
        if row[0] and row[1]:
            latArray.append(float(row[0]))
            lonArray.append(float(row[1]))
    #MappingData
    file = 'flu' + str(scanner) + '.html'
    gmap = gmplot.GoogleMapPlotter.from_geocode("United States")
    gmap.heatmap(latArray, lonArray)
    gmap.draw(file)
    scanner = str(datetime.strptime(scanner, '%Y-%m-%d %H:%M:%S') + timedelta(hours=24))
The goal is a timelapse of flu cases over time. There are packages such as imgkit that will take a screenshot of HTML files and save them as PNGs, but I could not get them to work with the gmplot maps (they worked for other HTML files, though), so I just manually screenshotted them with a Chrome add-on.
Here is the final result:

