If you actively use Twitter, you probably know that it’s common for tweets to have several hashtags. If you search for tweets with a particular hashtag, you often see other related hashtags repeating over and over again as certain topics trend together.
If that information were distilled down to a single graphic showing which hashtags frequently occur together, you could quickly get some insight into related current news or cultural items. The topic of today’s post will be making such a graphic.
But first, we will need to build a little machinery to get the job done. I’ll be using the TwitterSearch API for Python to get the data, and matplotlib/pyplot for displaying results. This is the full set of imports:
from TwitterSearch import *
import matplotlib.pyplot as plt
import collections, sys, math
Step 1: Parsing the tweet.
Tweets can be ugly, and I don’t mean in their verbal content. In an ideal world, we would be streaming in tweets that look like:
New #cosmos series is great public outreach for #science and #physics
But often what we see are things like:
WOAH just saw #COSMOs #Science!!#physics.
So, we have some cleaning to do. What we need is a function that separates hashtags from other words, puts everything in the same case, cleans punctuation, removes possessives (like “#obama’s statement”), etc. This is my solution:
def FindHashHags(tweet):
    """
    This function takes the twittersearch output tweet,
    cleans up the text and the format, and returns
    the list of all hashtags in the tweet
    """
    # First get the tweet text
    tweettxt = tweet['text'].encode('ascii','ignore')
    # People sometimes stack hashtags with no spacing
    # Add spacing before the hashtag symbol
    tweettxt = tweettxt.replace('#',' #')
    # Clean all punctuation which sometimes
    # gets cluttered in with the tag
    for punct in '.!",;:%<>/~`()[]{}?':
        tweettxt = tweettxt.replace(punct,'')
    # Split the tweet string into a list of words,
    # some of which will be hashtagged
    # print tweettxt
    tweettxt = tweettxt.split()
    # Initiate list of hashtags
    hashtags = []
    # Loop over the words in the tweet
    for word in tweettxt:
        # Find words beginning with hashtag
        if word[0]=='#':
            # Lower-case the word
            hashtag = word.lower()
            # Correct for possessives
            hashtag = hashtag.split('\'')[0]
            # Get rid of the hashtag symbol
            hashtag = hashtag.replace('#','')
            # Make sure there is text left, append to list
            if len(hashtag)>0:
                hashtags.append(hashtag)
    # Return clean list of hashtags
    return hashtags
To follow up on our first example, we can do:
MyTweet = {}
MyTweet['text'] = 'WOAH just saw #COSMOs #Science!!#physics.'
print FindHashHags(MyTweet)
and we will just see: ['cosmos', 'science', 'physics']
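For comparison, the same cleaning can be sketched with a regular expression. This is only an alternative under stated assumptions, not the function used in the rest of this post: the helper name find_hashtags_regex is made up for illustration, it is written for Python 3, it takes the tweet text directly rather than the tweet dictionary, and it treats a hashtag as '#' followed by letters, digits or underscores, which is slightly stricter than the punctuation list above.

```python
import re

# Hypothetical helper, not part of the original post.
# Assumption: a hashtag is '#' followed by one or more word characters
# (letters, digits, underscore). Anything else ends the tag.
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def find_hashtags_regex(text):
    """Return the lower-cased hashtags found in a tweet's text, in order."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

# Stacked tags and trailing punctuation are handled by the pattern itself
print(find_hashtags_regex("WOAH just saw #COSMOs #Science!!#physics."))
# The apostrophe ends the match, so the possessive is dropped
print(find_hashtags_regex("Reaction to #obama's statement"))
```

The pattern does the work of the spacing, punctuation and possessive steps in one pass, at the cost of being less explicit about which characters are thrown away.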
Step 2: Searching and collecting hashtags
TwitterSearch will allow us to search with a hashtag. We want to input a phrase like “python” and get back the frequencies of the top related hashtags in the last ~1000 tweets. To run the search you need to be a Twitter developer, with your own consumer/access keys and secrets. See: https://dev.twitter.com/docs/auth/tokens-devtwittercom .
I’ll estimate the uncertainty on the number of appearances of each tag using the Poisson uncertainty on the number of observations, which is just the square root of the count. The code:
def HashSearch(hashtag):
    """
    This is the master function which will perform the twitter
    search for the hashtag, and find all other hashtags in those
    tweets. It will return a histogram of the frequency of other
    hashtags in tweets.
    """
    # Everything in lower case for simplicity
    hashtag = hashtag.lower()
    # CoTags will be the list of shared tags in tweets
    CoTags = []
    # Total number of tweets discovered
    ntweets = 0
    # This is the hashtag with no case or hash symbol
    basictag = hashtag.replace('#','')
    # Create the twitter search object
    # You need your own keys and token from your twitter account
    # https://dev.twitter.com/docs/auth/tokens-devtwittercom
    ts = TwitterSearch(
        consumer_key = 'XXXXXXXXXXXXXXXXXXXXX',
        consumer_secret = 'YYYYYYYYYYYYYYYYYYYY',
        access_token = 'ZZZZZZZZZZZZZZZZZZZZZZZ',
        access_token_secret = 'WWWWWWWWWWWWWWWWWWW')
    # Create twitter search order for our hashtag, in english (en)
    # With setCount 100 (100 results)
    tso = TwitterSearchOrder()
    tso.setKeywords([hashtag])
    tso.setLanguage('en')
    tso.setCount(100)
    tso.setIncludeEntities(False)
    # Loop over tweets in results
    for tweet in ts.searchTweetsIterable(tso):
        # Stop at 1000, that's enough. Checking before the tweet is
        # processed keeps ntweets equal to the number of tweets counted.
        if ntweets == 1000:
            break
        ntweets += 1
        # Use our cleaning/parsing function to get hashtags in tweet
        hashtags = FindHashHags(tweet)
        # Loop over hashtags
        for atag in hashtags:
            # Ignore our target hashtag!
            if basictag not in atag:
                # Add each hashtag to list of CoTags
                CoTags.append(atag)
    # Nothing to normalize if the search returned no tweets
    if ntweets == 0:
        return []
    # Get histogram of values
    taghisto = collections.Counter(CoTags)
    # Convert histogram to basic list like [['tag1',n1],['tag2',n2]]
    taghisto = [list(x) for x in sorted(taghisto.items(), key=lambda x: -x[1])]
    # Let's normalize everything to percentages, and get uncertainties
    ntweets = float(ntweets)
    # Loop over histogram bins
    for ibin in range(len(taghisto)):
        # The poisson uncertainty is the square root of counts for each tag
        uncertainty = math.sqrt(taghisto[ibin][1])
        # Set counts to a percentage of total tweets in which tag occurs
        taghisto[ibin][1] = 100.*taghisto[ibin][1]/ntweets
        # Same for the uncertainty
        taghisto[ibin].append(100.*uncertainty/ntweets)
    # Return just the histogram information
    return taghisto
So for instance, we could do:
print HashSearch('#python')
and get back a frequency list like: [['ruby', 5.6, 0.7483314773547882], ['jobs', 5.5, 0.7416198487095663], ['java', 5.0, 0.7071067811865476], ...]
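To see where those numbers come from without touching the Twitter API, here is a minimal offline sketch of the normalization step, written for Python 3. The helper name cotag_percentages is made up for illustration, and the raw counts (56, 55 and 50 out of exactly 1000 tweets) are my assumption: they are simply the counts that would reproduce the percentages quoted above.

```python
import collections
import math

def cotag_percentages(tag_counts, ntweets):
    """Turn raw tag counts into [tag, percent, percent_uncertainty] rows,
    sorted by decreasing count."""
    rows = []
    for tag, count in sorted(tag_counts.items(), key=lambda kv: -kv[1]):
        # Share of tweets containing the tag, in percent
        percent = 100.0 * count / ntweets
        # Poisson uncertainty: sqrt(count), scaled the same way
        uncertainty = 100.0 * math.sqrt(count) / ntweets
        rows.append([tag, percent, uncertainty])
    return rows

# Assumed raw counts, chosen to match the example output above
counts = collections.Counter({"ruby": 56, "jobs": 55, "java": 50})
for row in cotag_percentages(counts, 1000):
    print(row)
```

For 'ruby', 56 appearances in 1000 tweets is 5.6%, and the uncertainty is sqrt(56) ≈ 7.48 appearances, or about 0.75%.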
Step 3: Draw the results
Basically, we are going to draw a histogram showing the frequency of the most related tags. I’ll be showing up to 10 tags, and restricting to those that show up often enough to be statistically significant. Without further ado, the code:
def DrawHisto(taghisto,atag):
    """
    This function is for drawing a png histogram of the output
    of the HashSearch function.
    """
    # Let's get an ideal number of bins. This is a cosmetic
    # choice. I choose all bins where the error bar is less
    # than 30% of the bin content. After all, there is little
    # insight from single-events. No more than 10.
    N = 0
    for t in taghisto:
        if t[2]<0.3*t[1]:
            N+=1
        if N==10:
            break
    # Get the list of labels, bin content, and bin errors
    # for the first N tags.
    labels = ['#'+taghisto[n][0] for n in range(N)]
    content = [taghisto[n][1] for n in range(N)]
    errors = [taghisto[n][2] for n in range(N)]
    # Horizontal bar plot with error bars
    plt.barh(range(N), content, xerr=errors, align='center', alpha=0.4)
    # Set the y labels as the hashtags
    plt.yticks(range(N), labels)
    # Set x label and title
    plt.xlabel('Percent Shared Hashtags')
    plt.title('Shared hashtags for #'+atag)
    # Cosmetic choice to adjust x axis
    plt.xlim(0.0,max(content)*1.2)
    # Labels can be big, auto-fix the layout
    plt.tight_layout()
    # Save to png
    plt.savefig(atag+'.png')
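It may help to spell out what the 30% cut means in counts. Since the error on n appearances is sqrt(n), the relative error is 1/sqrt(n), and requiring it to be below 0.3 means n must be above about 11.1, so a tag needs at least 12 appearances to be drawn. Below is a small Python 3 sketch of just the bin-selection step. The function name and its parameters are made up for illustration, and unlike the loop above it stops at the first row that fails the cut, which gives the same answer when the rows are sorted by decreasing count.

```python
import math

def count_significant_bins(taghisto, max_rel_error=0.3, max_bins=10):
    """Count the leading [tag, percent, uncertainty] rows whose uncertainty
    is below max_rel_error times the percent, stopping at max_bins."""
    n = 0
    for _tag, percent, uncertainty in taghisto:
        # Stop at the cap, or at the first row that fails the cut
        if n == max_bins or uncertainty >= max_rel_error * percent:
            break
        n += 1
    return n

# Made-up rows for 56, 12 and 11 appearances in 1000 tweets
rows = [[tag, 100.0 * c / 1000, 100.0 * math.sqrt(c) / 1000]
        for tag, c in [("ruby", 56), ("django", 12), ("flask", 11)]]
print(count_significant_bins(rows))
```

Here the tag with 12 appearances passes (relative error about 29%) and the one with 11 does not (about 30.2%), so two bins would be drawn.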
Step 4: Wrap it up
I’m going to put everything in a file called “Hashtagram.py” with a tag given as a command-line argument, and use just a couple lines to get the data and draw the histogram.
# This is the hashtag (system argument)
mytag = sys.argv[1]
# Now get the raw data for co-tags of the hashtag "mytag"
myhisto = HashSearch('#'+mytag)
# Draw the histogram!
DrawHisto(myhisto,mytag)
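The bare sys.argv[1] raises an IndexError if no tag is given, and a tag typed with its '#' would end up as '##tag' in the search. A slightly more defensive version of the argument handling could look like the Python 3 sketch below. The helper name tag_from_argv and the usage message are made up for illustration and are not part of the script above.

```python
import sys

def tag_from_argv(argv):
    """Return the hashtag given on the command line, lower-cased and
    without any leading '#'. Exit with a usage message if it is missing."""
    if len(argv) < 2 or not argv[1].lstrip("#"):
        sys.exit("usage: python Hashtagram.py <hashtag>")
    return argv[1].lstrip("#").lower()

# Simulated command line: python Hashtagram.py "#Python"
print(tag_from_argv(["Hashtagram.py", "#Python"]))
```

In the script itself this would be called as tag_from_argv(sys.argv) in place of reading sys.argv[1] directly.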
Now we can run it in the command line like “python Hashtagram.py python”, and get a result like:
Of course we can use this tool to explore hashtags from any part of news and culture. There are quite a few examples in the gallery below:
- People often tweet about biden and obama together, but it seems like most people care about his NCAA picks.
- For Russia, lots about ukraine and crimea, and still some lingering sochi from the olympics.
- For cosmos, some about actual cosmos, and some about the tv show.
- For boston, mostly about jobs and news.
- No twitter analytics is complete without a comprehensive yolo analysis. Swag is still a popular cultural item.
What would you analyze?
There we have it. From international politics, to coding and the universe, we have found out what items are trending together in the aggregate mind of the internet.
So that raises the question: What hashtags do you think would be insightful?