Hashtags in Common: Visualizations with Python and TwitterSearch

If you actively use Twitter, you probably know that tweets often carry several hashtags. If you search for tweets with a particular hashtag, you often see the same related hashtags repeated over and over as certain topics trend together.

If that information were distilled into a single graphic showing which hashtags frequently occur together, you could get quick insight into related news or cultural items. Making such a graphic is the topic of today’s post.

But first, we will need to build a little machinery to get the job done. I’ll be using the TwitterSearch package for Python to get the data, and matplotlib/pyplot for displaying results. This is the full set of imports:

from TwitterSearch import *
import matplotlib.pyplot as plt
import collections,sys,math 

Step 1: Parsing the tweet.

Tweets can be ugly, and I don’t mean in their verbal content. In an ideal world, we would be streaming in tweets that look like:

New #cosmos series is great public outreach for #science and #physics

But often what we see are things like:

WOAH just saw #COSMOs #Science!!#physics.

So, we have some cleaning to do. What we need is a function that separates hashtags from other words, puts everything in the same case, cleans punctuation, removes possessives (like “#obama’s statement”), etc. This is my solution:

def FindHashHags(tweet):
	'''
	This function takes the twittersearch output tweet,
	cleans up the text and the format, and returns
	the list of all hashtags in the tweet
	'''
	# First get the tweet text
	tweettxt = tweet['text'].encode('ascii','ignore')
	# People sometimes stack hashtags with no spacing
	# Add spacing before the hashtag symbol
	tweettxt = tweettxt.replace('#',' #')
	# Clean all punctuation which sometimes 
	# gets cluttered in with the tag
	for punct in '.!",;:%<>/~`()[]{}?':
		tweettxt = tweettxt.replace(punct,'')
	# Split the tweet string into a list of words,
	# some of which will be hashtagged
	# print tweettxt
	tweettxt = tweettxt.split()
	# Initiate list of hashtags
	hashtags = []
	# Loop over the words in the tweet
	for word in tweettxt:
		# Find words beginning with hashtag
		if word[0]=='#':
			# Lower-case the word
			hashtag = word.lower()
			# Correct for possessives
			hashtag = hashtag.split('\'')[0]
			# Get rid of the hashtag symbol
			hashtag = hashtag.replace('#','')
			# Make sure there is text left, append to list
			if len(hashtag)>0:
				hashtags.append(hashtag)
	# return clean list of hashtags
	return hashtags

To follow up on our first example, we can do:

MyTweet = {}
MyTweet['text'] = 'WOAH just saw #COSMOs #Science!!#physics.'
print FindHashHags(MyTweet)

and we will just see: ['cosmos', 'science', 'physics']
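For comparison, the same cleanup can be sketched with a regular expression. This is not the function used in the rest of the post. It is a minimal alternative written for Python 3, the name find_hashtags_regex is made up for illustration, and it assumes a hashtag is a '#' followed by letters, digits, or underscores.

```python
import re

# Assumed definition of a tag: '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r'#(\w+)')

def find_hashtags_regex(text):
    """Return the lower-cased hashtags in text, without the '#' symbol."""
    return [tag.lower() for tag in HASHTAG_PATTERN.findall(text)]

# Prints ['cosmos', 'science', 'physics']
print(find_hashtags_regex('WOAH just saw #COSMOs #Science!!#physics.'))
```

Because punctuation and apostrophes are not word characters, stacked tags and possessives are handled by the pattern itself rather than by separate cleaning steps.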

Step 2: Searching and collecting hashtags

TwitterSearch will allow us to search with a hashtag. We want to input a phrase like “python” and get back the frequencies of the top related hashtags in the last ~1000 tweets. To run the search you need to be a Twitter developer, and you will need your own consumer/access keys and secrets. See: https://dev.twitter.com/docs/auth/tokens-devtwittercom .

I’ll estimate the uncertainty on the number of appearances of each tag with the Poisson uncertainty on a count, which is basically just the square root of the count. The code:

def HashSearch(hashtag):
	'''
	This is the master function which will perform the twitter 
	search for the hashtag, and find all other hashtags in those
	tweets. It will return a histogram of the frequency of other
	hashtags in tweets. 
	'''
	# Everything in lower case for simplicity
	hashtag = hashtag.lower()

	# CoTags will be the list of shared tags in tweets
	CoTags = []
	# Total number of tweets discovered
	ntweets = 0
	# This is the hashtag with no case or hash symbol
	basictag = hashtag.lower()
	basictag = basictag.replace('#','')

	# Create the twitter search object
	# You need your own keys and token from your twitter account
	# https://dev.twitter.com/docs/auth/tokens-devtwittercom
	ts = TwitterSearch(
		consumer_key = 'XXXXXXXXXXXXXXXXXXXX',
		consumer_secret = 'YYYYYYYYYYYYYYYYYYYY',
		access_token = 'ZZZZZZZZZZZZZZZZZZZZ',
		access_token_secret = 'WWWWWWWWWWWWWWWWWWW')

	# Create twitter search order for our hashtag, in english (en)
	# With setCount 100 (100 results)
	tso = TwitterSearchOrder()
	tso.setKeywords([hashtag])
	tso.setLanguage('en')
	tso.setCount(100)

	# Loop over tweets in results.
	for tweet in ts.searchTweetsIterable(tso):
		# Use our cleaning/parsing function to get hashtags in tweet
		hashtags = FindHashHags(tweet)
		# Loop over hashtags
		for atag in hashtags:
			# Ignore our target hashtag!
			if basictag not in atag:
				# Add each hashtag to list of CoTags
				CoTags.append(atag)
		# Count this tweet and stop at 1000, that's enough
		ntweets += 1
		if ntweets == 1000:
			break

	# Get histogram of values 
	taghisto = collections.Counter(CoTags)
	# convert histogram to basic list like [['tag1',n1],['tag2',n2]]
	taghisto = [list(x) for x in sorted(taghisto.items(), key=lambda x: -x[1])]
	# Let's normalize everything to percentages, and get uncertainties
	ntweets = float(ntweets)
	# Loop over histogram bins
	for ibin in range(len(taghisto)):
		# The poisson uncertainty is the square root of counts for each tag
		uncertainty = math.sqrt(taghisto[ibin][1])
		# Set counts to a percentage of total tweets in which tag occurs
		taghisto[ibin][1]= 100.*taghisto[ibin][1]/ntweets
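		# Same for the uncertainty
		uncertainty = 100.*uncertainty/ntweets
		# Store the uncertainty as the third entry of the bin
		taghisto[ibin].append(uncertainty)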

	# Return just the histogram information
	return taghisto

So for instance, we could do:

print HashSearch('#python')

and get back a frequency list like: [['ruby', 5.6, 0.7483314773547882], ['jobs', 5.5, 0.7416198487095663], ['java', 5.0, 0.7071067811865476], ...]
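To see where those numbers come from without a Twitter account, here is a small offline sketch of the counting and normalization step, written for Python 3. The function name cotag_frequencies and the sample data are made up for illustration. It takes one list of already-cleaned hashtags per tweet instead of running a search.

```python
import collections
import math

def cotag_frequencies(tag_lists, target):
    """Return [[tag, percent, percent_error], ...] sorted by descending count.

    tag_lists holds one list of cleaned hashtags per tweet; target is the
    searched tag, which is skipped.
    """
    ntweets = float(len(tag_lists))
    counts = collections.Counter(
        tag for tags in tag_lists for tag in tags if target not in tag)
    result = []
    for tag, n in sorted(counts.items(), key=lambda item: -item[1]):
        # Poisson uncertainty on n counts is sqrt(n); scale both to percent
        result.append([tag, 100.0 * n / ntweets, 100.0 * math.sqrt(n) / ntweets])
    return result

# Made-up sample: four tweets that all contained #python
sample = [['python', 'ruby'], ['python', 'ruby', 'jobs'],
          ['python'], ['python', 'java', 'ruby']]
for row in cotag_frequencies(sample, 'python'):
    print(row)
```

The same arithmetic explains the 'ruby' entry above: a tag seen 56 times in 1000 tweets appears in 5.6% of them, with an uncertainty of sqrt(56)/1000, or about 0.75%.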

Step 3: Draw the results

Basically, we are going to draw a histogram showing the frequency of the most related tags. I’ll be showing up to 10 tags, and restricting to those that show up often enough to be statistically significant. Without further ado, the code:

def DrawHisto(taghisto,atag):
	'''
	This function is for drawing a png histogram of the
	output of the HashSearch function.
	'''
	# Let's get an ideal number of bins. This is a cosmetic
	# choice. I choose all bins where the error bar is less
	# than 30% of the bin content. After all, there is little
	# insight from single-events. No more than 10.
	N = 0
	for t in taghisto:
		if t[2]<0.3*t[1]:
			N += 1
		if N==10:
			break
	# Get the list of labels, bin content, and bin errors
	# for the first N tags.
	labels = ['#'+taghisto[n][0] for n in range(N)]
	content = [taghisto[n][1] for n  in range(N)]
	errors = [taghisto[n][2] for n  in range(N)]

	# Horizontal bar plot with error bars
	plt.barh(range(N), content,xerr=errors, align='center',alpha=0.4)
	# Set the y labels as the hashtags
	plt.yticks(range(N), labels)
	# Set x label and title
	plt.xlabel('Percent Shared Hashtags')
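	plt.title('Shared hashtags for #'+atag)
	# Cosmetic choice to adjust x axis
	# (the exact adjustment is an assumption: start the axis at zero)
	plt.xlim(0, plt.xlim()[1])
	# Labels can be big, auto-fix the layout
	plt.tight_layout()
	# Save to png (the file name is an assumption: the tag plus '.png')
	plt.savefig(atag+'.png')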
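The 30% cut has a simple reading. The error on a tag seen n times is sqrt(n), so the relative error is 1/sqrt(n), and requiring it to be below 0.3 means n must be larger than about 11. The sketch below isolates that selection step for Python 3. The name count_significant_bins and its keyword arguments are made up for illustration, and it stops at the first entry that fails the cut, which gives the same answer as long as the list is sorted by descending count.

```python
def count_significant_bins(taghisto, max_bins=10, max_rel_err=0.3):
    """Count leading entries whose error is below max_rel_err of the value."""
    n = 0
    for tag, percent, error in taghisto:
        # Entries are sorted by descending count, so stop at the first failure
        if error >= max_rel_err * percent:
            break
        n += 1
        if n == max_bins:
            break
    return n

# Made-up entries in the [tag, percent, error] layout returned by HashSearch
sample = [['ruby', 5.6, 0.748], ['jobs', 5.5, 0.742], ['rare', 0.5, 0.224]]
print(count_significant_bins(sample))  # prints 2
```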

Step 4: Wrap it up

I’m going to put everything in a file called “Hashtagram.py” with a tag given as a command-line argument, and use just a couple lines to get the data and draw the histogram.

# This is the hashtag (system argument)
mytag = sys.argv[1]
# Now get the raw data for co-tags of the hashtag "mytag"
myhisto = HashSearch('#'+mytag)
# Draw the histogram!
DrawHisto(myhisto,mytag)
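The lines above assume a tag is always given on the command line. A slightly more defensive way to read it could look like the following Python 3 sketch. The helper name parse_tag is hypothetical, and stripping a leading '#' and lower-casing are my own assumptions about what is convenient.

```python
def parse_tag(argv):
    """Return the tag from argv without a leading '#', or None if missing."""
    if len(argv) < 2 or not argv[1].strip('#'):
        return None
    return argv[1].lstrip('#').lower()

print(parse_tag(['Hashtagram.py', '#Python']))  # prints python
print(parse_tag(['Hashtagram.py']))             # prints None
```

In the script, parse_tag(sys.argv) could replace the direct use of sys.argv[1], with a usage message printed when it returns None.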

Now we can run it in the command line like “python Hashtagram.py python”, and get a result like:

People talking about python are also talking about ruby, jobs, java, etc.

Of course we can use this tool to explore hashtags from all parts of news and culture. There are quite a few examples in the gallery below:

What would you analyze?

There we have it. From international politics, to coding and the universe, we have found out what items are trending together in the aggregate mind of the internet.

So that begs the question: What hashtags do you think would be insightful?