We will be building a Streamlit Web App to showcase a word cloud of Trending Google Keywords and Twitter Hashtags in 2020
The link to the live app and screenshots of some of the word clouds are at the end of the article
Introduction
We will be getting our data from the following website
https://us.trend-calendar.com/trend/2020-01-01.html
The above website stores archives of the trending keywords and hashtags on each day. Beautiful Soup will be used to scrape this website to get the required data. We will be building the following features
- A 2020 word cloud
- The ability for the user to select a date and generate a word cloud for that date
- The ability for the user to change the image mask
Prerequisites
- Basic Familiarity with Web Scraping using Beautiful Soup
- Knowledge of Streamlit is not necessary for generating the word clouds, but a basic understanding of Streamlit is required to build the web app
Install the Required Packages📦
We will need to install the following libraries
- Pandas
- Requests
- Streamlit
- WordCloud
- Matplotlib
- BeautifulSoup
pip install pandas requests streamlit wordcloud matplotlib beautifulsoup4
Acquiring the Data 📈
The website mentioned above follows the following format
https://us.trend-calendar.com/trend/{date}.html
The {date} has to be replaced by the date we are interested in, in the YYYY-MM-DD format. For ease, we will scrape the data in intervals of 7 days, i.e. [2020-01-01, 2020-01-08, 2020-01-15, 2020-01-22, ...]
Generating the Dates
Pandas has a function date_range() which is like the range() function but for dates. The function takes the start date, end date, and frequency as parameters
import pandas as pd

def get_dates():
    # Every 7th day from 2020-01-01 to 2020-12-27
    dates = pd.date_range('2020-01-01', '2020-12-27', freq='7d')
    dates = [d.strftime('%Y-%m-%d') for d in dates]
    return dates
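Calling the function confirms the dates line up exactly a week apart:
dates = get_dates()
print(dates[:3])  # ['2020-01-01', '2020-01-08', '2020-01-15']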
Defining a Function to Get the Data for a Given Day
We will only store the top 10 keywords and hashtags.
If you inspect the HTML of the website, you will notice the following
- There are two ‘ol’ elements. Both their class names are ‘ranking’
- The first element contains the Twitter Hashtags and the second element contains the Google Keywords
- Inside the ‘ol’ element, the keywords/hashtags are stored inside ‘li’ elements.
import requests
from bs4 import BeautifulSoup

def get_keywords(date):
    result = {}
    url = f'https://us.trend-calendar.com/trend/{date}.html'
    r = requests.get(url)
    if r.status_code != 200:
        # Bail out early so we don't try to parse a failed response
        print(f'Failed to get data from {url}')
        return result
    soup = BeautifulSoup(r.text, "html.parser")
The requests library will be used to make a request to the website, and the returned page will be used to initialize a Beautiful Soup object
Before storing the data, we will do the following pre-processing
- Remove the hashtag character from the beginning of the Twitter hashtags
- Convert all the data to lower case
Word clouds can use weights to vary the font size of certain words. In our case, we will assign a weight of 10 to the first ranked keyword and decrement weight for keywords lower in the ranking. Therefore the 10th keyword will have a weight of 1 and the 5th keyword will have a weight of 5.
    try:
        twitter_trends = soup.find_all('ol', 'ranking')[0].find_all('li')[0:10]
        for idx, trend in enumerate(twitter_trends):
            trend = trend.text.lstrip("#").lower()
            result[trend] = result.get(trend, 0) + (10 - idx)
    except Exception as e:
        print(e)
        print(f'Failed to get Twitter Hashtags from {url}')
The above code gets the Twitter hashtags. If a keyword already exists, we add the current weight to the previous weight. This is useful when a word appears in both the Google and Twitter lists.
    try:
        google_trends = soup.find_all('ol', 'ranking')[1].find_all('li')[0:10]
        for idx, trend in enumerate(google_trends):
            trend = trend.text.lower()
            result[trend] = result.get(trend, 0) + (10 - idx)
    except Exception as e:
        print(e)
        print(f'Failed to get Google Keywords from {url}')
    print(f"Scraped Data for {date} successfully")
    return result
The above code gets the Google Keywords.
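Calling the function for a single date returns one weight dictionary (the words and values below are hypothetical):
keywords = get_keywords('2020-01-01')
# e.g. {'happynewyear': 10, 'nye': 9, ..., 'fireworks': 1}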
Collecting Data for all Dates and storing it in a file
dates = get_dates()
keywords = {}
for date in dates:
    keywords[date] = get_keywords(date)
A request will be made to the website for each date, and the data will be stored inside a dictionary
Once our dictionary is ready, we will store the data inside a JSON file
import json

with open('data/weekly.json', 'w') as file:
    json.dump(keywords, file)
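Note that open() will not create the data folder for us; if it does not exist yet, we need to create it first:
import os
os.makedirs('data', exist_ok=True)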
We will also create a second JSON file that combines the words from all the weeks
combined_result = {}
for date, week_keyword in keywords.items():
    for keyword in week_keyword:
        combined_result[keyword] = combined_result.get(keyword, 0) + week_keyword[keyword]
This JSON file will not store the dates; it will only store each word and its weight. It will be used to produce the 2020 word cloud.
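The app will later read this combined data from data/combined.json, so we dump combined_result to disk as well:
with open('data/combined.json', 'w') as file:
    json.dump(combined_result, file)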
Creating Word Clouds ☁️
We will use the wordcloud library to create the word clouds. First, we will create a default word cloud (see below) without any image mask
The following piece of code creates a word cloud object
wordcloud = WordCloud(width=width, height=height, repeat=repeat,
                      max_words=max_words, max_font_size=max_font_size,
                      background_color=background_color)
- width – Width of the word cloud
- height – Height of the word cloud
- repeat – Boolean value. If set to True, words will be repeated to fill up blank spaces. If set to False, blank spaces will be visible
- max_words – The maximum number of words inside the word cloud
- max_font_size – The maximum font size of a word; the word with the maximum weight will have the max_font_size
- background_color – By default, it is set to black. However, we can change it
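For instance, a default cloud could be created like this (the values here are just illustrative; the app settles on its own values later):
from wordcloud import WordCloud

# Illustrative settings, not the app's final values
wordcloud = WordCloud(width=400, height=400, repeat=True,
                      max_words=200, max_font_size=25,
                      background_color='white')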
To create fancier word clouds like below, we will need to create image masks.
The PIL library will be used to open the image and numpy will be used to create the mask array
import numpy as np
from PIL import Image

path = f'data/image_masks/{image}.jpg'
mask = np.array(Image.open(path))
The path variable should point to the base image which will be used to create the mask. You can find some example images in my GitHub repo; I have provided a link to it at the end of the article.
This newly created mask variable needs to be passed as a parameter while initializing the word cloud.
wordcloud = WordCloud(width=width, height=height, repeat=repeat,
                      max_words=max_words, max_font_size=max_font_size,
                      background_color=background_color, mask=mask)
The data for the word cloud can either be in the form of a large string or a dictionary with weights. In our case, it is the latter. The WordCloud object has a method generate_from_frequencies which takes the dictionary with weights as a parameter and creates the word cloud.
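For example, assuming data is one of the weight dictionaries we scraped earlier:
wordcloud.generate_from_frequencies(data)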
Since we will give the user the ability to choose the image mask, we will put the above code inside a function
def get_word_cloud(image, data, max_words, max_font_size):
    if image == 'default':
        wordcloud = WordCloud(width=400, height=400, repeat=True,
                              max_words=max_words,
                              max_font_size=max_font_size,
                              background_color='white'
                              ).generate_from_frequencies(data)
    else:
        path = f'data/image_masks/{image}.jpg'
        mask = np.array(Image.open(path))
        wordcloud = WordCloud(width=400, height=400, repeat=True,
                              max_words=max_words,
                              max_font_size=max_font_size,
                              background_color='white',
                              mask=mask).generate_from_frequencies(data)
    return wordcloud
The above function will return the word cloud based on the given parameters.
Streamlit App 💡
Before writing any code for the Streamlit app, we will need to load the data from our JSON files
def load_data():
    with open('data/weekly.json', 'r') as file:
        weekly_keywords = json.load(file)
    with open('data/combined.json') as file:
        combined_keyword = json.load(file)
    dates = [date for date in weekly_keywords]
    return combined_keyword, weekly_keywords, dates
We will also return all the dates for which we collected the data.
import streamlit as st
import matplotlib.pyplot as plt

st.title("2020 Word Clouds based on Google Keyword and Twitter Hashtag trends")
image = st.sidebar.selectbox(label='Select Image Mask',
                             options=['default', 'twitter', 'hashtag', 'heart'])
combined_keyword, weekly_keywords, dates = load_data()
A sidebar with a dropdown will be created for the user to select the image mask they want to use.
For the 2020 word cloud, we will set the maximum number of words to 800 and the maximum font size to 15
st.header("Entire Year")
wordcloud = get_word_cloud(image,combined_keyword,800,15)
fig1 = plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
st.pyplot(fig1)
For the weekly cloud, we can increase the font size since we do not have many unique words. We will also create a dropdown for the user to select a date
st.header("Weekly")
date = st.selectbox(label='Select Date',options=dates)
keywords = weekly_keywords[date]
wordcloud = get_word_cloud(image, keywords, 200, 25)
fig2 = plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
st.pyplot(fig2)
Conclusion
This mini project could be improved further, particularly in the way we acquire the data. Below are a few suggestions
- Currently, the dates we generated are all Wednesdays. As a result, hashtags like ‘WednesdayWisdom’ or ‘WednesdayMorning’ are over-represented in our data. The intervals between the generated dates could be randomized (see the sketch after this list), or some pre-processing could be used to remove these words from our data
- Use a different data source. The website we scrape the data from is a third-party website and might have incorrect data.
- Increase options for the image masks
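As a rough sketch of the first suggestion, the step between dates could be randomized so the samples stop landing on the same weekday (get_random_dates here is a hypothetical replacement for get_dates, not code from the app):
import random
from datetime import date, timedelta

def get_random_dates(start='2020-01-01', end='2020-12-27'):
    # Walk from start to end in random 5-9 day steps so the
    # sampled dates no longer all fall on a Wednesday
    current, last = date.fromisoformat(start), date.fromisoformat(end)
    dates = []
    while current <= last:
        dates.append(current.strftime('%Y-%m-%d'))
        current += timedelta(days=random.randint(5, 9))
    return dates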
Please mention some other ways to improve the app in the comments 😃
Some of the Word Clouds are below
Resources
Github Repo
https://github.com/rahulbanerjee26/Word_Clouds
Live
https://share.streamlit.io/rahulbanerjee26/word_clouds/main/app.py