As part of a Natural Language Processing project focused on the 2020 US presidential debates and election, I collected a total of 325,314 unique tweets. In this post, I will explain how I managed to do this using Twitter’s API and RStudio.
Twitter is built on the robust conversations happening around the world via Tweets. With Twitter’s API platform, you’ll find endpoints to capitalize on the functionality and data contained within Tweets, and to build great experiences and solutions for your customers. These endpoints enable you to manage your Tweets, publish and curate Tweets, filter and search for Tweet topics or trends, and much more (Twitter, 2020).
The first step in collecting tweets is to acquire an API key for Twitter’s REST API. This is done by applying for one at Twitter’s developer portal; luckily for me this process was very brief, and I got my API key within a day.
Having access to the Twitter Developer Portal and API enabled me to set up an application, which could then be used to connect my R environment to Twitter’s API. This would allow me to query the API from R, e.g., searching for Tweet topics or trends within specified time frames.
Figure 1.1 Configuring application on Twitter Developer Portal
Note: When configuring an application, it’s important to set the callback URL to http://127.0.0.1:1410. This allows the rtweet package to authenticate with your Twitter account.
To connect R to the newly created application, an access token has to be created using the application’s API key and secret. I used the create_token() function from the rtweet package (Kearney, 2019) to do this, configuring it as follows:
library(rtweet)

key    <- 'XXXXXXXXXXXXXXXXX' #this is not a key
secret <- 'XXXXXXXXXXXXXXXXXX'

# Connecting to Twitter API
twitter_token <- create_token(
  app = "presidentialdebate",
  consumer_key = key,
  consumer_secret = secret
)
Using create_token() from the rtweet package saves this token as a variable in .Renviron, so you can use Twitter’s API without configuring a new token every session.
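In a later R session, you can quickly check that the saved token is picked up again with get_token() from rtweet. This is just a sanity check I’m adding here, not part of the original workflow:

library(rtweet)
# Prints the stored token without asking for the key and secret again,
# provided create_token() saved it to .Renviron as described above
get_token()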
With an access token set up, Twitter’s API can now be queried from R using the rtweet package (Kearney, 2019). Previously posted Tweets can be retrieved with the search_tweets() function, whose main parameters are q (search query) and n (number of desired tweets). As I wanted to perform multiple searches with different search queries, numbers of tweets and time frames around both the first and the final presidential debate, I decided to create a .csv file (twitter_data.csv) in which each row contains the specifications of one search.
Columns in twitter_data.csv: term, n, since, until, type, path and altpath.
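To give an idea of the layout, a couple of rows of twitter_data.csv could look like the example below. The search terms, numbers, dates and paths are made up for illustration; only the column names match the actual file:

term,n,since,until,type,path,altpath
"#Debates2020",18000,2020-09-29,2020-10-06,recent,data/raw/debate1_hashtag.rds,data/clean/debate1_hashtag.rds
"Trump OR Biden",18000,2020-10-22,2020-10-29,recent,data/raw/debate2_terms.rds,data/clean/debate2_terms.rds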
By doing this, I don’t have to hard-code all these different search specifications into separate function calls, but can instead programmatically pass each row to a search_tweets() call. This can be done by looping through the rows of twitter_data.csv, passing the specified parameters to search_tweets(), and saving the resulting data.frame as an RData (.rds) file at each iteration. For this purpose I wrapped search_tweets() together with saveRDS(), which saves each query result as an .rds file, in a new function I named twitter_grab():
twitter_grab <- function(term, n, since, until, type, file) {
  tweets <- search_tweets(q = term,
                          n = n,
                          include_rts = FALSE,
                          lang = "en",
                          since = since,
                          until = until,
                          type = type,
                          retryonratelimit = TRUE
                          )
  saveRDS(tweets, file = file)
  cat(file, "saved")
}
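As a usage example, a single call to twitter_grab() might look like the following; the query, dates and file path are invented for illustration, and in practice they come from a row of twitter_data.csv:

twitter_grab(term  = "#Debates2020",       # hypothetical search query
             n     = 18000,                # number of tweets requested
             since = "2020-10-23",         # hypothetical start date
             until = "2020-10-30",         # hypothetical end date
             type  = "recent",             # search type passed on to search_tweets()
             file  = "data/raw/debate_final_hashtag.rds")  # hypothetical output path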
As the Twitter API only returns tweets sent during the past 7 days, I had to collect the data on multiple occasions: first immediately after the first debate, then 7 days later, and the same again for the final debate. There are probably more elegant ways of doing this, e.g. cron jobs, but I decided to run the loop manually on each occasion, specifying in the loop call which rows to query.
# Grabbing tweets programmatically using a .csv file
ref <- read.csv('twitter_data.csv', stringsAsFactors = FALSE) #.csv file containing paths and search specs

# Looping through data.frame ref calling twitter_grab() function
for (row in 31:nrow(ref)) {
  twitter_grab(
    ref$term[row],
    ref$n[row],
    ref$since[row],
    ref$until[row],
    ref$type[row],
    ref$path[row]
  )
}
The above loop was the last one I ran, collecting Tweets sent during the week following the final presidential debate. It iterates over the rows of twitter_data.csv from row 31 onwards and passes each row’s search specifications to my function twitter_grab().
In total, the Twitter data collection resulted in 42 queries (one for each row in twitter_data.csv), which were then saved as 42 individual RData (.rds) files.
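As a quick sanity check (my addition, not part of the original scripts), one can confirm that every path listed in twitter_data.csv now points to a saved file:

# Should return TRUE once all 42 query results have been written to disk
all(file.exists(ref$path))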
Now that I have collected the Tweets, the text variable needs to be cleaned before the data can be used in a Natural Language Processing project, i.e., links, mentions, hashtags and other special characters have to be removed. To do this, I found several examples on StackOverflow using regular expressions to clean Tweets and modified them into a function:
library(tidyverse)
library(data.table)

# Cleaning function modified from StackOverflow
clean_tweets <- function(x) {
  x %>%
    str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>% #removes links
    str_replace_all("&", "and") %>% #replaces & with and
    str_remove_all("^RT:? ") %>% #removes retweets
    str_remove_all("@[[:alnum:]]*") %>% #removes mentions
    str_remove_all("#[[:alnum:]]*") %>% #removes hashtags
    str_remove_all("(?![.,])[[:punct:]]") %>% #removes punctuation except . and ,
    str_replace_all("\\\n", " ") %>%
    str_to_lower() %>% # Converts to lower case
    str_trim("both") # Removes whitespace on both sides of string
}
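To illustrate roughly what the function does, here is a made-up tweet passed through it:

clean_tweets("RT: Watch the #debate tonight!!! @JoeExample https://t.co/abc123")
# Expected result, roughly: "watch the  tonight"
# The link, RT prefix, mention, hashtag and punctuation are stripped,
# but leftover double spaces are not collapsed by this function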
Having a function that can clean the text variable of the Twitter data, I now need to load the 42 .rds files, clean the text variable, and save 42 new .rds files while keeping a backup of the raw data. To do this I again use twitter_data.csv, first to read in the aforementioned data files and then to save the cleaned data using the altpath column. Furthermore, Twitter data contains many variables (around 90), and as I’m only interested in a few of these (and have a backup of the raw data anyway), I will keep only the variables of interest in the new data files. To do this I initialize an R object, voi, containing the names of the variables to keep.
# Cleaning tweets programmatically
ref <- read.csv('twitter_data.csv', stringsAsFactors = FALSE) # .csv file containing path and altpath

voi <- c('user_id', 'status_id', 'created_at',
         'text', 'hashtags', 'location',
         'retweet_count', 'favorite_count', 'followers_count') # Variables of interest
Much like when collecting the Tweets, I now loop through twitter_data.csv, loading each file using the path column and selecting the variables of interest (voi). The text variable of each file is then passed to my function clean_tweets() before the data.frame is saved as a new .rds file using the altpath column of twitter_data.csv.
# Looping through ref reading in datasets containing tweets and cleaning text variable
for (row in 1:nrow(ref)) {
  df <- readRDS(ref$path[row]) %>%
    select(all_of(voi))
  df$text <- clean_tweets(df$text)
  saveRDS(df, ref$altpath[row])
}
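A quick way to confirm the result (again my own sanity check, not part of the original scripts) is to read one of the cleaned files back in and inspect it:

df_check <- readRDS(ref$altpath[1])
names(df_check)      # should list only the nine variables in voi
head(df_check$text)  # cleaned tweet texts, lower case and without links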
Having an individual file for each query is nice for backup purposes; for analysis, however, loading 42 individual files and applying the same operations to each doesn’t seem effective. Therefore, in this next section I will join the files together into two distinct data.frames, one with the tweets collected during and after the first debate and one for the final debate.
In order to join the files together I need to load the newly created clean .rds files. I do this by creating a list of data.frames called df_list. I then use the rbindlist() function to bind together the rows from the files pertaining to the first debate, and remove duplicates with distinct() on the status_id variable. The resulting data.frame is saved as a new .rds file, and the same is done for the files pertaining to the final debate.
# Loading cleaned datasets as list of dataframes
df_list <- lapply(ref$altpath, readRDS)
ref_clean <- read.csv('twitter_data_clean.csv', stringsAsFactors = FALSE) # .csv file containing paths for cleaned data

# Combining cleaned datasets into two dataframes with regard to collection date
presdebate_first <- rbindlist(df_list[1:18]) %>%
  distinct(status_id, .keep_all = TRUE)
saveRDS(presdebate_first, ref_clean$path[1])

presdebate_final <- rbindlist(df_list[19:39]) %>%
  distinct(status_id, .keep_all = TRUE)
saveRDS(presdebate_final, ref_clean$path[2])
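To double-check the final count, the row counts of the two combined data.frames can simply be added (a small check added here for illustration):

# Total number of unique tweets across both debates
nrow(presdebate_first) + nrow(presdebate_final)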
This resulted in two RData (.rds) files containing a total of 325,314 unique tweets, which are now ready for analysis. If you are interested in checking out this project in more depth, the complete source code can be found on GitHub.