Do I have enough tweets?

In my recent post Retrieving bulk twitter data, I used the twitterscraper Python package to scrape 13 years of tweets using 24 hashtags as search terms. In this post I will try to:

use the tweets I already have to learn if there are additional hashtags I need to search for
see how well igraph and ggraph scale by recreating the ggraph plot from this blog post with a much larger dataset

Have I missed any important Twitter communities?

I searched Twitter for 24 hashtags, and it is entirely possible that I have missed a few important ones. By looking at the most frequent hashtags in the corpus I should be able to see if I have missed anything important.

To start, I need to import all of the tweets from their CSV files.

library(tidyverse)
file_names <- list.files("~/data/twitter", full.names = TRUE)
tweet_files <- vector(mode = "list", length = length(file_names))
for (i in seq_along(file_names)) {
  tweet_files[[i]] <- read_csv(file_names[i])
}
tweets_raw <- bind_rows(tweet_files)
tweets_raw %>% select(fullname, text)

## # A tibble: 411,153 x 2
##    fullname        text                                                   
##    <chr>           <chr>                                                  
##  1 𝓟𝓪𝓸𝓵𝓸           Differential expression analysis with Bioconductor and…
##  2 Paul Blaser     Gene Ontology analysis with #Python and #Bioconductor …
##  3 Lalit Kapoor    #bioconductor for analyzing genome data.. hmm looks in…
##  4 Darren Wilkins… Looking for a CDF (or a #bioconductor cdf package) for…
##  5 Davide          Nice #R and #bioconductor manual http://faculty.ucr.ed…
##  6 Seth Falcon     Pair programming some improvements to the graph packag…
##  7 Chris Lasher    Eff it, I'm going for a ride. I'll resume beating my h…
##  8 Neil Saunders   Error in dyn.load(file, DLLpath = DLLpath, ...) undefi…
##  9 Michael Dewar   is anyone an #Rstats #bioconductor genius? I have a ne…
## 10 Ricardo Vidal   Somewhat abusing biomaRt... #R #Bioconductor           
## # ... with 411,143 more rows

I also need to de-duplicate the dataset, as there are many tweets that used multiple hashtags and I don’t want to give them unnecessary weight.

tweets <- tweets_raw %>% select(-html) %>% distinct
pct_unique <- ( nrow(tweets) / nrow(tweets_raw) ) * 100
glue::glue("There are {nrow(tweets)} unique tweets out of \\
           {nrow(tweets_raw)} total ({round(pct_unique,2)}% unique).")

## There are 403833 unique tweets out of 411153 total (98.22% unique).

Using the tidytext package I can now do a little bit of text mining to see if there are any important hashtags I should add to my Twitter search list.

library(tidytext)
tweet_hashtags <- tweets %>% 
  unnest_tokens(word, text, "tweets", strip_punct=TRUE) %>% 
  filter(str_detect(word, "^#")) %>% 
  count(word, sort = TRUE)
tweet_hashtags

## # A tibble: 61,704 x 2
##    word                  n
##    <chr>             <int>
##  1 #rstats          377405
##  2 #datascience      57521
##  3 #bigdata          18830
##  4 #python           16435
##  5 #machinelearning  13993
##  6 #dataviz          11703
##  7 #analytics        10479
##  8 #r                10179
##  9 #ai                6905
## 10 #r4ds              6315
## # ... with 61,694 more rows

Nothing is jumping out as me (I’m looking for hashtags which are indicative of a community that I might have missed) so I think I’ll stick with my current list of hashtags.

Can I plot this using ggraph?

I’ve got a niggling feeling that I’m going to encounter problems with the scale of the dataset. igraph is written in C so it should scale fairly well, but with over 400,000 tweets it is important to get a feel for how these packages are going to perform on my laptop. If I can calculate the graph from my previous post (which was built from only 1,500 tweets) then I’ll be a lot more comfortable with my ability to do the rest of the project using igraph (and tidygraph/ggraph by extension).

library(rtweet)
library(igraph)
library(ggraph)

rt_g <- tweets %>% 
  filter(retweets > 0) %>% 
  select(user, text) %>% 
  unnest_tokens(word, text, "tweets", strip_punct=TRUE) %>% 
  filter(str_detect(word, "^@")) %>% 
  mutate(word = str_remove(word, "@")) %>% 
  graph_from_data_frame()

V(rt_g)$node_label <- unname(ifelse(degree(rt_g)[V(rt_g)] > 20, names(V(rt_g)), "")) 
V(rt_g)$node_size <- unname(ifelse(degree(rt_g)[V(rt_g)] > 20, degree(rt_g), 0)) 

ggraph(rt_g, layout = 'linear', circular = TRUE) + 
  geom_edge_arc(edge_width=0.125, aes(alpha=..index..)) +
  geom_node_label(aes(label=node_label, size=node_size),
                  label.size=0, fill="#ffffff66", segment.colour="slateblue",
                  color="red", repel=TRUE, fontface="bold") +
  coord_fixed() +
  scale_size_area(trans="sqrt") +
  labs(title="Retweet Relationships", subtitle="Most retweeted screen names labeled. Darkers edges == more retweets. Node size == larger degree") +
  theme_graph() +
  theme(legend.position="none")

Well that looks terrible, but the point of this activity is showing that igraph/tidygraph/ggraph will work with this much data. It took a fair while to finish calculating, but it got there in the end. For now I will call this a success as it looks like I should be able to work on this project without any special tools.

I’m happy with both the data and the tooling, so I think I’m ready to start on the analysis now!