Mapping the #rstats Twitter community

As an exercise in proving out some of the technical requirements for this project, I’m going to follow along with Bob Rudis’ book, 21 Recipes for Mining Twitter Data with rtweet.

For this exercise, I will only need the tidyverse, rtweet, igraph and ggraph packages.

library(tidyverse)
library(rtweet)
library(igraph)
library(ggraph)

Setting up OAuth

If you’re running something like this for the first time, you will need to register your application with Twitter and authenticate with OAuth. You only need to do this once: as long as you have a valid token, and the location of that token is stored in the TWITTER_PAT environment variable, you don’t need to repeat it every time you connect to Twitter.

twitter_token <- create_token(
  app = Sys.getenv("TWITTER_APP"),
  consumer_key = Sys.getenv("TWITTER_CONSUMER_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_SECRET")
)
saveRDS(twitter_token, "~/.rtweet.rds")
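The saved token is only found in later sessions if TWITTER_PAT points at it. Here is a minimal base R sketch of one way to wire that up, assuming the ~/.rtweet.rds path used above. The helper name register_token_path is hypothetical, and a temporary file stands in for the real ~/.Renviron so the example is safe to run. Some rtweet releases may already handle this inside create_token(), in which case this step is unnecessary.

```r
# Hypothetical helper (not part of rtweet): append a TWITTER_PAT entry to an
# .Renviron-style file and load that file into the current session.
register_token_path <- function(token_path, renviron_path) {
  entry <- paste0("TWITTER_PAT=", path.expand(token_path))
  cat(entry, "\n", sep = "", file = renviron_path, append = TRUE)
  readRenviron(renviron_path)
  invisible(Sys.getenv("TWITTER_PAT"))
}

# A temporary file stands in for ~/.Renviron in this demonstration.
demo_renviron <- tempfile(fileext = ".Renviron")
register_token_path("~/.rtweet.rds", demo_renviron)

# The variable is now visible to anything that looks it up in this session.
Sys.getenv("TWITTER_PAT")
```

To make the setting permanent, the same helper could be pointed at the real ~/.Renviron, which R reads at startup.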

Now we can test that everything is working by grabbing the top Twitter trends for Sydney.

aus_trends <- get_trends("sydney")
aus_trends %>% select(trend)

## # A tibble: 42 x 1
##    trend                  
##  * <chr>                  
##  1 #UFC229                
##  2 Quentin Kenihan        
##  3 Andrew Gaff            
##  4 #ConstellationCup      
##  5 Safety Car             
##  6 #repTourDallas         
##  7 #BTSxCitiField         
##  8 Argentina              
##  9 #Banksy                
## 10 Productivity Commission
## # ... with 32 more rows

These sure don’t match the trends for Sydney on the Twitter website, which are almost exclusively sporting hashtags at the moment. Thankfully we don’t really need to figure out what’s broken here as we won’t be using Twitter’s trends functionality, but at least we can confirm that the OAuth token is working.

Grabbing some #rstats tweets

If everything does what it says on the box, then I should be able to grab a big pile of tweets with just one command: rtweet::search_tweets(). Let’s give it a go!

rstats_sample <- search_tweets("#rstats", n = 1000, include_rts = FALSE)
rstats_sample %>% select(name, text) %>% sample_n(10)

## # A tibble: 10 x 2
##    name               text                                                
##    <chr>              <chr>                                               
##  1 uwe sterr          "I just😍  knitr::include_url, use it in #bookdown a…
##  2 Agnese Vardanega   A very very simple function to export tables in csv…
##  3 Thomas Hütter      Posted by Alex Joseph, now on R-bloggers: Comparing…
##  4 R-bloggers         Three new domain-specific (embedded) languages with…
##  5 Bio Lab Analytics… Principal components analysis (PCA) is a powerful #…
##  6 Emily Webb         #rstats https://t.co/twi1YZQpQC                     
##  7 LIBD rstats club   "We are getting closer to publishing our blog post …
##  8 R-bloggers         America – get ready for EARL! https://t.co/zl4cRtoL…
##  9 Tyler Morgan-Wall  Don't worry, I didn't forget about 3D ggplot render…
## 10 R-bloggers         12 Best Data Science Resources on the Internet http…

We can now have a bit of a dig into the data to see who has been getting lots of retweets, and what they have been tweeting about.

top_retweets <- rstats_sample %>% 
  group_by(name) %>% 
  tally(retweet_count, sort=TRUE)
top_retweets

## # A tibble: 469 x 2
##    name                     n
##    <chr>                <int>
##  1 Dr. GP Pulipaka        925
##  2 R-bloggers             712
##  3 Kirk Borne             359
##  4 Mine CetinkayaRundel   254
##  5 Mara Averick           184
##  6 RStudio                155
##  7 Aspioneer              131
##  8 Hank Hershey            58
##  9 Sharon Machlis          55
## 10 boB Rudis               33
## # ... with 459 more rows

rstats_sample %>% 
  arrange(desc(retweet_count)) %>% 
  select(retweet_count, name, text)

## # A tibble: 983 x 3
##    retweet_count name           text                                      
##            <int> <chr>          <chr>                                     
##  1           254 Mine Cetinkay… Teaching (with) R? Consider adding your c…
##  2           155 RStudio        r2d3: R Interface to D3 Visualizations ht…
##  3           152 Kirk Borne     Colossal Collection of Convenient Cheat S…
##  4            57 Hank Hershey   Coming soon to a github repository near y…
##  5            52 Sharon Machlis "My #rstats book Practical R for Mass Com…
##  6            47 Kirk Borne     Build a strong foundational knowledge of …
##  7            46 Kirk Borne     "A Big List of Lists of #DataScience and …
##  8            38 Kirk Borne     Chart Suggestions — a thought-starter for…
##  9            37 R-bloggers     Animating a Monte Carlo Simulation https:…
## 10            37 Dr. GP Pulipa… "Free eBook: Data Science Algorithms in a…
## # ... with 973 more rows
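The two tables answer different questions: the first sums retweets per account, the second ranks individual tweets. An account can therefore rank highly on the first without a single tweet in the second. A small sketch on made-up data (the names and counts below are invented for illustration, not drawn from the sample) shows the difference:

```r
library(dplyr)

# Made-up tweets for illustration only; not taken from the #rstats sample.
toy_tweets <- tibble(
  name          = c("A", "A", "A", "B", "C", "C"),
  retweet_count = c(10L, 12L, 8L, 25L, 0L, 3L)
)

# Per-account summary: number of tweets, total retweets, and best single tweet.
toy_summary <- toy_tweets %>%
  group_by(name) %>%
  summarise(
    tweets      = n(),
    total_rts   = sum(retweet_count),
    biggest_hit = max(retweet_count)
  ) %>%
  arrange(desc(total_rts))

toy_summary
```

Here account A has the highest total from several modest tweets, while account B owns the single most retweeted tweet. Adding a tweet count alongside the total, as above, makes that pattern easier to spot in the real data.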

It is of course entirely unsurprising to find that Mara Averick (@dataandme) features on this leaderboard.

Turning Twitter data into a graph

I’ll stick to following the Bob Rudis book again here, with future blog posts to look at doing this in a more targeted way. For now the objectives are to:

join all the dots together so I know I have my local R environment playing nicely with rtweet and igraph
make some pretty images so I can make sure blogdown is ready for future blog posts

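rstats <- search_tweets("#rstats", n=1500)  # retweets are included by default

rt_g <- filter(rstats, retweet_count > 0) %>%   # keep tweets retweeted at least once
  select(screen_name, mentions_screen_name) %>%
  unnest(mentions_screen_name) %>%              # one row per (author, mention) pair
  filter(!is.na(mentions_screen_name)) %>%      # drop tweets that mention nobody
  graph_from_data_frame()                       # directed edge: author -> mentioned account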

V(rt_g)$node_label <- unname(ifelse(degree(rt_g)[V(rt_g)] > 20, names(V(rt_g)), "")) 
V(rt_g)$node_size <- unname(ifelse(degree(rt_g)[V(rt_g)] > 20, degree(rt_g), 0)) 

ggraph(rt_g, layout = 'linear', circular = TRUE) + 
  geom_edge_arc(edge_width=0.125, aes(alpha=..index..)) +
  geom_node_label(aes(label=node_label, size=node_size),
                  label.size=0, fill="#ffffff66", segment.colour="slateblue",
                  color="red", repel=TRUE, fontface="bold") +
  coord_fixed() +
  scale_size_area(trans="sqrt") +
  labs(title="Retweet Relationships", subtitle="Most retweeted screen names labeled. Darker edges == more retweets. Node size == larger degree") +
  theme_graph() +
  theme(legend.position="none")
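To see what the graph-building and labelling steps do without calling the Twitter API, here is a small sketch on invented data. The account names are made up, and the degree threshold is lowered from 20 to 2 so that the toy graph has something to label. It assumes, as in the code above, that mentions_screen_name is a list-column with one character vector per tweet.

```r
library(dplyr)
library(tidyr)
library(igraph)

# Made-up accounts; mentions_screen_name is a list-column, one vector per tweet.
toy <- tibble(
  screen_name = c("ann", "bob", "cat", "dan"),
  mentions_screen_name = list(c("hub", "bob"), "hub", "hub", NA_character_)
)

toy_g <- toy %>%
  unnest(mentions_screen_name) %>%          # one row per (author, mention) pair
  filter(!is.na(mentions_screen_name)) %>%  # drop tweets that mention nobody
  graph_from_data_frame()                   # first two columns become directed edges

# degree() counts incoming plus outgoing edges by default.
deg <- degree(toy_g)

# Label only the nodes above the threshold (2 here; the post uses 20).
V(toy_g)$node_label <- ifelse(deg > 2, names(deg), "")

data.frame(name = V(toy_g)$name, degree = unname(deg), label = V(toy_g)$node_label)
```

Each edge runs from the tweeting account to an account it mentions, so a heavily mentioned account collects a high degree even if it never tweets itself. In this toy graph only "hub" crosses the threshold, and "dan" never enters the graph because that tweet mentions nobody.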

Cool! It worked!

For the next blog post I’ll look to take a much larger sample, clean up the mess that is “retweeting” (because retweets are important in the community, but producers are obviously more important) and then use igraph to try to identify distinct communities.