What your texts say about you

Gurudev Ilangovan

October 3, 2018


Introduction

Without data, you’re just another person with an opinion - W. Edwards Deming

We all have intuitions, questions, or even hypotheses, if you will, about people’s texting patterns. Tom uses a lot of emojis, Daisy is a cynic, and Harry swears a lot. But unless we can substantiate our theories with data, our guesses will remain speculations, however educated.

I had a few questions. What are people’s most used words? What do these words say about us? What are the most popular emojis? Are people positive or negative in general when they speak? To answer such questions I decided to look at chats, and the place where I have the most chats is WhatsApp. So I took one WhatsApp group chat for analysis. Obviously, blogging about chats with other people is tricky, because… it involves chats with other people. Apart from getting the said group members’ consent, I anonymized their names and took utmost care not to publish the chats themselves (apart from a preview of the first 10 messages).

Processing

If you are interested only in the results, please ignore the code blocks wherever you see them; skip this section and head over to the analysis section right away.

If you want to repeat this for your own chats, this is the method I followed:

  1. Go to the chat window that you want to analyze in WhatsApp.
  2. Go to options; there will be an option to export the chat via email.
  3. Send it to your email and download the chat text file.

I use WhatsApp on an Android phone, and I followed the above steps to get the chat text file. Naturally, I searched for preexisting R packages to parse the text file. There were a few, but they were outdated and incompatible with the current format of the export file. So I wrote my own parser.

Patterns

I am essentially constructing pattern matchers with regular expressions, but I am using an excellent package called rebus that builds the regex patterns from human-readable functions. I am pretty sure the patterns I have generated won’t cover all the possibilities and so might not work for you, but with a little tweaking they can easily be made to work.
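To give a quick feel for how rebus works (an illustrative aside, not part of the original analysis): a rebus expression is just a character string under the hood, so printing one shows the regex it compiles to.

library(rebus)
one_or_more(DGT) %R% "/" %R% one_or_more(DGT)   # prints the underlying regex for a date fragment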

pacman::p_load(tidyverse, rebus, lubridate, tidytext, magrittr,
               sentimentr, plotly, rtweet, ggthemes)

df_emojis <- rtweet::emojis

pattern_emojis <- df_emojis %>% pull(code) %>%  literal() %>% or1()

# Matches the timestamp prefix that starts every message in the export,
# i.e. a newline followed by "m/d/yy, h:mm AM - " (or PM)
pattern_new_message <- 
  NEWLINE %R% one_or_more(DGT) %R% "/" %R% one_or_more(DGT) %R% "/" %R% 
  one_or_more(DGT) %R% "," %R%
  SPACE %R% one_or_more(DGT) %R% ":" %R% one_or_more(DGT) %R% 
  rebus::or(" AM", " PM") %R% " - "

# Matches senders that appear as phone numbers:
# a "+" followed by three space-separated groups of digits
pattern_phone <- 
  literal("+") %R% 
  one_or_more(DGT) %R% 
  SPACE %R%
  one_or_more(DGT) %R%
  SPACE %R%
  one_or_more(DGT)

# Matches senders saved as contact names (one to three space-separated words)
pattern_name <- 
  one_or_more(ALPHA) %R%
  zero_or_more(SPACE) %R%
  zero_or_more(ALPHA) %R%
  zero_or_more(SPACE) %R%
  zero_or_more(ALPHA)
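Before parsing the whole file, it helps to sanity-check the message pattern against a single line. The sample line below is made up by me (the export format can vary by locale, region settings, and app version), so treat this as a rough check rather than a guarantee:

sample_line <- "\n5/7/18, 11:45 PM - A: My life"   # hypothetical export line
str_detect(sample_line, pattern_new_message)        # should be TRUE if your export uses this format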

Parsing

After generating the patterns, I ran the code block below to parse the file. (Actually, I ran this plus a few more commands to anonymize people. I also removed a list of words from the analysis in the interest of anonymity.)

# input file
txt_chat <- read_file("~/projects/sandbox/data/xxx.txt")

txt_split <- 
  txt_chat %>% 
  str_split(pattern_new_message) %>% 
  pluck(1)

txt_split <- txt_split[2:length(txt_split)]

txt_new_message_pattern <- 
  txt_chat %>% 
  str_extract_all(pattern_new_message) %>% 
  pluck(1)


df_chats <- 
  tibble(chat_time = txt_new_message_pattern,
         chat_text = txt_split) %>%
  mutate(sender = 
           chat_text %>% 
           str_extract(rebus::or(pattern_phone, pattern_name)),
         text = 
           chat_text %>% 
           str_replace(literal(sender), "") %>% 
           str_replace(zero_or_more(literal(":")), "") %>% 
           str_replace(SPACE, "") %>% 
           str_trim(),
         time = mdy_hm(chat_time), 
         emoji = str_extract(text, one_or_more(pattern_emojis)),
         emoji_count = str_count(text, pattern_emojis),
         media = text %>% str_detect(literal("<Media omitted>")),
         text = ifelse(media, "", text)
         ) %>%
  add_count(sender) %>%
  filter(!(sender %>% str_detect(DGT)),
         !(sender %>% str_detect(or1(c("created", "added", "left", "changed"))))) %>%
  mutate(id_chat = row_number()) %>% 
  select(id_chat, time, sender, text, media, emoji, emoji_count)

Let’s see how the table looks.

| id_chat | time                | sender | text                                                                                  | media | emoji      | emoji_count |
|---------|---------------------|--------|---------------------------------------------------------------------------------------|-------|------------|-------------|
| 1       | 2018-05-07 23:45:00 | A      | My life 😂😂😂😂😂                                                                      | FALSE | 😂😂😂😂😂 | 5           |
| 2       | 2018-05-07 23:47:00 | A      | Sorry for starting on a morbid note 😂😂                                                | FALSE | 😂😂       | 2           |
| 3       | 2018-05-07 23:47:00 | A      | It just came                                                                          | FALSE | NA         | 0           |
| 4       | 2018-05-07 23:48:00 | B      | Wait                                                                                  | FALSE | NA         | 0           |
| 5       | 2018-05-07 23:48:00 | B      | One bad joke is coming up                                                             | FALSE | NA         | 0           |
| 6       | 2018-05-07 23:49:00 | B      | Poor network                                                                          | FALSE | NA         | 0           |
| 7       | 2018-05-07 23:49:00 | B      |                                                                                       | TRUE  | NA         | 0           |
| 8       | 2018-05-07 23:50:00 | A      | 😂😂😂😂😂                                                                              | FALSE | 😂😂😂😂😂 | 5           |
| 9       | 2018-05-07 23:52:00 | B      | All pls to indicate that you exist by periodically posting said bad jokes. Nandri🙏🏼    | FALSE | 🙏         | 1           |
| 10      | 2018-05-07 23:53:00 | A      | When guys say bad jokes they usually mean something like this                         | FALSE | NA         | 0           |

Awesome. Now that we have data in a tidy format, let’s delve into the analysis.

Analysis

Pet words

Overall word usage

What words do the group members use a lot overall?

df_words <- 
  df_chats %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words, "word") %>% 
  # var_ignore_words is a vector of words (names and other identifying terms)
  # removed for anonymity; its contents are not shown here
  filter(!word %in% var_ignore_words) 

df_words %>% 
  add_count(word) %>% 
  mutate(rank = dense_rank(desc(n))) %>% 
  filter(rank <= 15) %>% 
  arrange(rank) %>% 
  mutate(word = fct_inorder(word) %>% fct_rev()) %>% 
  ggplot(aes(word, fill = sender)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  scale_fill_tableau()

The members seem to agree. A lot.

If you’re wondering why words like “and” and “the” are not the most common, kudos. Those words are indeed the most common, but I have removed them (they are commonly called “stop words” in text mining) because they are not informative.
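For reference, the stop word list used in the anti_join above is the stop_words data frame that ships with tidytext; a quick peek (purely illustrative) shows the lexicons it combines:

# stop_words has one row per word per lexicon
tidytext::stop_words %>% count(lexicon)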

Word usage by member

And what words does each member use? Let’s look at the five most used words.

df_words %>% 
  count(sender, word, sort = T) %>% 
  group_by(sender) %>% 
  arrange(desc(n), .by_group = T) %>% 
  slice(1:5) %>% 
  ungroup() %>% 
  mutate(word = word %>% fct_inorder() %>% fct_rev()) %>% 
  ggplot(aes(word, n, fill = sender)) +
  geom_col() +
  coord_flip() +
  theme_light() +
  facet_grid(sender ~ ., scales="free") +
  theme(strip.text.x = element_text(angle = 0)) +
  scale_fill_tableau()

Word usage made better

Well, most of the words are commonly used ones: very predictable and not very informative. How do we find words that are important relative to each member? Enter tf-idf! Tf-idf (term frequency - inverse document frequency) is a concept from information retrieval in which the importance of each term is measured relative to how commonly that term is used by others. Basically, if you use a term a lot but that term is used a lot in general, the count does not accurately reflect who you are. It’s the words that you use a lot but others don’t that say more about you. That’s the basic intuition, but the Wikipedia article is actually pretty accessible if you want to dig deeper.
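As a rough, hand-rolled sketch of what bind_tf_idf (used below) computes, treating each sender’s messages as one “document” (this is my own illustration, not code from the original analysis):

# tf  = how often a sender uses a word, relative to all their words
# idf = log(number of senders / number of senders who use the word)
df_words %>% 
  count(sender, word) %>% 
  group_by(sender) %>% 
  mutate(tf = n / sum(n)) %>% 
  group_by(word) %>% 
  mutate(idf = log(n_distinct(df_words$sender) / n())) %>% 
  ungroup() %>% 
  mutate(tf_idf = tf * idf)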

df_words %>% 
  count(sender, word) %>% 
  filter(n > 15) %>% 
  bind_tf_idf(word, sender, n) %>% 
  group_by(sender) %>% 
  arrange(desc(tf_idf)) %>% 
  slice(1:5) %>% 
  ungroup() %>% 
  mutate(word = word %>% fct_inorder() %>% fct_rev()) %>% 
  ggplot(aes(word, tf_idf, fill = sender)) +
  geom_col() +
  coord_flip() +
  theme_light() +
  facet_grid(sender ~ ., scales="free") +
  theme(strip.text.x = element_text(angle = 0)) +
  scale_fill_tableau()

Foggy as they are, personalities seem to emerge.

  • A - playful?
  • B - philosophical - Just look at the words: life, duty, book, days…
  • C - jejune?

We could actually look at more words, but I don’t want to give too much away. 😜

⌚ 4⃣ 😂😋😍

In case you are wondering what the heading means, it’s “Time for emojis”. I promise I’ll try not to do that again. Anyway, can we repeat for emojis the same analysis that we did with words?

Let’s put a smiley on that face

Using the same concept of tf-idf, we can now find the emojis that characterize people better than raw counts would.
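The code below uses a table called df_emoji_split with one row per individual emoji. That intermediate step isn’t shown in the post; a minimal sketch of how it could be built from df_chats (my assumption, not the original code) looks like this:

# One row per emoji per chat, with descriptions joined from rtweet::emojis
# (which has `code` and `description` columns)
df_emoji_split <- 
  df_chats %>% 
  filter(!is.na(emoji)) %>% 
  mutate(emoji = str_extract_all(emoji, pattern_emojis)) %>% 
  unnest(emoji) %>% 
  left_join(df_emojis, by = c("emoji" = "code")) %>% 
  select(id_chat, sender, emoji, description)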

df_emoji_split %>% 
  count(sender, emoji, description, sort = T) %>%
  bind_tf_idf(emoji, sender, n = n) %>% 
  group_by(sender) %>% 
  filter(tf_idf == max(tf_idf)) %>% 
  arrange(sender) %>%
  ungroup() %>% 
  mutate(sender = sender %>% fct_inorder() ) %>% 
  ggplot(aes(sender, y = 10, label = emoji)) +
  geom_text(size = 30) +
  theme_minimal() +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_text(size = 40),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank()) 

Emoji to word ratio

df_chats %>% 
  unnest_tokens(word, text) %>% 
  group_by(id_chat, sender) %>% 
  summarise(word_count = n()) %>% 
  left_join(
    df_chats %>% 
    # filter(!is.na(emoji)) %>% 
    select(id_chat, time, emoji_count),
    by = "id_chat"
  ) %>% 
  ungroup() %>%
  mutate(week = week(time)) %>% 
  group_by(sender, week) %>% 
  summarise(emoji_count_sum = sum(emoji_count),
            word_count_sum = sum(word_count),
            emoji_to_word = emoji_count_sum/word_count_sum) %>% 
  ggplot(aes(week, emoji_to_word, col = sender)) +
  ylim(0, NA) +
  geom_smooth(se = F) +
  scale_color_tableau() +
  theme_minimal()

The plot could mean one of two things:

  • Something genuinely funny or sad or emotional was being discussed in weeks 19 and 28.
  • A classic case of conformity, or even regression towards the mean? If one person uses a lot of emojis, other members tend to as well. This seems more likely, as the 3 lines follow the same trend from start to finish.

Also, is A’s steady decline in emoji usage a sign of coming to terms with the roflitis that was quickly pointed out in the group? I may never know.

Getting sentient about sentiments

Finally, let’s take a look at the chats from a sentiment perspective. Sentiment analysis is the process of analyzing a given corpus of text and using some method to arrive at sentiment scores. Naively done, a score is assigned to each word and the scores of all the words in a sentence are added up. But a sentence like “The movie was not good at all” would still be evaluated as positive by such a technique. So I am using a package called sentimentr, which accounts for valence shifters (like ‘not’) and intensifiers (like ‘very’). For example,

tibble(
  text = c("The movie was horrible",
            "The movie was not good at all",
            "The movie was okay",
            "The movie was good",
            "The movie was very impressive")) %>% 
   mutate(num = row_number(),
          review = get_sentences(text)) %$% 
   sentiment_by(review, list(num, text)) %>%
   sentimentr::highlight()

(Figure: the example sentences highlighted by sentiment score, the output of sentimentr::highlight().)

Now that we know what we can do, let’s calculate the sentiment scores for the chats.

df_sentiment <- 
  df_chats %>% 
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_sent = get_sentences(sentence),
         id_sentence = row_number())


df_sentiment <- 
  df_sentiment %$%
  sentiment_by(sentence_sent, list(id_sentence, sender)) %>% 
  left_join(df_sentiment, c("id_sentence","sender")) %>% 
  rename(sentiment = ave_sentiment) %>% 
  as_tibble()

The overall sentiment scores for the members.

df_sentiment %>% 
  filter(sentiment != 0) %>% 
  mutate(rating = ifelse(sentiment > 0, "positive", "negative"),
         sender = sender %>% fct_rev(),
         week = week(time)) %>% 
  group_by(sender, week, rating) %>% 
  summarise(sentiment_sum = sum(sentiment),
            sentiment_mean = mean(sentiment)) %>% 
  ungroup() %>% 
  ggplot(aes(sender, sentiment_sum, fill = rating)) +
  geom_col() +
  theme_minimal() +
  coord_flip() +
  scale_fill_hue(l = 40) 

A and C have almost identical bars, while B’s has a clear offset to the left. Perhaps my claim that B is a little philosophical might be true after all. Sometimes, the more we think about life, the less rosy it seems.

Also, this graph could be redrawn after accounting for the sentiments that emojis represent. Currently, sentimentr does not support emojis. However, we could substitute the emojis’ descriptions for the emojis themselves and then calculate the sentiments. With A’s roflitis, I suspect this approach would make the others seem too pessimistic in comparison, so I didn’t go that route.
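For completeness, here is a hedged sketch of what that substitution could look like (I did not run this for the analysis; fixed() is used so the emoji characters are treated as literal strings rather than regex):

# emoji_lookup: names are emoji characters, values are their text descriptions
emoji_lookup <- set_names(paste0(" ", df_emojis$description, " "), df_emojis$code)

df_chats_described <- 
  df_chats %>% 
  mutate(text = str_replace_all(text, fixed(emoji_lookup)))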

Conclusion

Admittedly, one group chat tells only a small part of the story. If there were a way to get all WhatsApp chats in a single step, we could do much richer analyses. For instance, it would be interesting to see the chat behavior of the same person in different groups. Similarly, we could see how the proportion of texts sent to each group or person varies over time. That could shed light on which bonds are becoming stronger and which ones are weakening. Sentiment analysis on all our texts could help us understand how our moods vary. So many possibilities…

Any comments, feedback or suggestions are, as people say, 😍😍😍.
