Regex

Readings and class materials for Tuesday, September 26, 2023

Motivation

“Strings are not glamorous, high-profile components of R, but they do play a big role in many data cleaning and preparation tasks. The stringr package provides a cohesive set of functions designed to make working with strings as easy as possible.”

Source: Package description

CHEATSHEETS

Tip

All functions in stringr are prefixed with str_ and take a vector of strings as their first argument. The shared prefix makes it easy to find the function you need: just type “str_” and press TAB to browse the options.

courses <- c("Big data", "Behavioral economics", "Dynamic macroeconomics 2", "Communication", "Economic instituions")

courses

[1] "Big data"                 "Behavioral economics"    
[3] "Dynamic macroeconomics 2" "Communication"           
[5] "Economic instituions"

Basics

stringr is also part of the tidyverse, so you do not have to load it individually.

library(tidyverse)

Combine strings:

str_c(courses, " 2")

[1] "Big data 2"                 "Behavioral economics 2"    
[3] "Dynamic macroeconomics 2 2" "Communication 2"           
[5] "Economic instituions 2"

Which subject is about economics?

str_detect(courses, "economics")

[1] FALSE  TRUE  TRUE FALSE FALSE

Of course, these functions can also be used in the structure seen earlier (in a tidy format).

tibble(courses)

# A tibble: 5 × 1
  courses                 
  <chr>                   
1 Big data                
2 Behavioral economics    
3 Dynamic macroeconomics 2
4 Communication           
5 Economic instituions

tibble(courses) %>% 
  mutate(
    about_economics = str_detect(courses, "economic")
  )

# A tibble: 5 × 2
  courses                  about_economics
  <chr>                    <lgl>          
1 Big data                 FALSE          
2 Behavioral economics     TRUE           
3 Dynamic macroeconomics 2 TRUE           
4 Communication            FALSE          
5 Economic instituions     FALSE

Warning

If you look carefully at the outcome, you can see that these functions are cAsE sENsItIVE (the FALSE value in the last row).

Solution 1. - convert everything to lower case

tibble(courses) %>% 
  mutate(
    courses = str_to_lower(courses),
    about_economics = str_detect(courses, "economic")
  )

# A tibble: 5 × 2
  courses                  about_economics
  <chr>                    <lgl>          
1 big data                 FALSE          
2 behavioral economics     TRUE           
3 dynamic macroeconomics 2 TRUE           
4 communication            FALSE          
5 economic instituions     TRUE

Solution 2. - detect with lower and upper case

tibble(courses) %>% 
  mutate(
    about_economics = str_detect(courses, "economic|Economic")
  )

# A tibble: 5 × 2
  courses                  about_economics
  <chr>                    <lgl>          
1 Big data                 FALSE          
2 Behavioral economics     TRUE           
3 Dynamic macroeconomics 2 TRUE           
4 Communication            FALSE          
5 Economic instituions     TRUE

Solution 3. - detect with lower and upper first letter

tibble(courses) %>% 
  mutate(
    about_economics = str_detect(courses, "[eE]conomic")
  )

# A tibble: 5 × 2
  courses                  about_economics
  <chr>                    <lgl>          
1 Big data                 FALSE          
2 Behavioral economics     TRUE           
3 Dynamic macroeconomics 2 TRUE           
4 Communication            FALSE          
5 Economic instituions     TRUE
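A fourth option, not covered in the solutions above, is to keep the strings unchanged and make the pattern itself case-insensitive. The sketch below assumes the same courses vector and uses stringr's regex() helper with ignore_case = TRUE; the variable name about_economics is only illustrative.

```r
library(stringr)

courses <- c("Big data", "Behavioral economics", "Dynamic macroeconomics 2",
             "Communication", "Economic instituions")

# regex() wraps the pattern; ignore_case = TRUE makes the match
# case-insensitive without modifying the original strings.
about_economics <- str_detect(courses, regex("economic", ignore_case = TRUE))

about_economics
# [1] FALSE  TRUE  TRUE FALSE  TRUE
```

Unlike Solution 1, this keeps the original capitalisation of the course names in the data.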

Regex

Most string functions work with regular expressions, a concise language for describing patterns of text.

[eE]conomic was an example of a regular expression (regex): the character class [eE] matches either “e” or “E”.

Regex has a large number of special characters that we can use to describe the patterns we are looking for.

For example: \\d matches any digit (the backslash is doubled because it must be escaped inside an R string)

tibble(courses) %>% 
  mutate(
    about_economics = str_detect(courses, "economic"),
    not_single_course = str_detect(courses, "\\d")
  )

# A tibble: 5 × 3
  courses                  about_economics not_single_course
  <chr>                    <lgl>           <lgl>            
1 Big data                 FALSE           FALSE            
2 Behavioral economics     TRUE            FALSE            
3 Dynamic macroeconomics 2 TRUE            TRUE             
4 Communication            FALSE           FALSE            
5 Economic instituions     FALSE           FALSE
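The double backslash is an R string convention, not part of the regex itself. A short sketch of what the regex engine actually receives; it assumes R 4.0.0 or later for the raw-string syntax r"(...)", and the variable name has_digit is only illustrative.

```r
library(stringr)

# Inside a regular R string, "\\" is one literal backslash,
# so "\\d" is the two-character regex \d.
writeLines("\\d")
nchar("\\d")

# Raw strings (R >= 4.0.0) avoid the doubling.
identical("\\d", r"(\d)")

has_digit <- str_detect(c("Dynamic macroeconomics 2", "Communication"), r"(\d)")
has_digit
# [1]  TRUE FALSE
```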

\\s is for whitespace (space, newline, tab)

tibble(courses) %>% 
  mutate(
    contain_spaces = str_detect(courses, "\\s")
  )

# A tibble: 5 × 2
  courses                  contain_spaces
  <chr>                    <lgl>         
1 Big data                 TRUE          
2 Behavioral economics     TRUE          
3 Dynamic macroeconomics 2 TRUE          
4 Communication            FALSE         
5 Economic instituions     TRUE

\\w is for word characters: letters, digits and the underscore (every course name contains at least one)

tibble(courses) %>% 
  mutate(
    contain_letter = str_detect(courses, "\\w")
  )

# A tibble: 5 × 2
  courses                  contain_letter
  <chr>                    <lgl>         
1 Big data                 TRUE          
2 Behavioral economics     TRUE          
3 Dynamic macroeconomics 2 TRUE          
4 Communication            TRUE          
5 Economic instituions     TRUE

Tip

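Each of the regex shorthands presented above has an opposite: the same code in upper case. For instance, \\W matches non-word characters, i.e. anything that is not a letter, digit or underscore (such as whitespace or punctuation). Digits count as word characters, so \\W does not match them.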

tibble(courses) %>% 
  mutate(
    contain_non_letter = str_detect(courses, "\\W")
  )

# A tibble: 5 × 2
  courses                  contain_non_letter
  <chr>                    <lgl>             
1 Big data                 TRUE              
2 Behavioral economics     TRUE              
3 Dynamic macroeconomics 2 TRUE              
4 Communication            FALSE             
5 Economic instituions     TRUE
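Because digits count as word characters, \\W does not pick up the “2” in the third course name. The sketch below contrasts \\W with a negated character class that matches everything except ASCII letters; the class [^A-Za-z] is an illustrative choice and would treat accented letters as non-letters.

```r
library(stringr)

courses <- c("Big data", "Behavioral economics", "Dynamic macroeconomics 2",
             "Communication", "Economic instituions")

# \W skips digits, so only the spaces are counted.
n_non_word <- str_count(courses, "\\W")

# A negated class counts every character that is not an ASCII letter,
# so the digit in "Dynamic macroeconomics 2" is included.
n_not_ascii_letter <- str_count(courses, "[^A-Za-z]")

n_non_word
# [1] 1 1 2 0 1
n_not_ascii_letter
# [1] 1 1 3 0 1
```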

There are several other functions in the {stringr} package. We will cover a few in the following examples.

tibble(courses) %>% 
  mutate(
    n_non_letter = str_count(courses, "\\W"),
    n_character = str_length(courses)
  )

# A tibble: 5 × 3
  courses                  n_non_letter n_character
  <chr>                           <int>       <int>
1 Big data                            1           8
2 Behavioral economics                1          20
3 Dynamic macroeconomics 2            2          24
4 Communication                       0          13
5 Economic instituions                1          20

Extract date from url

https://economaniablog.hu/2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/

x <- "https://economaniablog.hu/2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/"

str_extract(x, "20\\d\\d/\\d\\d/\\d\\d")

[1] "2022/09/14"

An alternative solution:

str_extract(x, "[\\d/-]{3,}") %>% # a run of 3 or more digits, / or -
  str_remove("[/-]$") %>% # drop a / or - at the end
  str_remove("^[/-]") # drop a / or - at the beginning

[1] "2022/09/14"
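The repeated \\d can also be written with a quantifier, and the extracted text can be turned into a proper date. A minimal sketch, assuming the URL always contains a four-digit year followed by a two-digit month and day; the names date_chr and post_date are illustrative.

```r
library(stringr)

x <- "https://economaniablog.hu/2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/"

# {4} and {2} repeat the preceding \d exactly that many times.
date_chr <- str_extract(x, "\\d{4}/\\d{2}/\\d{2}")
date_chr
# [1] "2022/09/14"

# Convert the extracted text to a Date for further work.
post_date <- as.Date(date_chr, format = "%Y/%m/%d")
format(post_date, "%Y-%m-%d")
# [1] "2022-09-14"
```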

Caution

Anyone who wants to work with web scraping or text analysis tools will need to be comfortable with the {stringr} functions!

Is it a website?

str_starts(x, "https://")

[1] TRUE

Extract the base URL, assuming that its length is always the same

str_sub(x, end = 26)

[1] "https://economaniablog.hu/"
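Note that str_sub(x, end = 26) returns the first 26 characters, that is, the base URL itself. To drop the base URL instead, one can start from character 27, or use a pattern that does not depend on a fixed length. A sketch under the assumption that the address starts with https:// followed by the host name; path_fixed and path_regex are illustrative names.

```r
library(stringr)

x <- "https://economaniablog.hu/2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/"

# Keep everything from character 27 onward (the base URL is 26 characters long).
path_fixed <- str_sub(x, start = 27)

# Alternative without the fixed-length assumption: remove the scheme
# and the host name up to and including the first single slash.
path_regex <- str_remove(x, "^https://[^/]+/")

path_fixed
# [1] "2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/"
identical(path_fixed, path_regex)
# [1] TRUE
```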

Remove everything up to and including the date

str_replace(x, ".*20\\d\\d/\\d\\d/\\d\\d/", "")

[1] "how-to-forecast-the-business-cycle-sentiment-speaks/"

Here the . matches any single character (except a newline), and * means that the preceding element is repeated zero or more times. Thus .* before the pattern matches anything before the pattern, and .* after the pattern matches anything after it.
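The mirror case, with .* placed after the pattern, can be sketched as follows. It assumes the date is the first path segment starting with “/20”; base_url is an illustrative name.

```r
library(stringr)

x <- "https://economaniablog.hu/2022/09/14/how-to-forecast-the-business-cycle-sentiment-speaks/"

# "/20\\d\\d" finds the start of the date; ".*" then consumes the rest of the string.
base_url <- str_remove(x, "/20\\d\\d.*")
base_url
# [1] "https://economaniablog.hu"

# str_remove(x, p) is shorthand for str_replace(x, p, "").
identical(base_url, str_replace(x, "/20\\d\\d.*", ""))
# [1] TRUE
```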