Functional programming

Readings and class materials for Tuesday, September 26, 2023

“{purrr} enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.”

Source: Package description

The purpose of functional programming, as the package description puts it, is to implement iterations (and recursions) in our code in a readable manner. It will be just as big an advantage for R programming as the dplyr package is for tabular data.

The approach is very similar to what we saw with the apply family: there is an input object, and we apply the specified function to each of its elements. This is exactly what the lapply function we encountered earlier does.
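As a quick base-R refresher (a toy example, not related to the files below): lapply() takes a vector or list and returns a list with the function applied to each element.

```r
# Toy example: apply sqrt() to each element; the result is always a list
lapply(c(1, 4, 9), sqrt)  # returns list(1, 2, 3)
```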

Illustrative example - load files

Let's suppose we have multiple .csv files in our working directory. These files were generated with the app Publish or Perish and contain Google Scholar search results for different keywords.

The files can be downloaded from here. Copy the zipped files into your working directory. You can do this without any manual step:

t <- tempfile()
t
[1] "/var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/filea917dfe53d2"
download.file("https://codeload.github.com/gist/4ed653c8e655d4ebabaa3071fc7b50a0/zip/5aef770bce91b661744ad8c8b1aed56643f795cf", destfile = t)
td <- tempdir()
td
[1] "/var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi"
unzip(t, junkpaths = TRUE, exdir = td)
file_names <- list.files(path = td, pattern = "csv", full.names = TRUE)
file_names
[1] "/var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/daily-inflation-online.csv"         
[2] "/var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/inflation-expectations-forecast.csv"

The advantage of this solution is that the files are placed in a temporary folder and are deleted when the R session is closed.
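The same mechanism in miniature (a toy sketch): tempfile() only reserves a path under the session temp directory; the file exists once we write to it, and everything there is cleaned up when the session ends.

```r
t <- tempfile(fileext = ".csv")  # reserve a path under the session temp dir
writeLines("a,b\n1,2", t)        # now the file actually exists
file.exists(t)                   # TRUE until the session ends (or unlink(t))
```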

Let us recall how we constructed the lapply function call: the input here is the vector of two file names, and the function to apply is read_csv.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.3     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.2
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
lapply(file_names, read_csv)
[[1]]
# A tibble: 998 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1  3295 GP Compo, JS W… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2   885 E Nakamura, J … High…  2018 The Q… academic… https://a… https:/…    364
 3  2708 E Castronova    Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4   660 RC Cornes, G v… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6  5219 GW Schwert      Why …  1989 The j… Wiley On… https://o… https:/…    560
 7  1701 PR Hansen, A L… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8  2573 LJ Christiano,… The …  2003 inter… Wiley On… https://o… https:/…    586
 9  4813 F Black         Noise  1986 The j… Wiley On… https://o… https:/…    333
10   220 C Binder        Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 988 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

[[2]]
# A tibble: 980 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1 30964 RF Engle        Auto…  1982 Econo… JSTOR     https://w… https:/…    872
 2  2592 GW Evans, S Ho… Lear…  2012 Learn… degruyte… https://w… https:/…    683
 3   250 O Coibion, Y G… Mone…  2022 Journ… journals… https://w… https:/…     22
 4   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    847
 5   324 P Bordalo, N G… Over…  2020 Ameri… aeaweb.o… https://w… https:/…    518
 6   958 U Malmendier, … Lear…  2016 The Q… academic… https://a… https:/…    114
 7   951 O Coibion, Y G… Info…  2015 Ameri… aeaweb.o… https://w… https:/…    444
 8   247 O Coibion, Y G… Infl…  2020 Journ… Elsevier  https://w… https:/…     28
 9  2823 LEO Svensson    Infl…  1997 Europ… Elsevier  https://w… https:/…     38
10   110 AM Dietrich, K… News…  2022 Journ… Elsevier  https://w… https:/…    667
# ℹ 970 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

The outcome is a list with 2 elements.

The `purrr` package provides a similar solution, but it comes with its own advantages.

map(file_names, read_csv)
[[1]]
# A tibble: 998 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1  3295 GP Compo, JS W… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2   885 E Nakamura, J … High…  2018 The Q… academic… https://a… https:/…    364
 3  2708 E Castronova    Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4   660 RC Cornes, G v… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6  5219 GW Schwert      Why …  1989 The j… Wiley On… https://o… https:/…    560
 7  1701 PR Hansen, A L… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8  2573 LJ Christiano,… The …  2003 inter… Wiley On… https://o… https:/…    586
 9  4813 F Black         Noise  1986 The j… Wiley On… https://o… https:/…    333
10   220 C Binder        Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 988 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

[[2]]
# A tibble: 980 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1 30964 RF Engle        Auto…  1982 Econo… JSTOR     https://w… https:/…    872
 2  2592 GW Evans, S Ho… Lear…  2012 Learn… degruyte… https://w… https:/…    683
 3   250 O Coibion, Y G… Mone…  2022 Journ… journals… https://w… https:/…     22
 4   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    847
 5   324 P Bordalo, N G… Over…  2020 Ameri… aeaweb.o… https://w… https:/…    518
 6   958 U Malmendier, … Lear…  2016 The Q… academic… https://a… https:/…    114
 7   951 O Coibion, Y G… Info…  2015 Ameri… aeaweb.o… https://w… https:/…    444
 8   247 O Coibion, Y G… Infl…  2020 Journ… Elsevier  https://w… https:/…     28
 9  2823 LEO Svensson    Infl…  1997 Europ… Elsevier  https://w… https:/…     38
10   110 AM Dietrich, K… News…  2022 Journ… Elsevier  https://w… https:/…    667
# ℹ 970 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

The result is identical so far 🤷‍♂️. The difference lies in the fact that while lapply's simplifying counterpart is sapply, which only works for vector outputs, the map function family lets you explicitly specify the desired output type. In this particular case, our output consists of two tables with identical column names, so it would be desirable to obtain a single row-bound table 👊🏻
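A toy sketch of the typed variants first (values invented for illustration): the suffix declares the output type, and purrr throws an error if the function returns anything else.

```r
library(purrr)

map(1:3, ~ .x^2)                  # a list of 3 elements
map_dbl(1:3, ~ .x^2)              # a double vector: 1 4 9
map_chr(1:3, ~ paste0("id", .x))  # a character vector: "id1" "id2" "id3"
# map_dbl(1:3, ~ letters[.x])     # would error: the output is not a double
```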

map_dfr(file_names, read_csv)
# A tibble: 1,978 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1  3295 GP Compo, JS W… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2   885 E Nakamura, J … High…  2018 The Q… academic… https://a… https:/…    364
 3  2708 E Castronova    Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4   660 RC Cornes, G v… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6  5219 GW Schwert      Why …  1989 The j… Wiley On… https://o… https:/…    560
 7  1701 PR Hansen, A L… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8  2573 LJ Christiano,… The …  2003 inter… Wiley On… https://o… https:/…    586
 9  4813 F Black         Noise  1986 The j… Wiley On… https://o… https:/…    333
10   220 C Binder        Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 1,968 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

One emerging issue is that we cannot tell which observation originates from which file. The map_* functions provide a solution: the `.id` argument specifies the name of the column that stores the id (1, 2, …) of the input element. If the input is a named list or vector, the names are used instead.

map_dfr(file_names, read_csv, .id = "keyword")
# A tibble: 1,978 × 27
   keyword Cites Authors Title  Year Source Publisher ArticleURL CitesURL GSRank
   <chr>   <dbl> <chr>   <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1 1        3295 GP Com… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2 1         885 E Naka… High…  2018 The Q… academic… https://a… https:/…    364
 3 1        2708 E Cast… Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4 1         660 RC Cor… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5 1         164 MC Med… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6 1        5219 GW Sch… Why …  1989 The j… Wiley On… https://o… https:/…    560
 7 1        1701 PR Han… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8 1        2573 LJ Chr… The …  2003 inter… Wiley On… https://o… https:/…    586
 9 1        4813 F Black Noise  1986 The j… Wiley On… https://o… https:/…    333
10 1         220 C Bind… Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 1,968 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>

Since the outcome is a tibble, we can simply modify this column to replace the id with the corresponding file name 📁.

map_dfr(file_names, read_csv, .id = "keyword") %>% 
  mutate(
    keyword = file_names[as.numeric(keyword)],
    keyword = str_remove(keyword, ".*/"), # remove the path
    keyword = str_remove(keyword, ".csv") # remove extension
  )
# A tibble: 1,978 × 27
   keyword Cites Authors Title  Year Source Publisher ArticleURL CitesURL GSRank
   <chr>   <dbl> <chr>   <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1 daily-…  3295 GP Com… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2 daily-…   885 E Naka… High…  2018 The Q… academic… https://a… https:/…    364
 3 daily-…  2708 E Cast… Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4 daily-…   660 RC Cor… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5 daily-…   164 MC Med… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6 daily-…  5219 GW Sch… Why …  1989 The j… Wiley On… https://o… https:/…    560
 7 daily-…  1701 PR Han… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8 daily-…  2573 LJ Chr… The …  2003 inter… Wiley On… https://o… https:/…    586
 9 daily-…  4813 F Black Noise  1986 The j… Wiley On… https://o… https:/…    333
10 daily-…   220 C Bind… Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 1,968 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>
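An alternative sketch for the renaming step (assuming the same file_names vector): name the inputs up front with set_names(), so `.id` picks up the names directly instead of 1, 2, …. Here basename() and tools::file_path_sans_ext() stand in for the two str_remove() calls.

```r
library(tidyverse)

file_names %>%
  set_names(tools::file_path_sans_ext(basename(.))) %>%  # names = bare file names
  map_dfr(read_csv, .id = "keyword")                     # .id now uses the names
```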

Alternative solution:

Let’s make a tibble from the file_names

tibble(file_names)
# A tibble: 2 × 1
  file_names                                                                    
  <chr>                                                                         
1 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/daily-inflation-…
2 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/inflation-expect…

The primary advantage of the map function is that we can simply use it within the tidy framework we saw last week:

tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv)
  )
# A tibble: 2 × 2
  file_names                                                          data      
  <chr>                                                               <list>    
1 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/daily… <spc_tbl_>
2 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/infla… <spc_tbl_>

Yes, these are tibbles within a tibble (madness like “dream within a dream”) 🥴. The advantage of combining tibble and map lies in the fact that, unlike base R data.frames, tibbles can contain a list as a column, enabling the inclusion of any data type within the table. For instance, the file name (the keyword) can be one column, all observations can be in the second column, and, say, the average citation count in the third.
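A minimal sketch of a list-column (toy data): any R object can sit in a cell, as long as the column itself is a list.

```r
library(tibble)

tb <- tibble(
  id    = 1:2,
  stuff = list(letters[1:3], mtcars)  # a character vector and a whole data.frame
)
tb$stuff[[2]]  # pulls out mtcars again
```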

tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, ".csv") 
  ) %>% 
  pull(data) %>% # data column as a vector (list)
  pluck(1) # the first element
# A tibble: 998 × 26
   Cites Authors         Title  Year Source Publisher ArticleURL CitesURL GSRank
   <dbl> <chr>           <chr> <dbl> <chr>  <chr>     <chr>      <chr>     <dbl>
 1  3295 GP Compo, JS W… The …  2011 Quart… Wiley On… https://r… https:/…    875
 2   885 E Nakamura, J … High…  2018 The Q… academic… https://a… https:/…    364
 3  2708 E Castronova    Synt…  2008 Synth… degruyte… https://w… https:/…    498
 4   660 RC Cornes, G v… An e…  2018 Journ… Wiley On… https://a… https:/…    261
 5   164 MC Medeiros, G… Fore…  2021 Journ… Taylor &… https://w… https:/…    122
 6  5219 GW Schwert      Why …  1989 The j… Wiley On… https://o… https:/…    560
 7  1701 PR Hansen, A L… The …  2011 Econo… Wiley On… https://o… https:/…    565
 8  2573 LJ Christiano,… The …  2003 inter… Wiley On… https://o… https:/…    586
 9  4813 F Black         Noise  1986 The j… Wiley On… https://o… https:/…    333
10   220 C Binder        Coro…  2020 Revie… direct.m… https://d… https:/…    504
# ℹ 988 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
#   CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>
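pluck() (used above to grab the first list element) also digs into nested structures by name or position; a toy sketch:

```r
library(purrr)

x <- list(a = list(b = 1:3), c = "hi")
pluck(x, "a", "b")  # 1 2 3 -- same as x[["a"]][["b"]]
pluck(x, 1, 1, 2)   # 2    -- second element of x[[1]][[1]]
```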
tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, ".csv"),
    avg_cite = map(data, ~ mean(.$Cites, na.rm = TRUE))
  )
# A tibble: 2 × 3
  file_names                      data                  avg_cite 
  <chr>                           <list>                <list>   
1 daily-inflation-online          <spc_tbl_ [998 × 26]> <dbl [1]>
2 inflation-expectations-forecast <spc_tbl_ [980 × 26]> <dbl [1]>

Almost… The 3rd column is now a list, although we know that it contains only one number. We can explicitly force a dbl vector to be returned by using map_dbl.

tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, ".csv"),
    avg_cite = map_dbl(data, ~ mean(.$Cites, na.rm = TRUE)) #<
  )
# A tibble: 2 × 3
  file_names                      data                  avg_cite
  <chr>                           <list>                   <dbl>
1 daily-inflation-online          <spc_tbl_ [998 × 26]>     111.
2 inflation-expectations-forecast <spc_tbl_ [980 × 26]>     165.
But why do we use map?

If you refer to a column of the tibble inside a dplyr verb, the function takes the whole column as a vector by default. For instance, length applied to the data column returns 2, the number of elements in the list. We have to use map to evaluate the function on the elements of the column one by one; map_dbl(data, length) returns 26 for each row, since the length of a data frame is its number of columns.

tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, ".csv"),
    l = length(data),
    l2 = map_dbl(data, length)
  )
# A tibble: 2 × 4
  file_names                      data                      l    l2
  <chr>                           <list>                <int> <dbl>
1 daily-inflation-online          <spc_tbl_ [998 × 26]>     2    26
2 inflation-expectations-forecast <spc_tbl_ [980 × 26]>     2    26
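The same contrast on toy data: inside mutate(), length() sees the whole list-column at once, while map_int() visits each element.

```r
library(tidyverse)

tibble(x = list(1:3, 1:5)) %>%
  mutate(
    whole = length(x),         # length of the column itself: 2 in both rows
    each  = map_int(x, length) # length of each element: 3 and 5
  )
```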

Nested tibbles

The functionality seen above, storing a list as a column of a tibble, is great, but what if we need the whole tables as one data frame? Well, we can simply unnest the column.

tibble(file_names) %>% 
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, ".csv"),
    avg_cite = map_dbl(data, ~ mean(.$Cites, na.rm = TRUE)) #<
  ) %>% 
  unnest(data) 
# A tibble: 1,978 × 28
   file_names     Cites Authors Title  Year Source Publisher ArticleURL CitesURL
   <chr>          <dbl> <chr>   <chr> <dbl> <chr>  <chr>     <chr>      <chr>   
 1 daily-inflati…  3295 GP Com… The …  2011 Quart… Wiley On… https://r… https:/…
 2 daily-inflati…   885 E Naka… High…  2018 The Q… academic… https://a… https:/…
 3 daily-inflati…  2708 E Cast… Synt…  2008 Synth… degruyte… https://w… https:/…
 4 daily-inflati…   660 RC Cor… An e…  2018 Journ… Wiley On… https://a… https:/…
 5 daily-inflati…   164 MC Med… Fore…  2021 Journ… Taylor &… https://w… https:/…
 6 daily-inflati…  5219 GW Sch… Why …  1989 The j… Wiley On… https://o… https:/…
 7 daily-inflati…  1701 PR Han… The …  2011 Econo… Wiley On… https://o… https:/…
 8 daily-inflati…  2573 LJ Chr… The …  2003 inter… Wiley On… https://o… https:/…
 9 daily-inflati…  4813 F Black Noise  1986 The j… Wiley On… https://o… https:/…
10 daily-inflati…   220 C Bind… Coro…  2020 Revie… direct.m… https://d… https:/…
# ℹ 1,968 more rows
# ℹ 19 more variables: GSRank <dbl>, QueryDate <dttm>, Type <chr>, DOI <chr>,
#   ISSN <lgl>, CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
#   EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
#   AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
#   RelatedURL <chr>, avg_cite <dbl>
Note

You may notice that the originally unnested columns are repeated for each corresponding observation.

We can simply use the nest function if we want to achieve the opposite.

map_dfr(file_names, read_csv, .id = "keyword") %>% 
  mutate(
    keyword = file_names[as.numeric(keyword)],
    keyword = str_remove(keyword, ".*/"), # remove the path
    keyword = str_remove(keyword, ".csv") # remove extension
  ) %>%
  nest(
    data = - keyword, # everything except "keyword" to the "data" column
    .by = keyword
    )
# A tibble: 2 × 2
  keyword                         data               
  <chr>                           <list>             
1 daily-inflation-online          <tibble [998 × 26]>
2 inflation-expectations-forecast <tibble [980 × 26]>
map_dfr(file_names, read_csv, .id = "keyword") %>% 
  mutate(
    keyword = file_names[as.numeric(keyword)],
    keyword = str_remove(keyword, ".*/"), # remove the path
    keyword = str_remove(keyword, ".csv") # remove extension
  ) %>%
  group_by(keyword) %>%
  nest()
# A tibble: 2 × 2
# Groups:   keyword [2]
  keyword                         data               
  <chr>                           <list>             
1 daily-inflation-online          <tibble [998 × 26]>
2 inflation-expectations-forecast <tibble [980 × 26]>
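On toy data the two operations are inverses of each other (a sketch):

```r
library(tidyverse)

df     <- tibble(g = c("a", "a", "b"), v = 1:3)
nested <- nest(df, data = -g)  # 2 rows: one tibble per value of g
unnest(nested, data)           # back to the original 3 rows
```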

Exercise 1

Let's open the URLs of the 5 most cited articles in each of the 2 topics that are newer than 10 years and whose abstract mentions the US.

Tip

The walk function works similarly to map, but it returns its input invisibly instead of a new value. It is useful when you want to generate side effects (like opening something in your browser with browseURL).
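A toy sketch: walk() calls the function for its side effect only and passes its input through, so it slots into a pipe without changing the data.

```r
library(purrr)

out <- walk(1:3, ~ message("processing item ", .x))  # prints three messages
identical(out, 1:3)  # TRUE: the input passes through untouched
```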

Solution:

map_dfr(file_names, read_csv, .id = "keyword") %>% 
  mutate(
    keyword = file_names[as.numeric(keyword)],
    keyword = str_remove(keyword, ".*/"), # remove the path
    keyword = str_remove(keyword, ".csv") # remove extension
  ) %>%
  filter(Year >= lubridate::year(Sys.Date()) - 10) %>% 
  filter(str_detect(Abstract, " US | USA")) %>% 
  slice_max(Cites, n = 5, by = keyword) %>% 
  arrange(Cites) %>% 
  pull(ArticleURL) %>% 
  walk(browseURL)

Exercise 2

Let us create a simulation to determine the optimal investment ratio (\(f\)) given a probability, (\(p\)), of doubling our invested money and a probability of \(1-p\) of losing it. We will play this game for a total of 200 rounds. What should be the value of \(f\), given a specific value of \(p\), in order to achieve maximum return?

library(tidyverse)

coins <- function(n = 1000, p = .55) {
  # n - 1 coin outcomes (the initial wealth enters via .init in ret()):
  # multiplier 2 with probability p, 0 with probability 1 - p
  ifelse(runif(n - 1) > (1 - p), 2, 0)
}

ret <- function(p = .5, f = .5, n = 200, m = 10000) {
  accumulate(
    .x = coins(n, p),
    .f = ~ {
      keep = floor(.x * (1 - f))        # .x: current wealth; fraction 1 - f is kept
      play = .x - keep                  # fraction f is invested this round
      earned = floor(keep + play * .y)  # .y: the coin outcome (2 or 0)
      earned
    },
    .init = m # initial wealth
  )
}
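accumulate() is the engine here: unlike reduce(), it keeps every intermediate value, which is exactly what a round-by-round wealth path needs. A toy sketch:

```r
library(purrr)

accumulate(1:4, `+`)             # running sum: 1 3 6 10
accumulate(1:3, `*`, .init = 10) # 10 10 20 60 -- .init is the starting value,
                                 # so 3 inputs yield 4 outputs (as in ret())
```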
params_df <- crossing(
  p = seq(from = .5, to = .6, length.out = 3),
  f = seq(from = 0, to = .7, length.out = 10)
)

params_df
# A tibble: 30 × 2
       p      f
   <dbl>  <dbl>
 1   0.5 0     
 2   0.5 0.0778
 3   0.5 0.156 
 4   0.5 0.233 
 5   0.5 0.311 
 6   0.5 0.389 
 7   0.5 0.467 
 8   0.5 0.544 
 9   0.5 0.622 
10   0.5 0.7   
# ℹ 20 more rows
results_df <- params_df %>% 
  mutate(
    r = map2(p, f, ~ {
      replicate(10000, last(ret(p = .x, f = .y)), simplify = TRUE)
    }, .progress = TRUE), # .progress: show progress line
    avg_r = map_dbl(r, mean)
  )
results_df
# A tibble: 30 × 4
       p      f r                  avg_r
   <dbl>  <dbl> <list>             <dbl>
 1   0.5 0      <dbl [10,000]> 10000    
 2   0.5 0.0778 <dbl [10,000]> 10483.   
 3   0.5 0.156  <dbl [10,000]>  9619.   
 4   0.5 0.233  <dbl [10,000]> 15987.   
 5   0.5 0.311  <dbl [10,000]>  2677.   
 6   0.5 0.389  <dbl [10,000]>   517.   
 7   0.5 0.467  <dbl [10,000]>  1707.   
 8   0.5 0.544  <dbl [10,000]>     0.319
 9   0.5 0.622  <dbl [10,000]>     0    
10   0.5 0.7    <dbl [10,000]>     0    
# ℹ 20 more rows
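map2() above walks over p and f in parallel; a toy sketch of the same pattern, with typed output:

```r
library(purrr)

map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)  # 11 22 33
map2_chr(c("a", "b"), 1:2, paste0)              # "a1" "b2"
```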
library(gt)

results_df %>%
  select(-r) %>%
  mutate(p = format(p, digits = 2)) %>%
  pivot_wider(names_from = p, values_from = avg_r) %>%
  gt() %>%
  fmt_number(-f, decimals = 0) %>%
  fmt_number(f, decimals = 2) %>%
  tab_spanner("p", -1) %>% 
  data_color(2) %>%
  data_color(3) %>%
  data_color(4)
    f    p = 0.50    p = 0.55        p = 0.60
 0.00      10,000      10,000          10,000
 0.08      10,067      46,030         212,650
 0.16      10,287     210,835       4,184,787
 0.23       6,046     919,002      87,429,495
 0.31       2,259   1,936,808   1,425,716,604
 0.39         610   1,594,350   5,149,530,456
 0.47          51     354,241  19,601,120,279
 0.54           0      55,296  34,492,225,580
 0.62           0         370     803,305,667
 0.70           0           0         309,720