Readings and class materials for Tuesday, September 26, 2023
“{purrr} enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.”
The purpose of functional programming, as the package description puts it, is to express iteration in our code in a readable manner. It is as big an advantage of programming in R as the dplyr package is for tabular data.
The approach is very similar to what we have seen with the apply family: there is an input object, and we apply the specified function to each of its elements. This is what the lapply function, which we encountered earlier, does.
Illustrative example - load files
Let's suppose we have multiple .csv files in our working directory. These files were generated with the app Publish or Perish and contain Google Scholar search results for different keywords.
The files can be downloaded from here. Copy the zipped files into your working directory. You can do this without any manual step:
The advantage of this solution is that the files are placed in a temporary folder, which is deleted when the R session closes.
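A minimal sketch of how such a download step could look, using only base R. The helper name `fetch_csv_files` and its argument `zip_url` are hypothetical; the real link to the zipped files has to be supplied by the reader.

```r
# Hypothetical helper: download a zip archive into a temporary location,
# extract it, and return the full paths of the .csv files it contains.
fetch_csv_files <- function(zip_url) {
  zip_path <- tempfile(fileext = ".zip")
  download.file(zip_url, destfile = zip_path, mode = "wb")

  out_dir <- tempfile("csv_files_")   # a fresh folder inside tempdir()
  dir.create(out_dir)
  unzip(zip_path, exdir = out_dir)

  list.files(out_dir, pattern = "\\.csv$", full.names = TRUE, recursive = TRUE)
}

# Usage (replace the placeholder with the actual link):
# file_names <- fetch_csv_files("<link to the zipped files>")
```

Because everything lives under `tempdir()`, nothing has to be cleaned up by hand after the session.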
Let us recall how we constructed the lapply function call. The input here would be the two file names, and the function to be performed would be the read_csv function.
[[1]]
# A tibble: 998 × 26
Cites Authors Title Year Source Publisher ArticleURL CitesURL GSRank
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 3295 GP Compo, JS W… The … 2011 Quart… Wiley On… https://r… https:/… 875
2 885 E Nakamura, J … High… 2018 The Q… academic… https://a… https:/… 364
3 2708 E Castronova Synt… 2008 Synth… degruyte… https://w… https:/… 498
4 660 RC Cornes, G v… An e… 2018 Journ… Wiley On… https://a… https:/… 261
5 164 MC Medeiros, G… Fore… 2021 Journ… Taylor &… https://w… https:/… 122
6 5219 GW Schwert Why … 1989 The j… Wiley On… https://o… https:/… 560
7 1701 PR Hansen, A L… The … 2011 Econo… Wiley On… https://o… https:/… 565
8 2573 LJ Christiano,… The … 2003 inter… Wiley On… https://o… https:/… 586
9 4813 F Black Noise 1986 The j… Wiley On… https://o… https:/… 333
10 220 C Binder Coro… 2020 Revie… direct.m… https://d… https:/… 504
# ℹ 988 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
# CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
# EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
# AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
# RelatedURL <chr>
[[2]]
# A tibble: 980 × 26
Cites Authors Title Year Source Publisher ArticleURL CitesURL GSRank
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 30964 RF Engle Auto… 1982 Econo… JSTOR https://w… https:/… 872
2 2592 GW Evans, S Ho… Lear… 2012 Learn… degruyte… https://w… https:/… 683
3 250 O Coibion, Y G… Mone… 2022 Journ… journals… https://w… https:/… 22
4 164 MC Medeiros, G… Fore… 2021 Journ… Taylor &… https://w… https:/… 847
5 324 P Bordalo, N G… Over… 2020 Ameri… aeaweb.o… https://w… https:/… 518
6 958 U Malmendier, … Lear… 2016 The Q… academic… https://a… https:/… 114
7 951 O Coibion, Y G… Info… 2015 Ameri… aeaweb.o… https://w… https:/… 444
8 247 O Coibion, Y G… Infl… 2020 Journ… Elsevier https://w… https:/… 28
9 2823 LEO Svensson Infl… 1997 Europ… Elsevier https://w… https:/… 38
10 110 AM Dietrich, K… News… 2022 Journ… Elsevier https://w… https:/… 667
# ℹ 970 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
# CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
# EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
# AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
# RelatedURL <chr>
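To try the two calls without the course files, here is a small sketch on stand-in data. The two tiny .csv files and their names (`topic-a.csv`, `topic-b.csv`) are made up for illustration; only the pattern of the calls matters.

```r
library(purrr)
library(readr)

# Stand-in data: two tiny .csv files in a fresh temporary folder
tmp_dir <- tempfile("pop_demo_")
dir.create(tmp_dir)
write_csv(data.frame(Cites = c(10, 5), Year = c(2011, 2018)),
          file.path(tmp_dir, "topic-a.csv"))
write_csv(data.frame(Cites = c(7, 3, 1), Year = c(2020, 2015, 1999)),
          file.path(tmp_dir, "topic-b.csv"))
file_names <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Base R and purrr versions of "read every file"
with_lapply <- lapply(file_names, read_csv, show_col_types = FALSE)
with_map    <- map(file_names, read_csv, show_col_types = FALSE)

# Both are plain lists with one table per file
length(with_lapply)
length(with_map)
identical(lapply(with_lapply, dim), lapply(with_map, dim))
```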
The result is identical so far 🤷‍♂️. The difference is that sapply, the simplifying variant of lapply, guesses the output type and only simplifies to vectors or matrices, whereas the functions of the map family let you state the desired output explicitly. In this particular case, our output consists of two tables with identical column names. It would be desirable to obtain a single row-bound table 👊🏻
# A tibble: 1,978 × 26
Cites Authors Title Year Source Publisher ArticleURL CitesURL GSRank
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 3295 GP Compo, JS W… The … 2011 Quart… Wiley On… https://r… https:/… 875
2 885 E Nakamura, J … High… 2018 The Q… academic… https://a… https:/… 364
3 2708 E Castronova Synt… 2008 Synth… degruyte… https://w… https:/… 498
4 660 RC Cornes, G v… An e… 2018 Journ… Wiley On… https://a… https:/… 261
5 164 MC Medeiros, G… Fore… 2021 Journ… Taylor &… https://w… https:/… 122
6 5219 GW Schwert Why … 1989 The j… Wiley On… https://o… https:/… 560
7 1701 PR Hansen, A L… The … 2011 Econo… Wiley On… https://o… https:/… 565
8 2573 LJ Christiano,… The … 2003 inter… Wiley On… https://o… https:/… 586
9 4813 F Black Noise 1986 The j… Wiley On… https://o… https:/… 333
10 220 C Binder Coro… 2020 Revie… direct.m… https://d… https:/… 504
# ℹ 1,968 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
# CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
# EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
# AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
# RelatedURL <chr>
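The row-binding step can be sketched on stand-in data as follows. The files are made up; `map_dfr` is the function used in these notes, while `list_rbind` is shown as an alternative under the assumption that purrr 1.0.0 or later is installed (`map_dfr` also needs dplyr to be installed).

```r
library(purrr)
library(readr)

# Stand-in data: two tiny .csv files with identical column names
tmp_dir <- tempfile("pop_demo_")
dir.create(tmp_dir)
write_csv(data.frame(Cites = c(10, 5), Year = c(2011, 2018)),
          file.path(tmp_dir, "topic-a.csv"))
write_csv(data.frame(Cites = c(7, 3, 1), Year = c(2020, 2015, 1999)),
          file.path(tmp_dir, "topic-b.csv"))
file_names <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# One call: read every file and bind the tables by rows
bound <- map_dfr(file_names, read_csv, show_col_types = FALSE)

# Two explicit steps (assumes purrr >= 1.0.0)
bound_alt <- map(file_names, read_csv, show_col_types = FALSE) %>%
  list_rbind()

nrow(bound)      # rows of both files together
nrow(bound_alt)
```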
One emerging issue is that we are unable to determine which observation originates from which file. map_dfr provides a solution: its `.id` argument takes the name of a column that stores the position (1, 2, …) of the file each row comes from. If the input is a named list or vector, the names are stored in that column instead.
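A short sketch of the `.id` behavior with a named input, again on made-up stand-in files. Naming the vector of paths with the bare file names is an illustrative choice, not necessarily what was done in class.

```r
library(purrr)
library(readr)

# Stand-in data: two tiny .csv files
tmp_dir <- tempfile("pop_demo_")
dir.create(tmp_dir)
write_csv(data.frame(Cites = c(10, 5)), file.path(tmp_dir, "topic-a.csv"))
write_csv(data.frame(Cites = c(7, 3, 1)), file.path(tmp_dir, "topic-b.csv"))
file_names <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Unnamed input: the id column holds the positions "1", "2" (as character)
by_position <- map_dfr(file_names, read_csv, show_col_types = FALSE, .id = "keyword")

# Named input: the id column holds the names
named_files <- set_names(file_names, tools::file_path_sans_ext(basename(file_names)))
by_name <- map_dfr(named_files, read_csv, show_col_types = FALSE, .id = "keyword")

unique(by_position$keyword)
unique(by_name$keyword)
```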
# A tibble: 2 × 2
file_names data
<chr> <list>
1 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/daily… <spc_tbl_>
2 /var/folders/9f/4hrqlxmn4c3f6mk9hgwqjxmh0000gn/T//RtmpJCrjhi/infla… <spc_tbl_>
Yes, these are tibbles within a tibble (madness, like a “dream within a dream”) 🥴. The advantage of combining tibble and map is that a tibble column can be a list, so any kind of object can be stored in the table (base R data.frames allow list columns too, but they are awkward to create and print). For instance, the file name (the keywords) can be one column, all observations can be in the second column, and, let’s say, the average citation count in the third.
tibble(file_names) %>%
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, "\\.csv$") # remove the extension
  ) %>%
  pull(data) %>% # data column as a vector (list)
  pluck(1) # the first element
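To make the idea of a list-column concrete, here is a small in-memory sketch. The column names `id` and `anything` and the values are made up; the point is only that one column can hold objects of different types.

```r
library(tibble)
library(dplyr)
library(purrr)

mixed <- tibble(
  id = 1:3,
  anything = list(
    1:3,                 # an integer vector
    "some text",         # a character string
    tibble(x = 1)        # a whole table
  )
)

# map_chr() visits the list-column element by element
mixed <- mixed %>%
  mutate(what = map_chr(anything, ~ class(.x)[1]))

mixed$what
```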
# A tibble: 998 × 26
Cites Authors Title Year Source Publisher ArticleURL CitesURL GSRank
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 3295 GP Compo, JS W… The … 2011 Quart… Wiley On… https://r… https:/… 875
2 885 E Nakamura, J … High… 2018 The Q… academic… https://a… https:/… 364
3 2708 E Castronova Synt… 2008 Synth… degruyte… https://w… https:/… 498
4 660 RC Cornes, G v… An e… 2018 Journ… Wiley On… https://a… https:/… 261
5 164 MC Medeiros, G… Fore… 2021 Journ… Taylor &… https://w… https:/… 122
6 5219 GW Schwert Why … 1989 The j… Wiley On… https://o… https:/… 560
7 1701 PR Hansen, A L… The … 2011 Econo… Wiley On… https://o… https:/… 565
8 2573 LJ Christiano,… The … 2003 inter… Wiley On… https://o… https:/… 586
9 4813 F Black Noise 1986 The j… Wiley On… https://o… https:/… 333
10 220 C Binder Coro… 2020 Revie… direct.m… https://d… https:/… 504
# ℹ 988 more rows
# ℹ 17 more variables: QueryDate <dttm>, Type <chr>, DOI <chr>, ISSN <lgl>,
# CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
# EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
# AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
# RelatedURL <chr>
tibble(file_names) %>%
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, "\\.csv$"), # remove the extension
    avg_cite = map(data, ~ mean(.$Cites, na.rm = TRUE))
  )
If you refer to a column of the tibble inside a dplyr verb, the function receives the whole column as a single vector (here, a list). For instance, length(data) returns 2, the number of elements in the list-column. To evaluate a function on the elements of a column one by one, we have to use map. Note that the length of a data frame is its number of columns, hence the 26 in the l2 column.
tibble(file_names) %>%
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, "\\.csv$"), # remove the extension
    l = length(data), # length of the whole list-column
    l2 = map_dbl(data, length) # length of each table (number of columns)
  )
# A tibble: 2 × 4
file_names data l l2
<chr> <list> <int> <dbl>
1 daily-inflation-online <spc_tbl_ [998 × 26]> 2 26
2 inflation-expectations-forecast <spc_tbl_ [980 × 26]> 2 26
Nested tibbles
Being able to store a list as a column of a tibble is great, but what if we need all the tables as one data frame? Well, we can simply unnest the column.
tibble(file_names) %>%
  mutate(
    data = map(file_names, read_csv),
    file_names = str_remove(file_names, ".*/"), # remove the path
    file_names = str_remove(file_names, "\\.csv$"), # remove the extension
    avg_cite = map_dbl(data, ~ mean(.$Cites, na.rm = TRUE))
  ) %>%
  unnest(data)
# A tibble: 1,978 × 28
file_names Cites Authors Title Year Source Publisher ArticleURL CitesURL
<chr> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
1 daily-inflati… 3295 GP Com… The … 2011 Quart… Wiley On… https://r… https:/…
2 daily-inflati… 885 E Naka… High… 2018 The Q… academic… https://a… https:/…
3 daily-inflati… 2708 E Cast… Synt… 2008 Synth… degruyte… https://w… https:/…
4 daily-inflati… 660 RC Cor… An e… 2018 Journ… Wiley On… https://a… https:/…
5 daily-inflati… 164 MC Med… Fore… 2021 Journ… Taylor &… https://w… https:/…
6 daily-inflati… 5219 GW Sch… Why … 1989 The j… Wiley On… https://o… https:/…
7 daily-inflati… 1701 PR Han… The … 2011 Econo… Wiley On… https://o… https:/…
8 daily-inflati… 2573 LJ Chr… The … 2003 inter… Wiley On… https://o… https:/…
9 daily-inflati… 4813 F Black Noise 1986 The j… Wiley On… https://o… https:/…
10 daily-inflati… 220 C Bind… Coro… 2020 Revie… direct.m… https://d… https:/…
# ℹ 1,968 more rows
# ℹ 19 more variables: GSRank <dbl>, QueryDate <dttm>, Type <chr>, DOI <chr>,
# ISSN <lgl>, CitationURL <lgl>, Volume <lgl>, Issue <lgl>, StartPage <lgl>,
# EndPage <lgl>, ECC <dbl>, CitesPerYear <dbl>, CitesPerAuthor <dbl>,
# AuthorCount <dbl>, Age <dbl>, Abstract <chr>, FullTextURL <chr>,
# RelatedURL <chr>, avg_cite <dbl>
Note
Notice that the values of the columns that were not nested (here file_names and avg_cite) are repeated for each corresponding observation.
We can simply use the nest function if we want to achieve the opposite.
map_dfr(file_names, read_csv, .id = "keyword") %>%
  mutate(
    keyword = file_names[as.numeric(keyword)],
    keyword = str_remove(keyword, ".*/"), # remove the path
    keyword = str_remove(keyword, ".csv") # remove extension
  ) %>%
  nest(
    data = -keyword, # everything except "keyword" to the "data" column
    .by = keyword
  )
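That nest and unnest are inverse operations can be checked on a tiny in-memory table. This is a sketch with made-up values; the column names `keyword` and `Cites` mirror the ones used in these notes.

```r
library(tibble)
library(tidyr)

flat <- tibble(
  keyword = c("a", "a", "b", "b", "b"),
  Cites   = c(10, 5, 7, 3, 1)
)

# One row per keyword, the remaining columns packed into a list-column
nested <- flat %>% nest(data = -keyword)

# Unnesting repeats the keyword for every observation again
round_trip <- nested %>% unnest(data)

nrow(nested)
nrow(round_trip)
```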
Let's open the URLs of the 5 most cited articles in each of the 2 topics, restricted to articles that are less than 10 years old and whose abstract is about the US.
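One possible way to select these articles is sketched here on made-up stand-in data. The column names (`Cites`, `Year`, `Abstract`, `ArticleURL`) follow the printed tables, but the values, the `current_year` cutoff, and matching the word "US" in the abstract are assumptions for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Stand-in data (values are invented)
articles <- tibble(
  keyword    = c("topic-a", "topic-a", "topic-a", "topic-b", "topic-b"),
  Cites      = c(120, 80, 300, 45, 60),
  Year       = c(2019, 2021, 1995, 2020, 2016),
  Abstract   = c("Evidence from the US", "A European panel", "US data",
                 "Survey of US households", "US firms"),
  ArticleURL = paste0("https://example.org/article-", 1:5)
)

current_year <- 2023 # assumed reference year for "newer than 10 years"

urls <- articles %>%
  filter(
    Year > current_year - 10,            # less than 10 years old
    str_detect(Abstract, "\\bUS\\b")     # abstract mentions the US
  ) %>%
  group_by(keyword) %>%
  slice_max(Cites, n = 5, with_ties = FALSE) %>% # top 5 per topic
  pull(ArticleURL)

urls
```

The resulting character vector can then be handed to browseURL element by element.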
Tip
The walk function works like map, but it is called for its side effects (like opening something in your browser with browseURL) rather than for its result: it only returns its input, invisibly.
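A small sketch of walk that runs without a browser: the side effect here is appending a line to a temporary log file. The URLs are made up, and the log file is only a stand-in for browseURL.

```r
library(purrr)

urls <- c("https://example.org/a", "https://example.org/b") # made-up URLs
log_file <- tempfile(fileext = ".txt")

# Side effect: write each URL on its own line of the log file
returned <- walk(urls, ~ cat(.x, "\n", sep = "", file = log_file, append = TRUE))

readLines(log_file)        # one line per element
identical(returned, urls)  # walk() hands its input back unchanged

# In an interactive session the same pattern opens the pages:
# walk(urls, browseURL)
```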
Let us create a simulation to determine the optimal investment ratio \(f\), the share of our current wealth that we invest in each round, given a probability \(p\) of doubling the invested money and a probability \(1-p\) of losing it. We will play this game for a total of 200 rounds. What should be the value of \(f\), given a specific value of \(p\), in order to achieve the maximum return?
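A minimal sketch of such a simulation with map_dbl. Several choices here are assumptions, not part of the task description: the function name `simulate_log_wealth`, starting wealth of 1, \(p = 0.6\), the grid of \(f\) values, the number of simulated paths, and using the average log of final wealth as the measure of return. Since the final wealth depends only on how many rounds are won, not on their order, each path is drawn with a single rbinom call.

```r
library(purrr)

# Average log of final wealth after `rounds` rounds when investing the
# share `f` each time: wealth is multiplied by (1 + f) on a win and by
# (1 - f) on a loss. The fixed seed reuses the same draws for every f.
simulate_log_wealth <- function(f, p, rounds = 200, n_paths = 2000, seed = 1) {
  set.seed(seed)
  wins <- rbinom(n_paths, size = rounds, prob = p)
  mean(wins * log(1 + f) + (rounds - wins) * log(1 - f))
}

f_grid <- seq(0, 0.95, by = 0.05)   # f = 1 is left out: one loss ruins us
growth <- map_dbl(f_grid, ~ simulate_log_wealth(.x, p = 0.6))

best_f <- f_grid[which.max(growth)]
best_f
```

For comparison, maximizing the expected log growth per round, \(p \log(1+f) + (1-p)\log(1-f)\), gives \(f = 2p - 1\), so with \(p = 0.6\) the best grid value should lie close to 0.2.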