R thing of the day: scrape file links and download each file
Last updated: Mar 9, 2022
I created a data package in R to accompany what I believe is an absolutely wonderful biostatistics textbook, Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models.1 The package is available over here on GitHub. When I first started going over the material from the book and converting its analyses from Stata to R, I was unable to access the datasets, so I made do with copies from someone’s public GitHub repo. However, yesterday I found that they are available here. Since I frequently review the book, I figured I may as well turn the data into an R package 📦.
Below is the code I put together to scrape the file links from the book’s site. I then used a tryCatch statement to safely download each .dta file. The tryCatch statement proved useful, as downloading one of the datasets (whickham.dta) threw a 404 error.
library(rvest)
library(crayon)

# Scrape every link from the book's data page
site <- read_html("https://regression.ucsf.edu/second-edition/data-examples-and-problems")

links <- site %>%
  html_elements("#node-1 a") %>%
  html_attr("href")

# Keep only the links ending with `dta` and build the full URLs
dta_files <- grep("dta$", links, value = TRUE)
dta_links <- paste0("https://regression.ucsf.edu", dta_files)

# Download each file; tryCatch keeps the loop going when a download fails
for (file in dta_links) {
  tryCatch(
    {
      cat(green("Downloading " %+% basename(file) %+% "\n"))
      download.file(file, destfile = here::here("data", basename(file)))
    },
    error = function(e) {
      message(paste("Error occurred trying to download", basename(file)))
    }
  )
}
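With the .dta files downloaded, turning them into package data is mostly a matter of reading each one into R and saving it. The snippet below is a minimal sketch of that step rather than part of the script above; it assumes the haven and usethis packages, that it runs inside the package project, and the file and object names (hersdata.dta, hers) are placeholders.

library(haven)

# Read one of the downloaded Stata files into a data frame
# (the file name here is only illustrative)
hers <- read_dta(here::here("data", "hersdata.dta"))

# Save the object as data/hers.rda so it ships with the package
usethis::use_data(hers, overwrite = TRUE)

Once the package is installed, the dataset can then be loaded with data(hers).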
-
1. Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE. Regression Methods in Biostatistics. Springer US; 2012. doi:10.1007/978-1-4614-1353-0