Plenty of people have been scraping data from the web with R for a while now, but I just completed my first project and wanted to share the code with you. It was a little hard to work through some of the “issues”, but I had some great help from @DataJunkie on Twitter.
As an aside, if you are learning R and coming from another package like SPSS or SAS, I highly advise that you follow the hashtag #rstats on Twitter to be amazed by the kinds of data analysis that are going on right now.
One note: when I read in my table, it contained a weird set of characters. I suspect it is some sort of encoding issue, but luckily I was able to get around it by using the stringr package and some basic regular expressions to strip the stray characters and convert the columns from character factors to numbers.
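To make that cleanup concrete, here is a minimal sketch in base R that runs on made-up strings rather than the scraped page. The `~~` prefix is only a stand-in for whatever stray characters actually show up, and `extract_number`, `raw` and `raw_names` are hypothetical names used for illustration. It uses `regexpr()` and `regmatches()` instead of stringr, so it is an alternative to the approach in the script, not a copy of it.

```r
# Made-up cells: two stray leading characters, then the value we want.
# "~~" is a placeholder for the unknown characters, not what the page contains.
raw <- c("~~87.5", "~~-3", "~~212", "~~N/A")

# Return the first optionally signed number found in each string,
# or NA when a string contains no number at all.
extract_number <- function(x) {
  x <- as.character(x)                      # also handles factor input
  m <- regexpr("-?[0-9]+(\\.[0-9]+)?", x)   # -1 where nothing matched
  out <- rep(NA_real_, length(x))
  out[m != -1] <- as.numeric(regmatches(x, m))
  out
}

cleaned <- extract_number(raw)

# For a text column, dropping the first two characters is enough.
raw_names <- c("~~Player A", "~~Player B")
clean_names <- substring(raw_names, 3)
```

Note the doubled backslash in the pattern: inside an R string literal, a regex escape such as `\.` has to be written as `\\.`.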
Bring on fantasy football!
################################################################
## Help from the following sources:
## @DataJunkie on twitter
## http://www.regular-expressions.info/reference.html
## http://stackoverflow.com/questions/1395528/scraping-html-tables-into-r-data-frames-using-the-xml-package
## http://stackoverflow.com/questions/2443127/how-can-i-use-r-rcurl-xml-packages-to-scrape-this-webpage
################################################################
library(XML)
library(stringr)
# build the URL
url <- paste("http://sports.yahoo.com/nfl/stats/byposition?pos=QB",
             "&conference=NFL&year=season_2009",
             "&timeframe=Week1", sep="")
# read every table on the page and print the one that has the most rows
tables <- readHTMLTable(url)
n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
tables[[which.max(n.rows)]]
# select the table we need (number 7 on this page) - read as a data frame
my.table <- tables[[7]]
# delete extra columns and keep data rows
View(head(my.table, n=20))
my.table <- my.table[3:nrow(my.table), c(1:3, 5:12, 14:18, 20:21, 23:24) ]
# rename every column
c.names <- c("Name", "Team", "G", "QBRat", "P_Comp", "P_Att", "P_Yds", "P_YpA", "P_Lng", "P_Int", "P_TD", "R_Att",
"R_Yds", "R_YpA", "R_Lng", "R_TD", "S_Sack", "S_SackYa", "F_Fum", "F_FumL")
names(my.table) <- c.names
# data get read in with weird symbols - need to remove - initially stored as character factors
# for the loops, I am manually telling the code which cleanup to use - assumes constant behavior
# depending on where the weird characters are -- is this an encoding problem?
front <- c(1)
back <- c(4:ncol(my.table))
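# name column: drop the first two (stray) characters
for(f in front) {
  test.front <- as.character(my.table[, f])
  tt.front <- str_sub(test.front, start=3)
  my.table[, f] <- tt.front
}
# stat columns: keep only the number and convert to numeric
# (backslashes are doubled because the pattern is an R string)
for(b in back) {
  test <- as.character(my.table[, b])
  tt.back <- as.numeric(str_match(test, "\\-*\\d{1,3}[\\.]*[0-9]*"))
  my.table[, b] <- tt.back
}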
str(my.table)
View(my.table)
# clear memory and quit R without saving the workspace
rm(list=ls())
q(save = "no")
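The script above prints the table with the most rows and then selects the table by a fixed position. As a hedged alternative, the sketch below picks the largest table programmatically. It runs on a made-up list standing in for the result of `readHTMLTable()`, and the names `row_count`, `n_rows`, `biggest` and `largest_table` are hypothetical. It also assumes the list may contain a `NULL` entry: in that case `unlist(lapply(...))` drops the `NULL`, so the position returned by `which.max()` no longer lines up with the original list.

```r
# Made-up stand-in for the list of tables read from a page.
tables <- list(
  nav    = data.frame(x = 1:2),
  empty  = NULL,                              # a table that could not be parsed
  stats  = data.frame(x = 1:30, y = 31:60),
  footer = data.frame(x = 1:3)
)

# Count rows, treating NULL entries as zero rows so that positions
# in n_rows match positions in the tables list.
row_count <- function(t) if (is.null(t)) 0L else nrow(t)
n_rows <- vapply(tables, row_count, integer(1))

# Position (and name) of the table with the most rows.
biggest <- which.max(n_rows)
largest_table <- tables[[biggest]]
```

Selecting by row count only helps when the table you want really is the largest one on the page, so it is still worth looking at the result with `head()` before trusting it.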
Source: http://www.r-bloggers.com/scrape-web-data-using-r/