I have built a function that takes a URL, scrapes the page, and returns the desired result. The function is as follows:
library(httr)
library(curl)
library(rvest)
library(dplyr)

sd_cat <- function(url){
  cat <- curl(url, handle = new_handle("useragent" = "myua")) %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    html_text()
  x <- cat[1]
  #y <- gsub(pattern = "\n", x = x, replacement = " ")
  y <- gsub(pattern = "\t", x = x, replacement = " ")
  y <- gsub("\\d|,|\t", x = y, replacement = "")
  y <- gsub("^ *|(?<= ) | *$", "", y, perl = TRUE)
  z <- gsub("\n*{2,}", "", y)
  z <- gsub(" {2,}", ">", z)
  final <- substring(z, 2)
  final <- substring(final, 1, nchar(final) - 1)
  final
  #sample discontinued url: "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261"
  #sample working url: "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133"
}
This function works fine when used with sapply over a character vector of multiple URLs, but if a single URL is discontinued, the function throws:

Error in open.connection(x, "rb") : HTTP error 404.

I need a way to skip the discontinued URLs so the function keeps working.
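One common way to do this, sketched below under the assumption that sd_cat and the sample URLs above are unchanged (safe_sd_cat is a hypothetical wrapper, not code from the original post), is to wrap each call in tryCatch so that a 404 yields NA instead of stopping the whole sapply run:

# Minimal sketch: a failing URL produces NA instead of aborting sapply.
safe_sd_cat <- function(url) {
  tryCatch(
    sd_cat(url),
    error = function(e) NA_character_  # e.g. "HTTP error 404." for discontinued pages
  )
}

urls <- c(
  "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133",  # working
  "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261"      # discontinued
)
results <- sapply(urls, safe_sd_cat)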
A better solution is to use httr and deliberately act on unsuccessful responses:
library(httr)
library(rvest)

sd_cat <- function(url){
  r <- GET(url, user_agent("myua"))
  # Non-success status (e.g. 404 for a discontinued product) returns NA
  # instead of erroring out.
  if (status_code(r) >= 300) return(NA_character_)
  r %>%
    read_html() %>%
    html_nodes("#breadCrumbWrapper") %>%
    .[[1]] %>%
    html_nodes("span") %>%
    html_text()
}

sd_cat("http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261")
sd_cat("http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133")
(I also replaced your regexes with a better use of rvest.)
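As a usage sketch (the urls vector here is assumed, not part of the original answer), this version can be applied over the whole vector, and discontinued URLs simply come back as NA:

# Assumed usage example: each list element is either the breadcrumb span
# texts for a live page or NA for a discontinued one.
urls <- c(
  "http://www.snapdeal.com//product/davidoff-cool-water-game-100ml/1339014133",
  "http://www.snapdeal.com//product/givenchy-xeryus-rouge-g-edt/1978028261"
)
breadcrumbs <- lapply(urls, sd_cat)

lapply is used instead of sapply because html_text() can return more than one span per page, so keeping each page's result as its own list element avoids unexpected flattening.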