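I'm having trouble downloading a PDF.
For example, if I have the DOI of a PDF document on the Archaeology Data Service, it resolves to this landing page, which contains an embedded link to this PDF, but that link actually redirects to this other link.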
`library(httr)` will handle resolving the DOI, and with `library(XML)` we can extract the PDF URL from the landing page, but I'm stuck on getting the PDF itself.
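For context, the extraction step could look roughly like the sketch below. It is only an illustration: the HTML snippet is a made-up stand-in for the landing page (the real markup may differ), and `extract_pdf_links` is a hypothetical helper name. The `dlb3` class is the one targeted by the CSS selector in the answer further down.

```r
library(XML)

# Hypothetical stand-in for the landing page HTML; the real page will differ.
landing_html <- '
<html><body>
  <a class="other" href="/about/Cookies">Cookies</a>
  <a class="dlb3" href="/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf">PDF</a>
</body></html>'

# Hypothetical helper: return the href of every <a> whose class contains "dlb3".
extract_pdf_links <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  hrefs <- xpathSApply(doc, "//a[contains(@class, 'dlb3')]", xmlGetAttr, "href")
  as.character(unlist(hrefs))
}

pdf_links <- extract_pdf_links(landing_html)
print(pdf_links)
```

With a live page, the HTML would come from something like `content(GET(landing_url), "text")`, where `landing_url` is whatever the DOI resolved to.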
If I do this:

    download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I receive an HTML file that is the same as http://archaeologydataservice.ac.uk/myads/
Trying the answer to "How to use R to download a zipped file from an SSL page that requires cookies" led me to this:

    library(httr)

    terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
    download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
    values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

    # Accept the terms on the form,
    # generating the appropriate cookies
    POST(terms, body = values)
    GET(download, query = values)

    # Actually download the file (this will take a while)
    resp <- GET(download, query = values)

    # write the content of the download to a binary file
    writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
But after the POST and GET calls, I just get the HTML of the same cookie page that I got from download.file:

    > GET(download, query = values)
    Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
      Date: 2016-01-06 00:35
      Status: 200
      Content-Type: text/html;charset=UTF-8
      Size: 21 kB

Looking at http://archaeologydataservice.ac.uk/about/Cookies, it seems the cookie situation on this site is complicated. This kind of cookie complexity does not seem unusual for UK data providers: see "Automating the login to the UK Data Service website in R with RCurl or httr".
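One detail in that response URL is worth a look: the `from=` parameter is hex-encoded text. A minimal base-R sketch for decoding it follows; `hex_to_string` is a hypothetical helper name, not part of httr.

```r
# Hypothetical helper: turn a string of hex byte pairs into plain text.
hex_to_string <- function(hex) {
  starts <- seq(1, nchar(hex), by = 2)
  bytes <- strtoi(substring(hex, starts, starts + 1), base = 16L)
  rawToChar(as.raw(bytes))
}

# The value of the `from` parameter copied from the response URL above.
from <- paste0(
  "2f6172636869766544532f61726368697665446f776e6c6f61643f",
  "61677265653d79657326743d617263682d313335322d31253246",
  "64697373656d696e6174696f6e253246706466253246",
  "4479666564253246474c34343030342e706466"
)

decoded <- hex_to_string(from)
print(decoded)
# "/archiveDS/archiveDownload?agree=yes&t=arch-1352-1%2Fdissemination%2Fpdf%2FDyfed%2FGL44004.pdf"
print(utils::URLdecode(decoded))
# "/archiveDS/archiveDownload?agree=yes&t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
```

So the decoded value is simply the download request that was made. This suggests the server bounced the request back to the copyright page and remembered where it came from, which is consistent with the terms not having been accepted for that session.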
How can I use R to get past the cookies on this website?
1> hrbrmstr..:Your plea on rOpenSci has been heard!
There is a lot of JavaScript between those pages, which makes it somewhat annoying to try to decipher via httr + rvest. Try RSelenium.
This worked on OS X 10.11.2, R 3.2.3 and with Firefox loaded.

    library(RSelenium)

    # check if a server is present, if not, get a server
    checkForServer()

    # get the server going
    startServer()

    dir.create("~/justcreateddir")
    setwd("~/justcreateddir")

    # we need PDFs to download instead of display in-browser
    prefs <- makeFirefoxProfile(list(
      `browser.download.folderList` = as.integer(2),
      `browser.download.dir` = getwd(),
      `pdfjs.disabled` = TRUE,
      `plugin.scan.plid.all` = FALSE,
      `plugin.scan.Acrobat` = "99.0",
      `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
    ))

    # get a browser going
    dr <- remoteDriver$new(extraCapabilities=prefs)
    dr$open()

    # go to the page with the PDF
    dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

    # find the PDF link and "hit ENTER"
    pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
    pdf_elem$sendKeysToElement(list("\uE007"))

    # find the ACCEPT button and "hit ENTER"
    # that will save the PDF to the default downloads directory
    accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
    accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to finish. The R console will not be busy while the file downloads, so it is easy to close the session by accident before the download has completed.
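One way to avoid closing the session too early is to poll the download directory before moving on. The following is only a sketch in base R: `wait_for_download` is a hypothetical helper, not part of RSelenium, and it assumes Firefox keeps a temporary `.part` file next to the target while a download is still in progress.

```r
# Hypothetical helper: block until a file matching `pattern` exists in `dir`
# and no Firefox ".part" temporary file remains, or stop after `timeout` seconds.
wait_for_download <- function(dir, pattern = "\\.pdf$", timeout = 60, interval = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    finished <- list.files(dir, pattern = pattern, full.names = TRUE)
    partial <- list.files(dir, pattern = "\\.part$")
    if (length(finished) > 0 && length(partial) == 0) {
      return(finished)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for the download to finish")
    }
    Sys.sleep(interval)
  }
}
```

In the session above this might be called as `wait_for_download(getwd())` right before closing the browser. The timeout and polling interval are arbitrary defaults and would need adjusting for large files.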
    # close the session
    dr$close()
OK, I figured out how to get it working on my machine. I first had to start the Selenium standalone server manually with `java -jar selenium-server-standalone-2.48.0.jar`. Then I could connect.