
Can Javascript read the source of any web page?


I'm doing screen scraping and want to retrieve the source code of a particular page.

How can I achieve this with JavaScript? Please help me.



1> Cherian..:

A simple way to start: try jQuery

$("#links").load("/Main_Page #jq-p-Getting-Started li");

More in the jQuery docs

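Another way to do screen scraping in a much more structured way is to use YQL, the Yahoo Query Language. It returns the scraped data structured as JSON or XML.
For example,
let's scrape stackoverflow.com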

select * from html where url="http://stackoverflow.com"

will give you a JSON array (I chose that option) like this

 "results": {
   "body": {
    "noscript": [
     {
      "div": {
       "id": "noscript-padding"
      }
     },
     {
      "div": {
       "id": "noscript-warning",
       "p": "Stack Overflow works best with JavaScript enabled"
      }
     }
    ],
    "div": [
     {
      "id": "notify-container"
     },
     {
      "div": [
       {
        "id": "header",
        "div": [
         {
          "id": "hlogo",
          "a": {
           "href": "/",
           "img": {
            "alt": "logo homepage",
            "height": "70",
            "src": "http://i.stackoverflow.com/Content/Img/stackoverflow-logo-250.png",
            "width": "250"
           }
……..

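The beauty of this is that you can use projections and where clauses, so you end up with the scraped data in structured form, and only the data you need (which means much less bandwidth over the wire),
e.g.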

select * from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

will get you

 "results": {
   "a": [
    {
     "href": "/questions/414690/iphone-simulator-port-for-windows-closed",
     "title": "Duplicate: Is any Windows simulator available to test iPhone application? as a hobbyist who cannot afford a mac, i set up a toolchain kit locally on cygwin to compile objecti … ",
     "content": "iphone\n                simulator port for windows [closed]"
    },
    {
     "href": "/questions/680867/how-to-redirect-the-web-page-in-flex-application",
     "title": "I have a button control ....i need another web page to be redirected while clicking that button .... how to do that ? Thanks ",
     "content": "How\n                to redirect the web page in flex application ?"
    },
…..

Now, to get only the question titles, we do

select title from html where url="http://stackoverflow.com" and
      xpath='//div/h3/a'

Note the title in the projection

 "results": {
   "a": [
    {
     "title": "I don't want the function to be entered simultaneously by multiple threads, neither do I want it to be entered again when it has not returned yet. Is there any approach to achieve … "
    },
    {
     "title": "I'm certain I'm doing something really obviously stupid, but I've been trying to figure it out for a few hours now and nothing is jumping out at me. I'm using a ModelForm so I can … "
    },
    {
     "title": "when i am going through my project in IE only its showing errors A runtime error has occurred Do you wish to debug? Line 768 Error:Expected')' Is this is regarding any script er … "
    },
    {
     "title": "I have a java batch file consisting of 4 execution steps written for analyzing any Java application. In one of the steps, I'm adding few libs in classpath that are needed for my co … "
    },
    {
……

Once you write your query, it generates a URL for you

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D'%2F%2Fdiv%2Fh3%2Fa'%0A%20%20%20%20&format=json&callback=cbfunc

in our case.
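As a rough sketch of how such a URL could be assembled by hand rather than copied from the YQL console: the endpoint and the `format`/`callback` parameters are taken from the URL above, while the helper names `buildHtmlQuery` and `buildYqlUrl` are made up for illustration. The result differs from the console-generated URL only in whitespace (no encoded line breaks).

```javascript
// Endpoint as it appears in the generated URL above.
const YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: builds the YQL statement against the html table.
// fields is the projection ("*" or e.g. "title"); xpath is optional.
function buildHtmlQuery(pageUrl, xpath, fields = "*") {
  let query = `select ${fields} from html where url="${pageUrl}"`;
  if (xpath) {
    query += ` and xpath='${xpath}'`;
  }
  return query;
}

// Hypothetical helper: URL-encodes the statement and appends the
// format and callback parameters seen in the URL above.
function buildYqlUrl(query, callbackName = "cbfunc") {
  return `${YQL_ENDPOINT}?q=${encodeURIComponent(query)}` +
    `&format=json&callback=${encodeURIComponent(callbackName)}`;
}

const query = buildHtmlQuery("http://stackoverflow.com", "//div/h3/a", "title");
const theAboveUrl = buildYqlUrl(query);
console.log(theAboveUrl);
```

`encodeURIComponent` leaves the single quotes around the XPath untouched, which matches the generated URL.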

So in the end you'd wind up doing something like this

var titleList = $.getJSON(theAboveUrl);

and play with it.
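Since `$.getJSON` is asynchronous and hands the parsed response to a callback, "playing with it" in practice means working inside that callback. Here is a minimal sketch of pulling the titles out, assuming the response has the `results.a` shape shown in the excerpts above (the real envelope may wrap it in further levels). `extractTitles` and the sample titles are made up for illustration.

```javascript
// Hypothetical helper: collects the title strings from a response shaped
// like the excerpts above, i.e. { results: { a: [ { title: ... }, ... ] } }.
function extractTitles(response) {
  const anchors = response && response.results && response.results.a;
  if (!anchors) {
    return [];
  }
  // Tolerate a single match arriving as an object rather than an array.
  const list = Array.isArray(anchors) ? anchors : [anchors];
  return list
    .map((anchor) => anchor.title)
    .filter((title) => typeof title === "string");
}

// Placeholder data standing in for a real response.
const sampleResponse = {
  results: {
    a: [
      { title: "First question title" },
      { title: "Second question title" },
    ],
  },
};

const titleList = extractTitles(sampleResponse);
console.log(titleList);

// In a page with jQuery loaded, the helper would be called from the callback:
// $.getJSON(theAboveUrl, function (data) { console.log(extractTitles(data)); });
```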

Beautiful, isn't it?


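Awesome, especially the hint at Yahoo's poor man's solution that needs no proxy to fetch the data. Thank you!! I took the liberty of fixing the last demo link to query.yahooapis.com: it was missing a % sign in the URL encoding. Cool that this still works!!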

2> karim79..:

Javascript can be used, as long as you fetch whatever page you're after through a proxy on your own domain.

Why do you need a domain-based proxy?
Because of the same-origin policy.
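A request through a same-domain proxy, as this answer describes, might be sketched like this. The `/proxy` path and its `url` parameter are hypothetical, and the server-side script that actually fetches the remote page is not shown. `fetchPageSource` uses the standard `fetch` API, with the implementation passed in as a parameter so it can be swapped out.

```javascript
// Hypothetical endpoint on your own domain that fetches the remote page
// server-side and returns its HTML.
const PROXY_PATH = "/proxy";

// Builds the same-origin URL the browser is allowed to request.
function buildProxyUrl(targetUrl) {
  return `${PROXY_PATH}?url=${encodeURIComponent(targetUrl)}`;
}

// Requests the page source through the proxy and resolves with the HTML text.
// fetchImpl defaults to the global fetch; it is a parameter only so that a
// stand-in can be supplied.
async function fetchPageSource(targetUrl, fetchImpl = fetch) {
  const response = await fetchImpl(buildProxyUrl(targetUrl));
  if (!response.ok) {
    throw new Error(`Proxy request failed with status ${response.status}`);
  }
  return response.text();
}

console.log(buildProxyUrl("http://www.google.com"));
// prints: /proxy?url=http%3A%2F%2Fwww.google.com
```

From the browser's point of view the request never leaves your domain, so the same-origin policy does not apply to it.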

3> Cerebrus..:

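You can simply use XmlHttp (AJAX) to hit the required URL, and the HTML response from that URL will be available in the responseText property. If it's not the same domain, your users will get a browser alert saying something like "This page is trying to access a different domain. Do you want to allow this?"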


Unfortunately, you won't get any alert; the request will simply be blocked.
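Whether a request counts as "the same domain" comes down to comparing origins, meaning scheme, host, and port together. A small sketch of that comparison using the standard `URL` class follows. `isSameOrigin` is a made-up helper for illustration, not something the browser exposes, and the example.com addresses are placeholders.

```javascript
// Hypothetical helper: two URLs share an origin only if scheme, host,
// and port all match.
function isSameOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin === new URL(targetUrl).origin;
}

// Same scheme, host, and port: allowed.
console.log(isSameOrigin("http://example.com/a", "http://example.com/b")); // true

// Different scheme, port, or host: treated as cross-origin.
console.log(isSameOrigin("http://example.com/", "https://example.com/")); // false
console.log(isSameOrigin("http://example.com/", "http://example.com:8080/")); // false
console.log(isSameOrigin("http://example.com/", "http://stackoverflow.com/")); // false
```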

4> nickf..:

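As a security measure, Javascript can't read files from different domains. Though there might be some strange workaround for it, I'd consider a different language for this task.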
