I'm using Scrapy 1.0.3 and can't figure out how to use the CLOSESPIDER extension. The command scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=1 correctly results in one request, but with a page count of two, scrapy crawl domain_links --set CLOSESPIDER_PAGECOUNT=2, the requests seem to go on endlessly.
So please explain with a simple example how this setting works.
Here is my spider code:
class DomainLinksSpider(CrawlSpider):
    name = "domain_links"
    #allowed_domains = ["www.example.org"]
    start_urls = ["www.example.org/",]

    rules = (
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow_domains="www.example.org"), callback='parse_page'),
    )

    def parse_page(self, response):
        print '<<<', response.url
        items = []
        item = PathsSpiderItem()
        selected_links = response.selector.xpath('//a[@href]')
        for link in LinkExtractor(allow_domains="www.example.org", unique=True).extract_links(response):
            item = PathsSpiderItem()
            item['url'] = link.url
            items.append(item)
        return items
It doesn't work even for this simple spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow_domains="www.karen.pl"), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        return item
Though the request counts are not actually infinite:
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=1  ->  'downloader/request_count': 1,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=2  ->  'downloader/request_count': 17,
scrapy crawl example --set CLOSESPIDER_PAGECOUNT=3  ->  'downloader/request_count': 19,
Maybe it is because of parallel downloads. Yes: with CONCURRENT_REQUESTS=1, the CLOSESPIDER_PAGECOUNT setting works as expected for the second example. I checked the first one too, and it also works. It only looked almost infinite to me because the site has a sitemap with a lot of URLs (my items), and the crawler kept fetching the next pages :)
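If you prefer to bake that workaround into the spider instead of passing --set on every run, a minimal sketch along the lines of the karen.pl example above could carry both settings via custom_settings (available since Scrapy 1.0); the class name and callback here are illustrative, not part of the original code:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LimitedSpider(CrawlSpider):
    name = 'limited_example'
    allowed_domains = ['karen.pl']
    start_urls = ['http://www.karen.pl']

    # Spider-level settings, so the limit travels with the spider.
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 2,   # stop after roughly this many responses
        'CONCURRENT_REQUESTS': 1,     # no parallel downloads, so the count stays exact
    }

    rules = (
        Rule(LinkExtractor(allow_domains='www.karen.pl'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Crawled %s', response.url)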
CLOSESPIDER_PAGECOUNT is controlled by the CloseSpider extension, which counts every response until it reaches the configured limit; at that point it tells the crawler process to begin shutting down (finishing the pending requests and closing the available slots).
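To make that counting behaviour concrete, here is a simplified sketch of the mechanism described above; it is not the actual Scrapy source, just the same idea written as a standalone extension (it would have to be listed in the EXTENSIONS setting to run):

from scrapy import signals
from scrapy.exceptions import NotConfigured


class PageCountCloser(object):
    """Counts downloaded responses and asks the engine to close the spider."""

    def __init__(self, crawler):
        self.crawler = crawler
        self.limit = crawler.settings.getint('CLOSESPIDER_PAGECOUNT')
        if not self.limit:
            raise NotConfigured
        self.counter = 0
        # Count every response the downloader hands back.
        crawler.signals.connect(self.page_count, signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def page_count(self, response, request, spider):
        self.counter += 1
        if self.counter == self.limit:
            # Start the shutdown; requests already scheduled may still be
            # downloaded while the engine winds down.
            self.crawler.engine.close_spider(spider, 'closespider_pagecount')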
Now, the reason your spider finishes when you specify CLOSESPIDER_PAGECOUNT=1 is that at that moment (when it gets its first response) there are no pending requests; they are only created after that first one. So the crawler process is free to finish without taking the follow-up requests into account, because they are born only after the first response has been counted.
When you specify CLOSESPIDER_PAGECOUNT>1, the spider is caught in the middle of creating requests and filling the request queue. By the time it knows it should finish, there are still pending requests, and those are executed as part of closing the spider, which is why the stats show more requests than the page count you set.
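As a side note, one way to confirm that it was this extension (and not a normal end of the crawl) that stopped the run is to look at the finish_reason entry in the stats dumped at the end of the log; the numbers below are illustrative, not taken from the runs above:

'downloader/request_count': 17,
'finish_reason': 'closespider_pagecount',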