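As I understand it, Scrapy is single-threaded but asynchronous on the network side. I'm working on something that needs to make API calls to an external resource from within an item pipeline. Is there a way to make the HTTP request without blocking the pipeline and slowing down Scrapy's crawl?
Thanks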
You can do this by scheduling the request directly on the engine with crawler.engine.crawl(request, spider). To do that, though, you first need to expose the crawler in your pipeline:
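import scrapy
from scrapy.exceptions import DropItem


class MyPipeline(object):

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook to build the pipeline and passes in the crawler
        return cls(crawler)

    def process_item(self, item, spider):
        # Second pass: the extra field is already filled in, so let the item through.
        # .get() avoids a KeyError on the first pass, when the field is still unset.
        if item.get('some_extra_field'):
            return item
        url = 'some_url'
        # Carry the item along in meta so the callback can complete it
        req = scrapy.Request(url, self.parse_item, meta={'item': item})
        self.crawler.engine.crawl(req, spider)
        # Drop this copy; the completed item comes back through the pipeline later
        raise DropItem()

    def parse_item(self, response):
        item = response.meta['item']
        item['some_extra_field'] = '...'
        return item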
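The pipeline only runs if it is enabled in the project settings. A minimal sketch of that step, assuming the class lives in a module named myproject.pipelines (a hypothetical path, not something given in the answer):

```python
# settings.py (sketch): "myproject.pipelines" is a hypothetical module path;
# point it at wherever MyPipeline is actually defined.
ITEM_PIPELINES = {
    "myproject.pipelines.MyPipeline": 300,
}
```

The integer sets the order among enabled pipelines: lower values run first, and values are customarily chosen in the 0-1000 range. The value 300 here is arbitrary.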