For both academic and performance reasons, what is the best way to make this recursive web-crawling function (which crawls only within the given domain) run iteratively? As it stands, by the time it finishes, Python's memory usage has climbed past 1 GB, which is unacceptable in a shared hosting environment.
# Imports this method relies on (urllib2 and BeautifulSoup 3):
import urllib2
from BeautifulSoup import BeautifulSoup

def crawl(self, url):
    "Get all URLs from which to scrape categories."
    try:
        links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
    except urllib2.HTTPError:
        return
    for link in links:
        for attr in link.attrs:
            if Crawler._match_attr(attr):
                if Crawler._is_category(attr):
                    pass
                elif attr[1] not in self._crawled:
                    self._crawled.append(attr[1])
                    self.crawl(attr[1])
Mehrdad Afsh.. 12
Use BFS instead of recursive crawling (DFS): http://en.wikipedia.org/wiki/Breadth_first_search

You can back the BFS queue with an external storage solution (a database, for example) to free up RAM.

The algorithm is:
// Pseudocode:
var urlsToVisit = new Queue();  // Could be a queue (BFS) or a stack (DFS),
                                // probably with a database backing or similar.
var visitedUrls = new Set();    // Set of visited URLs.

// Initialization:
urlsToVisit.Add(rootUrl);

while (urlsToVisit.Count > 0) {
    var nextUrl = urlsToVisit.FetchAndRemoveNextUrl();
    var page = FetchPage(nextUrl);
    ProcessPage(page);
    visitedUrls.Add(nextUrl);

    var links = ParseLinks(page);
    foreach (var link in links)
        if (!visitedUrls.Contains(link))
            urlsToVisit.Add(link);
}
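Here is a minimal Python sketch of that loop applied to the crawl method from the question, assuming the same Python 2 setup (urllib2, BeautifulSoup 3) and the question's own Crawler._match_tag, _match_attr, and _is_category helpers; collections.deque stands in for the database-backed queue the answer mentions:

from collections import deque

import urllib2
from BeautifulSoup import BeautifulSoup

def crawl(self, root_url):
    "Iterative BFS replacement for the recursive crawl above."
    urls_to_visit = deque([root_url])  # FIFO queue gives BFS; pop from the right instead for DFS.
    visited = set()                    # Set gives O(1) lookups, vs. O(n) for the question's list.
    while urls_to_visit:
        url = urls_to_visit.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            links = BeautifulSoup(urllib2.urlopen(url)).findAll(Crawler._match_tag)
        except urllib2.HTTPError:
            continue
        for link in links:
            for attr in link.attrs:
                # Same filtering logic as the original: skip categories,
                # enqueue unseen URLs instead of recursing into them.
                if (Crawler._match_attr(attr)
                        and not Crawler._is_category(attr)
                        and attr[1] not in visited):
                    urls_to_visit.append(attr[1])

Besides eliminating the recursion, replacing the list-based self._crawled with a set fixes the linear-time membership test the original pays on every link.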
Ber.. 5
Instead of recursing, you can push newly scraped URLs onto a queue, then loop until the queue is empty, with no recursion at all. If you keep the queue in a file, this uses almost no memory.
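A rough sketch of that file-backed queue idea: crawl_with_file_queue, queue_path, and extract_links are illustrative names rather than anything from the answer, and the visited set here still lives in RAM (it could likewise be moved to disk if needed).

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

def extract_links(url):
    "Illustrative helper: pull candidate href values out of a page."
    try:
        soup = BeautifulSoup(urllib2.urlopen(url))
    except urllib2.HTTPError:
        return []
    return [attr[1] for tag in soup.findAll('a')
            for attr in tag.attrs if attr[0] == 'href']

def crawl_with_file_queue(root_url, queue_path='frontier.txt'):
    "Iterative crawl whose pending-URL queue lives on disk, not in RAM."
    visited = set()  # Still in memory, but far smaller than stacked recursion frames.
    with open(queue_path, 'w') as seed:
        seed.write(root_url + '\n')
    reader = open(queue_path)
    writer = open(queue_path, 'a')
    line = reader.readline()
    while line:
        url = line.strip()
        if url not in visited:
            visited.add(url)
            for link in extract_links(url):
                if link not in visited:
                    writer.write(link + '\n')
            writer.flush()  # New URLs must reach the file before the reader catches up.
        line = reader.readline()
    reader.close()
    writer.close()

Because every URL produced by a processed page is flushed before the reader advances, readline() only returns an empty string once the frontier is genuinely exhausted.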