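crawler4j respects crawler politeness conventions such as robots.txt. In your case, the relevant file is the robots.txt of www.ratemyprofessors.com.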
Inspecting this file shows that crawling your given seed URLs is disallowed:
Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
The crawler4j log output supports this theory:
2015-12-15 19:47:18,791 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
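To see why both seeds are rejected, here is a minimal sketch of how a Disallow rule is matched against a URL. It is not crawler4j's actual implementation: the class `RobotsCheck` and the method `isDisallowed` are hypothetical names made up for illustration, and the check is simplified to plain prefix matching on the URL path, ignoring User-agent groups, Allow rules and wildcards.

```java
import java.net.URI;

public class RobotsCheck {
    // Disallow prefixes copied from the robots.txt excerpt quoted in this answer.
    static final String[] DISALLOWED = {
        "/ShowRatings.jsp",
        "/campusRatings.jsp"
    };

    // Hypothetical helper: returns true if the URL's path starts with
    // any of the Disallow prefixes (simplified prefix matching only).
    public static boolean isDisallowed(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            path = "";
        }
        for (String prefix : DISALLOWED) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two seed URLs taken from the log output.
        String[] seeds = {
            "http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222",
            "http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044"
        };
        for (String seed : seeds) {
            String verdict = isDisallowed(seed) ? "disallowed" : "allowed";
            System.out.println(seed + " -> " + verdict);
        }
    }
}
```

Both seed paths begin with one of the disallowed prefixes; the query string (`?sid=...`, `?tid=...`) does not change that, which is consistent with the warnings in the log.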