Nutch 是一个开源Java 实现的搜索引擎。它提供了我们运行自己的搜索引擎所需的全部工具。包括全文搜索和Web爬虫。当然在百度百科上这种方法在Nutch1.2之后,已经不再适合这样描述Nutch了,因为在1.2版本之后,Nutch专注的只是爬取数据,而全文检索的部分彻底
Nutch 是一个开源Java 实现的搜索引擎。它提供了我们运行自己的搜索引擎所需的全部工具。包括全文搜索和Web爬虫。当然在百度百科上这种方法在Nutch1.2之后,已经不再适合这样描述Nutch了,因为在1.2版本之后,Nutch专注的只是爬取数据,而全文检索的部分彻底的交给Lucene和Solr,ES来做了,当然因为他们都是近亲关系,所以Nutch抓取完后的数据,非常easy的就能生成全文索引。序号 | 名称 | 职责描述 |
1 | Nutch1.8 | 主要负责爬取数据,支持分布式 |
2 | Hadoop1.2.0 | 使用MapReduce进行并行爬取,使用HDFS存储数据,Nutch的任务提交在Hadoop集群上,支持分布式 |
3 | Solr4.3.1 | 主要负责检索,对爬完后的数据进行搜索,查询,海量数据支持分布式 |
4 | IK4.3 | 主要负责,对网页内容与标题进行分词,便于全文检索 |
5 | Centos6.5 | Linux系统,在上面运行nutch,hadoop等应用 |
6 | Tomcat7.0 | 应用服务器,给Solr提供容器运行 |
7 | JDK1.7 | 提供JAVA运行环境 |
8 | Ant1.9 | 提供Nutch等源码编译 |
9 | 屌丝软件工程师一名 | 主角 |
http.agent.name mynutch http.robots.agents mynutch,* The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,* plugin.folders ./src/plugin plugins Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.
export HADOOP_HOME=/root/hadoop1.2 export PATH=$HADOOP_HOME/bin:$PATH ANT_HOME=/root/apache-ant-1.9.2 export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL export JAVA_HOME=/root/jdk1.7 export PATH=$JAVA_HOME/bin:$ANT_HOME/bin:$PATH export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
[root@master bin]# which hadoop /root/hadoop1.2/bin/hadoop [root@master bin]#
id content
java.lang.Exception: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354) Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:426) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366) at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 11 more Caused by: java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) ... 16 more Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) ... 19 more Caused by: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found. at org.apache.nutch.net.URLNormalizers.(URLNormalizers.java:123) at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:74) ... 24 more 2013-09-05 20:40:49,329 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1393)) - map 0% reduce 0% 2013-09-05 20:40:49,332 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1448)) - Job complete: job_local1315110785_0001 2013-09-05 20:40:49,332 INFO mapred.JobClient (Counters.java:log(585)) - Counters: 0 2013-09-05 20:40:49,333 INFO mapred.JobClient (JobClient.java:runJob(1356)) - Job Failed: NA Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357) at org.apache.nutch.crawl.Injector.inject(Injector.java:281) at org.apache.nutch.crawl.Crawl.run(Crawl.java:132) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) ========================================================================== 解决方法:在nutch-site.xml里面加入如下配置。 plugin.folders ./src/plugin Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.
./crawl urls mydir http://192.168.211.36:9001/solr/ 2