It has been a few days already, but I cannot download from a public Amazon bucket using Spark :(
Here is the spark-shell command:
spark-shell --master yarn -v --jars file:/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar,file:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar --driver-class-path=/usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar:/usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar
The application starts, and the shell waits at the prompt:
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val data1 = sc.textFile("s3a://my-bucket-name/README.md")
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 242.1 KB, free 246.7 MB)
18/12/25 13:06:40 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.2 KB, free 246.6 MB)
18/12/25 13:06:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop-edge01:3545 (size: 24.2 KB, free: 246.9 MB)
18/12/25 13:06:40 INFO SparkContext: Created broadcast 0 from textFile at <console>:24
data1: org.apache.spark.rdd.RDD[String] = s3a://my-bucket-name/README.md MapPartitionsRDD[1] at textFile at <console>:24

scala> data1.count()
java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:348)
  at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
  at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:97)
  at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:206)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
  at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
  ... 49 elided
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.StorageStatistics
  at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  ... 77 more

scala>
All AWS access keys and secret keys are set in hadoop/core-site.xml, as described in "Hadoop-AWS module: Integration with Amazon Web Services".
The bucket is public: anyone can download from it (tested with curl -O).
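For completeness: the same S3A settings that live in core-site.xml can also be applied at runtime on the SparkContext's Hadoop configuration. This is only a minimal sketch, assuming the standard fs.s3a.* property names; the key values are placeholders, and the anonymous-provider knob, which would make credentials unnecessary for a public bucket, only exists in newer hadoop-aws versions (around Hadoop 2.8, if I remember right):

    // Sketch: S3A settings applied at runtime instead of core-site.xml (run in spark-shell).
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  // placeholder
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  // placeholder

    // For a public bucket, anonymous access should be enough (hadoop-aws 2.8+):
    sc.hadoopConfiguration.set(
      "fs.s3a.aws.credentials.provider",
      "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")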
As you can see, all the .jars are provided by Hadoop itself, from the /usr/local/hadoop/share/hadoop/tools/lib/ folder.
There are no extra settings in spark-defaults.conf; only what is passed on the command line.
Neither of the two jars provides this class:
jar tf /usr/local/hadoop/share/hadoop/tools/lib/hadoop-aws-2.9.2.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)

jar tf /usr/local/hadoop/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.199.jar | grep org/apache/hadoop/fs/StorageStatistics
(no result)
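That is actually expected: org.apache.hadoop.fs.StorageStatistics lives in hadoop-common (it was added in Hadoop 2.8), not in hadoop-aws or the AWS SDK bundle. A quick way to see which jar a class is really loaded from is to ask the JVM from inside spark-shell; this is just a diagnostic sketch, the whereIs helper is my own:

    // Diagnostic sketch: find the jar a class is loaded from (run in spark-shell).
    def whereIs(className: String): String =
      try {
        Option(Class.forName(className).getProtectionDomain.getCodeSource)
          .map(_.getLocation.toString)
          .getOrElse(s"$className loaded from the bootstrap classpath")
      } catch {
        case _: ClassNotFoundException => s"$className not found on the classpath"
      }

    whereIs("org.apache.hadoop.fs.StorageStatistics")  // missing when hadoop-common < 2.8 is on the classpath
    whereIs("org.apache.hadoop.fs.FileSystem")         // shows which hadoop-common jar is actually in use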
What should I do? Did I forget to add another jar? What is the exact combination of hadoop-aws and aws-java-sdk-bundle to use? Which versions?
Hmm... finally found the problem.
The main issue is that my Spark is pre-built for Hadoop. It is labelled "v2.4.0 pre-built for Hadoop 2.7 and later", which is a somewhat misleading title, as my struggle above shows. In fact, Spark ships with its own Hadoop jars, at a different version. A listing of /usr/local/spark/jars/ shows that it has:
hadoop-common-2.7.3.jar
hadoop-client-2.7.3.jar
....
It is simply missing hadoop-aws and aws-java-sdk. I dug around a bit in the Maven repository: hadoop-aws v2.7.3 and its dependency aws-java-sdk v1.7.4, and voila! I downloaded those jars and sent them as parameters to Spark, like this:
spark-shell
    --master yarn
    -v
    --jars file:/home/aws-java-sdk-1.7.4.jar,file:/home/hadoop-aws-2.7.3.jar
    --driver-class-path=/home/aws-java-sdk-1.7.4.jar:/home/hadoop-aws-2.7.3.jar
That did the job!
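With matching jar versions in place, the exact same session that failed above should now run cleanly; a minimal re-check (same bucket and file as before):

    // Re-check: the same read that previously died with NoClassDefFoundError.
    val data1 = sc.textFile("s3a://my-bucket-name/README.md")
    data1.count()  // should now return the line count instead of throwing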
I just wonder why none of the jars coming with Hadoop itself did the trick, even when sent as parameters via --jars and --driver-class-path. Somehow Spark picked up its own jars instead of the ones I sent.
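My reading (not something I have confirmed): Spark's own hadoop-common-2.7.3.jar from /usr/local/spark/jars/ is already on the driver and executor classpath and takes precedence over jars passed in, so hadoop-aws-2.9.2 ends up calling into a 2.7.3 hadoop-common that predates StorageStatistics. A quick way to check which Hadoop version a Spark session actually loaded is Hadoop's VersionInfo utility:

    // Diagnostic sketch: which Hadoop version did this Spark session actually load?
    import org.apache.hadoop.util.VersionInfo
    println(VersionInfo.getVersion)  // e.g. "2.7.3" for Spark 2.4.0 pre-built for Hadoop 2.7

If this prints 2.7.3, then the version of hadoop-aws has to match it (hence hadoop-aws-2.7.3 plus its matching aws-java-sdk-1.7.4), regardless of what the full Hadoop installation provides.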