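I want to test some basics with Spark SQL. I want to load a CSV file that is saved on my laptop and run a few SQL queries on it. But somehow I cannot load the data with sqlContext. I get this error: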
Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.
However, I am not using Hive.
I am on Windows 10 and installed Python with Anaconda. I installed Spark 2.0.2 prebuilt for Hadoop 2.6. I use the IPython Notebook as the user interface.
My code is as follows:
file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
df = sqlContext \
    .read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferschema", "true") \
    .option("mode", "DROPMALFORMED") \
    .load(file)
The problem seems to lie with Spark SQL, because I can load the same file with:
textFile=sc.textFile("C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv")
I get the same error when I try to run the example from the Spark SQL documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.json("C:/Andra/spark-2.0.2-bin-hadoop2.6/examples/src/main/resources/people.json")
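My impression was that I could use Spark SQL without Hive, since the data I am working with is saved on my laptop. Moreover, the same documentation only says:
"One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section."
There is also a separate example that creates a Spark session with Hive support. If Hive were mandatory, the example above (without Hive) would be pointless.
Still, I wanted to configure Hive to see whether that would solve the problem. The documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables) states:
"Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/."
However, I cannot find those files.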
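So my questions are:
Do I need Hive in order to use Spark SQL?
If not, what do I have to do to get Spark SQL working?
If so, how can I configure it properly, given that I cannot find the required files?
Any help is appreciated! Thanks!
Here is the full error output: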
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
in ()
      1 file = "C:/Andra/spark-2.0.2-bin-hadoop2.6/zip.csv"
----> 2 df = sqlContext .read .format("com.databricks.spark.csv") .option("header", "true") .option("inferschema", "true") .option("mode", "DROPMALFORMED") .load(file)

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\readwriter.pyc in load(self, path, format, schema, **options)
    145         self.options(**options)
    146         if isinstance(path, basestring):
--> 147             return self._df(self._jreader.load(path))
    148         elif path is not None:
    149             if type(path) != list:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\pyspark\sql\utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

C:\Andra\spark-2.0.2-bin-hadoop2.6\python\lib\py4j-0.10.3-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o110.load.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
    at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
    at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
    at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
    at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
    at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
    at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
    at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    ... 33 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 39 more
Caused by: java.lang.NullPointerException
    at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:236)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 44 more
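With a trace this long, it can help to pull out only the "Caused by:" chain, since the innermost entry is often the most specific clue. The following is a minimal sketch in plain Python, not part of Spark or py4j; the function names `cause_chain` and `root_cause` are made up for illustration, and the sample text is a shortened copy of the trace above.

```python
import re

# Shortened copy of the trace above; only the "Caused by:" markers matter here.
SAMPLE_TRACE = """
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Caused by: java.lang.NullPointerException
    at org.apache.thrift.transport.TSocket.open(TSocket.java:170)
"""


def cause_chain(trace):
    """Return the exception class name after each 'Caused by:' marker, in order."""
    return re.findall(r"Caused by:\s*([\w.$]+)", trace)


def root_cause(trace):
    """Return the innermost (last) cause, or None if the trace has no causes."""
    chain = cause_chain(trace)
    return chain[-1] if chain else None


print(root_cause(SAMPLE_TRACE))  # prints java.lang.NullPointerException
```

Here the innermost cause is a NullPointerException raised in TSocket.open, called from HiveMetaStoreClient.open, i.e. while the Hive metastore client was being opened.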
AEDWIP.. 7
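I recently ran into the same problem. In my case, I was running two Python Jupyter notebooks on my local machine at the same time. The first notebook worked fine. The second kept throwing the dreaded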
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
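I am not sure how the permissions work. It looks as if whichever notebook runs first somehow locks the local metastore. My understanding is that the metastore cannot be shared between two different sessions.
Maybe someone knows how to enable multiple notebooks?
Andy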
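One possible way to run two notebooks side by side is to give each its own working directory before Spark starts. This is a sketch under an assumption the answer above does not confirm: that the "local metastore" is the default embedded Derby database, which is created as a `metastore_db` folder in the JVM's working directory and can be opened by only one JVM at a time. The helper names `enter_private_workdir` and `start_session` are hypothetical, and the approach has not been checked against the Windows setup in the question.

```python
import os
import tempfile


def enter_private_workdir(prefix="spark-notebook-"):
    """Create a fresh directory and make it the current working directory.

    Hypothetical helper: call it before any SparkContext/SparkSession exists,
    so that the JVM (and any metastore_db folder it creates) starts there.
    """
    workdir = tempfile.mkdtemp(prefix=prefix)
    os.chdir(workdir)
    return workdir


def start_session(app_name="notebook"):
    """Start (or reuse) a SparkSession after the working directory was switched."""
    # Imported lazily so enter_private_workdir works without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).getOrCreate()
```

In each notebook, the first cell would call `enter_private_workdir()` and then `start_session()`. If the notebook was launched in a way that already created `sc` or `sqlContext`, the JVM is running by then and changing directory afterwards would probably have no effect.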