I am new to Apache Spark, and I apparently installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
16/01/08 14:46:50 INFO Remoting: Starting remoting
16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.64:50199]
16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/01/08 14:46:52 INFO SparkUI: Started SparkUI at http://192.168.1.64:4040
16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
16/01/08 14:46:53 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50201.
16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
SparkContext available as sc, HiveContext available as sqlContext.
>>>
I want to start playing around with it to learn more about MLlib. However, I write my Python scripts in PyCharm. The problem: when I go to PyCharm and try to call pyspark, PyCharm cannot find the module. I tried adding the path to PyCharm as follows:

Then, following a blog post, I tried this:
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME'] = "/Users/user/Apps/spark-1.5.2-bin-hadoop2.4"

# Append pyspark to Python Path
sys.path.append("/Users/user/Apps/spark-1.5.2-bin-hadoop2.4/python/pyspark")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
And I still cannot get PyCharm working with PySpark. Any idea how to "link" PyCharm with apache-pyspark?

Update:

I then searched for the apache-spark and python paths in order to set PyCharm's environment variables:

apache-spark path:
user@MacBook-Pro-User-2:~$ brew info apache-spark
apache-spark: stable 1.6.0, HEAD
Engine for large-scale data processing
https://spark.apache.org/
/usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
  Poured from bottle
From: https://github.com/Homebrew/homebrew/blob/master/Library/Formula/apache-spark.rb
python path:
user@MacBook-Pro-User-2:~$ brew info python
python: stable 2.7.11 (bottled), HEAD
Interpreted, interactive, object-oriented programming language
https://www.python.org
/usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *
Then, using the information above, I tried to set the environment variables as follows:

Any idea how to correctly link PyCharm with pyspark?

Then, when I run a python script with the above configuration, I get this exception:
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/test_1.py
Traceback (most recent call last):
  File "/Users/user/PycharmProjects/spark_examples/test_1.py", line 1, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark
UPDATE: I then tried the configurations proposed by @zero323.

Configuration 1:
/usr/local/Cellar/apache-spark/1.5.1/
Out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
CHANGES.txt           NOTICE     libexec/
INSTALL_RECEIPT.json  README.md
LICENSE               bin/
Configuration 2:
/usr/local/Cellar/apache-spark/1.5.1/libexec
Out:
user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
R/       bin/   data/  examples/  python/
RELEASE  conf/  ec2/   lib/       sbin/
Answer from zero323 (score: 101):
With SPARK-1267 being merged, you should be able to simplify the process by pip-installing Spark in the environment you use for PyCharm development (a quick sanity check is sketched after these steps):

Go to File -> Settings -> Project Interpreter
Click the install button and search for PySpark
Click the Install Package button.
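If the package installs cleanly, a quick sanity check along these lines should run from PyCharm without an ImportError (a minimal sketch, assuming a pip-installed PySpark 2.x and that the script uses the same interpreter; the app name is arbitrary):

from pyspark.sql import SparkSession

# Local session using all available cores; "pycharm-check" is just an example name.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pycharm-check") \
    .getOrCreate()

print(spark.version)           # installed Spark version
print(spark.range(5).count())  # trivial job to confirm the session actually runs

spark.stop()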
Create a Run configuration:

Go to Run -> Edit Configurations
Add a new Python configuration
Set the Script path so that it points to the script you want to execute
Edit the Environment variables field so that it contains at least:
SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)

PYTHONPATH - it should contain $SPARK_HOME/python and, optionally, $SPARK_HOME/python/lib/py4j-some-version.src.zip if it is not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3). A script-level alternative to setting these variables is sketched after these steps.
Apply the settings
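If you would rather not rely on the run configuration, roughly the same effect can be had at the top of the script itself, much like the snippet from the question. This is only a sketch; the Homebrew path below is an example and the py4j zip name must match the version shipped with your Spark installation:

import glob
import os
import sys

# Example Homebrew location; point this at your actual Spark installation.
os.environ.setdefault("SPARK_HOME", "/usr/local/Cellar/apache-spark/1.5.1/libexec")
spark_home = os.environ["SPARK_HOME"]

# Equivalent of PYTHONPATH: Spark's Python package plus the bundled py4j zip.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

from pyspark import SparkContext  # should now resolve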
Add the PySpark library to the interpreter path (required for code completion):

Go to File -> Settings -> Project Interpreter
Open the settings for the interpreter you want to use with Spark
Edit the interpreter paths so that they contain the path to $SPARK_HOME/python (and the Py4J zip, if required)
Save the settings

Optionally, install or add to the path type annotations matching the installed Spark version to get better completion and static error detection (disclaimer - I am an author of the project).

Finally, use the newly created configuration to run your script.
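For example, a minimal script to run with that configuration could look like the following (a sketch; the master, app name, and numbers are arbitrary):

from pyspark import SparkConf, SparkContext

# Runs locally against the SPARK_HOME configured in the run configuration.
conf = SparkConf().setMaster("local[*]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)

# Trivial RDD job to confirm the driver starts and tasks execute.
rdd = sc.parallelize(range(100))
print(rdd.map(lambda x: x * x).sum())

sc.stop()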
Here is how I solved this on Mac OS X.
brew install apache-spark
Add this to ~/.bash_profile:
export SPARK_VERSION=`ls /usr/local/Cellar/apache-spark/ | sort | tail -1`
export SPARK_HOME="/usr/local/Cellar/apache-spark/$SPARK_VERSION/libexec"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
Add pyspark and py4j to the content root (use the correct Spark version):

/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/py4j-0.9-src.zip
/usr/local/Cellar/apache-spark/1.6.1/libexec/python/lib/pyspark.zip
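With those two entries on the content root, a quick check from a small script or the PyCharm console should show both packages resolving from the zips (a sketch, assuming the paths above match your installed version):

import py4j
import pyspark

# Both should point into the zips added to the content root.
print(pyspark.__file__)
print(py4j.__file__)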
This is the setup that works for me.

Set up IntelliSense:

Click File -> Settings -> Project: -> Project Interpreter
Click the gear icon to the right of the Project Interpreter dropdown
Click "More..." from the context menu
Choose the interpreter, then click the "Show Paths" icon (bottom right)
Click the + icon to add the following paths:
\python\lib\py4j-0.9-src.zip
\bin\python\lib\pyspark.zip
Click OK, OK, OK

Go ahead and test your new IntelliSense capabilities.

Configure pyspark in PyCharm (Windows):
File menu - Settings - Project Interpreter - (gear shape) - More - (tree below funnel) - (+) - [add the python folder from the Spark installation, and then py4j-*.zip] - click OK

Make sure SPARK_HOME is set in the Windows environment; PyCharm will take it from there. To confirm:
Run menu - edit configurations - environment variables - [...] - show
Optionally, set SPARK_CONF_DIR in the environment variables.