
Integrating PySpark with Jupyter Notebook

I am following this site to install Jupyter Notebook and PySpark, and to integrate the two.

When I needed to create a "Jupyter profile", I read that Jupyter profiles no longer exist, so I proceeded with the following commands:

$ mkdir -p ~/.ipython/kernels/pyspark

$ touch ~/.ipython/kernels/pyspark/kernel.json

I opened kernel.json and wrote the following:

{
 "display_name": "pySpark",
 "language": "python",
 "argv": [
  "/usr/bin/python",
  "-m",
  "IPython.kernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7",
  "PYTHONPATH": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python:/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/usr/local/Cellar/spark-2.0.0-bin-hadoop2.7/python/pyspark/shell.py",
  "PYSPARK_SUBMIT_ARGS": "pyspark-shell"
 }
}

The path to Spark is correct.

However, when I run jupyter console --kernel pyspark, I get this output:

MacBook:~ Agus$ jupyter console --kernel pyspark
/usr/bin/python: No module named IPython
Traceback (most recent call last):
  File "/usr/local/bin/jupyter-console", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/site-packages/jupyter_core/application.py", line 267, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 595, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-113>", line 2, in initialize
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/application.py", line 74, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 137, in initialize
    self.init_shell()
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/app.py", line 110, in init_shell
    client=self.kernel_client,
  File "/usr/local/lib/python2.7/site-packages/traitlets/config/configurable.py", line 412, in instance
    inst = cls(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 251, in __init__
    self.init_kernel_info()
  File "/usr/local/lib/python2.7/site-packages/jupyter_console/ptshell.py", line 305, in init_kernel_info
    raise RuntimeError("Kernel didn't respond to kernel_info_request")
RuntimeError: Kernel didn't respond to kernel_info_request

2 Answers

  • 4

    The easiest way is to use findspark. First, create an environment variable:

    export SPARK_HOME="{full path to Spark}"
    

    Then install findspark:

    pip install findspark
    

    Then launch a Jupyter notebook, and the following should work:

    import findspark
    findspark.init()
    
    import pyspark
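
    As a quick end-to-end check, here is a minimal sketch that builds a SparkContext and runs a small job. It assumes findspark can locate SPARK_HOME; the master setting and app name are arbitrary placeholders:

    import findspark
    findspark.init()

    import pyspark
    # Run locally on all cores; "findspark-demo" is an arbitrary app name
    sc = pyspark.SparkContext(master="local[*]", appName="findspark-demo")
    print(sc.parallelize(range(100)).sum())  # 4950
    sc.stop()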
    
  • 11

    There are many ways to integrate PySpark with a Jupyter notebook.

    1. Install Apache Toree.

      pip install jupyter
      pip install toree
      jupyter toree install --spark_home=path/to/your/spark_directory --interpreters=PySpark
    

    You can verify the installation with:

    jupyter kernelspec list
    

    You will get an entry for the Toree PySpark kernel:

    apache_toree_pyspark    /home/pauli/.local/share/jupyter/kernels/apache_toree_pyspark
    

    After that, if you want, you can install other interpreters such as Scala, SparkR, and SQL:

    jupyter toree install --interpreters=Scala,SparkR,SQL
    
    2. Add these lines to your .bashrc:

      export SPARK_HOME=/path/to/spark-2.2.0
      export PATH="$PATH:$SPARK_HOME/bin"
      export PYSPARK_DRIVER_PYTHON=jupyter
      export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
    

    Type pyspark in a terminal, and it will open a Jupyter notebook with a SparkContext already initialized, which the first cell can use directly, as in the sketch below.
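
    This is a minimal sketch assuming Spark 2.x, where pyspark's shell startup defines both sc and spark in the notebook:

      # 'sc' (SparkContext) and 'spark' (SparkSession) are created by
      # pyspark's shell startup; no imports or setup are needed here.
      rdd = sc.parallelize(range(10))
      print(rdd.sum())               # 45
      print(spark.range(5).count())  # 5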

    3. Install pyspark only as a Python package.

      pip install pyspark

    Now you can import pyspark like any other Python package.
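
    For instance, a minimal self-contained sketch (assuming Spark 2.x installed from pip, which bundles its own Spark jars; the app name is a placeholder):

      from pyspark.sql import SparkSession

      # Build a local session; no separate Spark installation is required
      spark = (SparkSession.builder
               .master("local[*]")
               .appName("pip-pyspark-demo")
               .getOrCreate())
      df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
      df.show()
      spark.stop()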
