Error installing Spark on Google Colab

Date: 2019-03-19 12:19:00

Tags: apache-spark hadoop pyspark google-colaboratory

I am getting an error while installing Spark on Google Colab. It says:

  

tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now

These are my steps:

[screenshot of the install commands]

8 Answers:

Answer 0 (score: 3):

This error is related to the link you used in the second line of your code. The following snippet worked for me on Google Colab. Don't forget to change the Spark version to the latest one and update the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"
import findspark
findspark.init()
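
Once findspark.init() has run, a quick way to verify the setup is to start a local SparkSession and print its version (an illustrative check, assuming the paths above; it is not part of the original answer):

# sanity check: start a local Spark session from the notebook
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("colab-test").getOrCreate()
print(spark.version)  # should print the version you downloaded, e.g. 3.0.0-preview2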

Answer 1 (score: 2):

#for the most recent update on 02/29/2020

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2

Answer 2 (score: 2):

The problem is caused by the download link you are using for Spark:

http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

To download Spark without any problems, get it from the archive site (https://archive.apache.org/dist/spark):

For example, the following download link from the archive works:

https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

Here is the complete code to install and set up Java, Spark, and PySpark:

# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz

# point the environment variables at the Java and Spark installation folders
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"


# install findspark using pip
!pip install -q findspark

For Python users, pyspark should also be installed with the following command:

!pip install pyspark
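
After these steps, a short smoke test confirms everything is wired together (an illustrative example, assuming the SPARK_HOME set above; not part of the original answer):

# locate Spark via findspark, start a session, and run a trivial job
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.range(5).show()  # prints a small single-column DataFrame if everything works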

Answer 3 (score: 1):

You are using a link to an old version; the following commands (with a newer version) will work:

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark

Answer 4 (score: 1):

To run Spark in Colab, we first need to install all the dependencies in the Colab environment, i.e. Apache Spark 2.3.2 with Hadoop 2.7, Java 8, and Findspark to locate Spark on the system. The installation can be carried out inside Colab's Jupyter notebook.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

If this error appears again (tar: Cannot open: No such file or directory), visit the Apache Spark site and get the latest build version:

  1. https://www-us.apache.org/dist/spark/
  2. http://apache.osuosl.org/spark/

Replace spark-2.4.3 above with that latest version.

Answer 6 (score: 1):

I tried the following commands and they seem to work fine.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark

I took the latest version, changed the download URL, and added the v flag to the tar command for verbose output.
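
As with the other answers, the environment variables still need to point at the extracted folder before findspark can locate Spark (a sketch matching the version downloaded above, assuming the default /content directory; not part of the original answer):

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2"

import findspark
findspark.init()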

Answer 7 (score: 0):

Just go to https://downloads.apache.org/spark/, pick the version you need from the folder, and then follow the steps in https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=m606eNuQgA82

Steps:

  1. Go to https://downloads.apache.org/spark/
  2. Choose a folder, e.g. "spark-3.0.1/"
  3. Copy the name of the file you want, e.g. "spark-3.0.1-bin-hadoop3.2.tgz" (it ends in .tgz)
  4. Paste it into the script below

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/FOLDER_YOU_CHOSE/FILE_YOU_CHOSE
!tar -xvf FILE_YOU_CHOSE
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/FILE_YOU_CHOSE"

import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
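
As a final check, a trivial DataFrame job shows that the local session works (an illustrative example, not part of the original answer):

# build a tiny DataFrame and display it to confirm the session is usable
df = spark.createDataFrame([(1, "spark"), (2, "colab")], ["id", "name"])
df.show()
print(spark.version)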