I'm getting an error when installing Spark on Google Colab. It says:
tar: spark-2.2.1-bin-hadoop2.7.tgz: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
These are my steps:
Answer 0 (score: 3)
This error comes from the link used in the second line of your code. The following snippet works for me on Google Colab. Don't forget to change the Spark version to the latest one and adjust the SPARK_HOME path accordingly. You can find the latest versions here: https://downloads.apache.org/spark/
# install Java 8 (Spark's runtime dependency)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download and unpack a current Spark build
!wget -q https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop2.7.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop2.7.tgz
# findspark lets Python locate the Spark installation
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop2.7"

import findspark
findspark.init()
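To confirm the install worked, you can start a local session and check the version (a small verification sketch, not part of the original answer; local[*] is the usual master for Colab's single machine):

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)  # should match the version you downloaded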
Answer 1 (score: 2)
# for the most recent update as of 02/29/2020
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2"
Answer 2 (score: 2)
The problem comes from the download link you are using for Spark:
http://apache.osuosl.org/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
To download Spark without this problem, get it from the archive site (https://archive.apache.org/dist/spark) instead. For example, the following download link from the archive works:
https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
Here is the complete code to install and set up Java, Spark, and PySpark:
# install java
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# install spark (change the version number if needed)
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
# unzip the spark file to the current folder
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# set your spark folder to your system path environment.
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
# install findspark using pip
!pip install -q findspark
Python users should also install pyspark, using the following command:
!pip install pyspark
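Note that once pyspark is installed via pip, the package is importable directly, without findspark (an observation about the pip distribution, not part of the original answer):

import pyspark
print(pyspark.__version__)  # version of the pip-installed package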
Answer 3 (score: 1)
You are using a link to an old version. The following commands (for a newer version) will work:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark
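As written, this snippet installs findspark but never points it at the Spark folder; a completion sketch, assuming the archive was unpacked into Colab's default /content directory:

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
import findspark
findspark.init()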
Answer 4 (score: 1)
To run Spark in Colab, we first need to install all the dependencies in the Colab environment, i.e. Apache Spark 2.3.2 with Hadoop 2.7, Java 8, and findspark to locate Spark on the system. These tools can be installed inside Colab's Jupyter notebook.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark
If you get this error again: Cannot open: No such file or directory
Visit the Apache Spark site and get the latest build version:
1. https://www-us.apache.org/dist/spark/
2. http://apache.osuosl.org/spark/
Then replace spark-2.4.3 in the commands above with that latest version.
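To make that substitution less error-prone, the version can be kept in a single Python variable and interpolated into the shell commands, which Colab supports via {} interpolation (a convenience sketch; SPARK_VERSION is a hypothetical name, and the archive mirror is used here because it keeps old builds available):

SPARK_VERSION = "2.4.3"  # replace with the latest build number
!wget -q https://archive.apache.org/dist/spark/spark-{SPARK_VERSION}/spark-{SPARK_VERSION}-bin-hadoop2.7.tgz
!tar xf spark-{SPARK_VERSION}-bin-hadoop2.7.tgz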
Answer 6 (score: 1)
I tried the following commands and they seem to work:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark
I picked the latest version, changed the download URL, and added the v flag to the tar command for verbose output.
Answer 7 (score: 0)
Just go to https://downloads.apache.org/spark/, choose the version you need from the folder, and follow the instructions in https://colab.research.google.com/github/asifahmed90/pyspark-ML-in-Colab/blob/master/PySpark_Regression_Analysis.ipynb#scrollTo=m606eNuQgA82
Steps:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget https://downloads.apache.org/spark/FOLDER_YOU_CHOSE/FILE_YOU_CHOSE
!tar -xvf FILE_YOU_CHOSE
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/FILE_YOU_CHOSE"
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
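A quick sanity check once the session exists (a minimal example, not from the original answer):

spark.range(5).show()  # prints a one-column DataFrame with ids 0-4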