How do I read from S3 in pyspark running in local mode?

Asked: 2018-05-04 22:36:52

Tags: python apache-spark amazon-s3 pyspark

I am using PyCharm 2018.1 with Python 3.4 and Spark 2.3, installed via pip into a virtualenv. There is no Hadoop installation on the local host, so there is no Spark installation either (thus no SPARK_HOME, HADOOP_HOME, etc.).

When I try this:

from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

I get:

py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
: java.io.IOException: No FileSystem for scheme: s3

How can I read from S3 when running pyspark in local mode, without a full Hadoop installation on the local machine?

FWIW - this works great when I run it on an EMR node in non-local mode.

The following does not work (same error, although it does resolve and download the dependencies):

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:3.1.0" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

Same (bad) result:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars "/path/to/hadoop-aws-3.1.0.jar" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()\
    .setMaster("local")\
    .setAppName("pyspark-unittests")\
    .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf = conf)
inputFile = sc.textFile("s3://somebucket/file.csv")

3 Answers:

Answer 0 (score: 2):

You should use the s3a protocol when accessing S3 locally. Make sure you first add your access key and secret key to the SparkContext, like this:

sc = SparkContext(conf = conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

inputFile = sc.textFile("s3a://somebucket/file.csv")

Answer 1 (score: 2):

So Glennie's answer was close, but it won't work in your case. The key is to pick the right version of the dependencies. If you look at the jars bundled with the virtual environment (screenshot: jars directory), everything points to one version, 2.7.3, which is also what you need to use:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

You should verify which version your installation uses by checking the path venv/Lib/site-packages/pyspark/jars inside your project's virtual environment.
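If you would rather check from code than browse the directory, a minimal sketch like the one below (assuming a pip-installed pyspark, whose bundled jars live in a jars folder next to the package) prints the names of the bundled Hadoop jars:

import glob
import os
import pyspark

# The pip-installed pyspark ships its Hadoop jars next to the package;
# the hadoop-aws version you pass to --packages must match these.
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
for jar in sorted(glob.glob(os.path.join(jars_dir, "hadoop-*.jar"))):
    print(os.path.basename(jar))  # e.g. hadoop-common-2.7.3.jar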

After that, you can use s3a by default, or s3 by defining the same handler class for it:

# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")

print(s3File.count())
print(s3File.id())

The output is shown below:

(screenshot: Spark output)

Answer 2 (score: 0):

Preparation:

Add the following lines to your Spark config file; for my local pyspark it is:

/usr/local/spark/conf/spark-default.conf

spark.hadoop.fs.s3a.access.key=<your access key>
spark.hadoop.fs.s3a.secret.key=<your secret key>

Python file content:

from __future__ import print_function
import os

from pyspark import SparkConf
from pyspark import SparkContext

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"


if __name__ == "__main__":

    conf = SparkConf().setAppName("read_s3").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    my_s3_file3 = sc.textFile("s3a://store-test-1/test-file")
    print("file count:", my_s3_file3.count())