Reading from and writing to an S3 server (ECS) with Python (PySpark)

Time: 2019-11-25 12:30:40

Tags: python apache-spark amazon-s3 pyspark amazon-ecs

I need to read a fixed-width file from an S3 server (ECS), convert it to CSV, and write it back to the S3 server.

I am trying to use the smart_open library (https://pypi.org/project/smart-open/), but my code fails with the following error:

File "create_csv_ecs.py", line 11, in <module>
from smart_open import open
ImportError: No module named smart_open

However, when I don't use smart_open, it says the path does not exist.
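
For reference, a minimal sketch of how smart_open can be pointed at a non-AWS S3 endpoint. This assumes smart_open >= 5.x and boto3 are installed on the node running the code; the endpoint, credentials and bucket values are just the placeholders from the script below:

import boto3
from smart_open import open

# Hypothetical placeholders -- same endpoint/credentials as in the script below.
s3_client = boto3.client(
    "s3",
    endpoint_url="https://end_point",
    aws_access_key_id="access_key",
    aws_secret_access_key="sceret_key",
)

# smart_open >= 5.x routes all S3 traffic through the supplied boto3 client.
with open("s3://bucket/test_file.txt", "r",
          transport_params={"client": s3_client}) as fin:
    print(fin.readline())

The ImportError above simply means the package is not installed on the node that executes the code, so it would also need to be installed on (or shipped to) every Spark worker, not just the driver.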

My code (below) works when I run it locally, but when I try it on the Spark server and attempt to read the file from S3, it gives the error.

from pyspark import SparkContext
from pyspark import SparkConf

from smart_open import open
import csv
import re
import sys

conf = SparkConf().setAppName("Convert CSV - Python")
sc = SparkContext(conf=conf)


print ("Hello Spark")

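# Point Spark's s3a connector at the ECS endpoint and credentials (placeholders).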
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "sceret_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "end_point")

fname_in = 's3a://bucket/test_file.txt'
fname_out = 's3a://bucket/test_file_output.txt'

cols = [(0, 1), (1, 6), (6, 9), (9, 10), (10, 15), (15, 25), (25, 28), (28, 38)]

with open(fname_in) as fin, open(fname_out, 'wt') as fout:

    reader = csv.reader(fin)
    header = next(reader)  # read header
    print(header)
    acc_period = list()
    o = [num for num in re.findall(r'\d*', str(header)) if num]
    date = sorted(o)[0]
    print(o)
    print(date)
    year, month, day = date[:4], date[4:6], date[6:]
    acc_period.extend(('{0:03d}'.format(int(month)), '{0:03d}'.format(int(month) - 1)))
    print(acc_period)

    writer = csv.writer(fout, delimiter=",", lineterminator="\n")

    writer.writerow(["FLAG", "CODE", "SOURCE", "STATUS", "UNIT", "ID"
                 , "LINE", "GROUP"])

    for line in fin:
        line = line.rstrip()  # removing the '\n' and other trailing whitespaces
        row = []  # init -- empty list
        data = [line[c[0]:c[1]] for c in cols]
        # print("data:",data[16])
        # Excluding rows where account not start with 4 or 5 and (int(str(data[11])) in acc_period)
        writer.writerow(data)
sc.stop()
sys.exit()

My Spark server version is 2.3.0.

The sample file looks like this:

HGLOABCD8PSGL_ZXFH J20190603NXT_APAC
D30056747PRD0091921170811405ACTUAL    ACTUAL    6222020190110001508014
D30056747PRD0091921170811405ACTUAL    ACTUAL    6222020190110001508014
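
For comparison, a sketch of the same fixed-width to CSV conversion done entirely with Spark's DataFrame API, which relies only on the s3a settings already configured above (no smart_open needed on the executors). The paths and offsets are the ones from the script; filtering out the leading "H" record is an assumption based on the sample file above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("Convert CSV - Python").getOrCreate()

# Same (start, end) character offsets and column names as in the script above.
cols = [("FLAG", 0, 1), ("CODE", 1, 6), ("SOURCE", 6, 9), ("STATUS", 9, 10),
        ("UNIT", 10, 15), ("ID", 15, 25), ("LINE", 25, 28), ("GROUP", 28, 38)]

raw = spark.read.text("s3a://bucket/test_file.txt")    # one row per input line
raw = raw.filter(~raw.value.startswith("H"))           # drop the header record (assumption)

# substring() is 1-based, hence start + 1.
df = raw.select(*[substring("value", start + 1, end - start).alias(name)
                  for name, start, end in cols])

df.write.mode("overwrite").csv("s3a://bucket/test_file_output", header=True)

Note that df.write.csv() produces a directory of part files rather than a single output file.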

0 Answers

No answers yet.