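I need to read a fixed-width file from an S3 server (an ECS deployment), convert it to CSV, and write the result back to the S3 server.
I am trying to use the smart_open library (https://pypi.org/project/smart-open/), but my code fails with this error: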
File "create_csv_ecs.py", line 11, in <module>
from smart_open import open
ImportError: No module named smart_open
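For context, this kind of ImportError usually means the interpreter running the script cannot see the package. Below is a minimal sketch for checking that, assuming it is run with the same Python that spark-submit uses. `module_available` is a hypothetical helper name introduced here for illustration; it is not part of smart_open or PySpark.

```python
import importlib
import sys


def module_available(name):
    """Return True if `name` can be imported by the current interpreter."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


# Show which interpreter is running and whether it can import smart_open.
print(sys.executable)
print(module_available("smart_open"))
```

If the second line prints False on the Spark server but True locally, the package is probably installed for a different interpreter than the one running the job.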
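However, when I do not use smart_open, it says the path does not exist.
The code below works when I run it locally, but when I run it on the Spark server and try to read the file from S3, it fails with the error above.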
from pyspark import SparkContext
from pyspark import SparkConf
from smart_open import open
import csv
import re
import sys
conf = SparkConf().setAppName("Convert CSV - Python")
sc = SparkContext(conf=conf)
print ("Hello Spark")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "access_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "secret_key")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "end_point")
fname_in = 's3a://bucket/test_file.txt'
fname_out = 's3a://bucket/test_file_output.txt'
cols = [(0, 1), (1, 6), (6, 9), (9, 10), (10, 15), (15, 25), (25, 28), (28, 38)]
with open(fname_in) as fin, open(fname_out, 'wt') as fout:
    reader = csv.reader(fin)
    header = next(reader)  # read header
    print(header)
    acc_period = list()
    # collect every run of digits found in the header record
    o = [num for num in re.findall(r'\d*', str(header)) if num]
    date = sorted(o)[0]
    print(o)
    print(date)
    year, month, day = date[:4], date[4:6], date[6:]
    acc_period.extend(('{0:03d}'.format(int(month)), '{0:03d}'.format(int(month) - 1)))
    print(acc_period)
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    writer.writerow(["FLAG", "CODE", "SOURCE", "STATUS", "UNIT", "ID",
                     "LINE", "GROUP"])
    for line in fin:
        line = line.rstrip()  # remove the '\n' and other trailing whitespace
        row = []  # init -- empty list
        data = [line[c[0]:c[1]] for c in cols]
        # print("data:", data[16])
        # Exclude rows where the account does not start with 4 or 5 and (int(str(data[11])) in acc_period)
        writer.writerow(data)
sc.stop()
sys.exit()
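To separate the conversion logic from the S3 access problem, here is a minimal sketch of the fixed-width slicing on its own, assuming Python 3 and using in-memory streams instead of S3. `fixed_width_to_csv` is a hypothetical helper name, and the short sample record is cut from the data lines in this post; the column offsets and header names are the ones from the code above.

```python
import csv
import io

# (start, end) offsets and column names taken from the code above.
COLS = [(0, 1), (1, 6), (6, 9), (9, 10), (10, 15), (15, 25), (25, 28), (28, 38)]
HEADER = ["FLAG", "CODE", "SOURCE", "STATUS", "UNIT", "ID", "LINE", "GROUP"]


def fixed_width_to_csv(fin, fout, cols=COLS, header=HEADER):
    """Skip the first record of `fin`, then write each line as a CSV row."""
    next(fin)  # skip the header record of the fixed-width file
    writer = csv.writer(fout, delimiter=",", lineterminator="\n")
    writer.writerow(header)
    for line in fin:
        line = line.rstrip("\r\n")  # drop only the line ending
        writer.writerow([line[start:end] for start, end in cols])


# In-memory stand-ins for the S3 input and output objects.
sample = "HEADER LINE\nD30056747PRD0091921170811405ACTUAL\n"
out = io.StringIO()
fixed_width_to_csv(io.StringIO(sample), out)
print(out.getvalue())
```

Because the function only needs file-like objects, the same call should work with whatever stream ends up opening the S3 paths, once that part is sorted out.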
My Spark server version is 2.3.0.
A sample file looks like this:
HGLOABCD8PSGL_ZXFH J20190603NXT_APAC
D30056747PRD0091921170811405ACTUAL ACTUAL 6222020190110001508014
D30056747PRD0091921170811405ACTUAL ACTUAL 6222020190110001508014
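For the header record shown above, the date lookup in my code relies on string sorting of all digit runs. A slightly more explicit variant is sketched below, assuming the header always contains exactly one 8-digit YYYYMMDD date. `accounting_periods` is a hypothetical helper name, and the sketch keeps the same behaviour as the original code for January (the previous period comes out as 000).

```python
import re


def accounting_periods(header_line):
    """Return the current and previous month as zero-padded 3-digit strings."""
    match = re.search(r"\d{8}", header_line)  # first run of 8 digits
    if match is None:
        raise ValueError("no YYYYMMDD date found in header")
    month = int(match.group()[4:6])
    return ["{0:03d}".format(month), "{0:03d}".format(month - 1)]


periods = accounting_periods("HGLOABCD8PSGL_ZXFH J20190603NXT_APAC")
print(periods)
```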