After fetching a file from HDFS with PySpark, I tried to modify it and then save it back to HDFS. For that I wrote the code below.
Code:
import subprocess
from subprocess import Popen, PIPE
from pyspark import SparkContext

cat = sc.textFile("/user/root/parsed.txt")
hrk = "@"
for line in cat.collect():
    if (code == "ID"):
        line = line.strip() + "|" + hrk
        line.saveAsTextFile("/user/root/testsprk")
        print(line)

But when I run this code, I get an error. I know there is some problem with my line variable, but I am not able to fix it.
Answer 0 (score: 1)
Because you collect() all of the data, cat.collect() is not an RDD but a plain Python list, and each line is just a string, which has no saveAsTextFile method. You should not collect all of the data onto the driver. Instead, transform the RDD with RDD.map and then write it out with RDD.saveAsTextFile:
def add_hrk_on_id(line):
    # code and hrk are assumed to be defined on the driver, as in the question
    if code == "ID":
        return line.strip() + "|" + hrk
    else:
        return line

cat.map(add_hrk_on_id).saveAsTextFile(path)
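For completeness, here is a minimal self-contained sketch of the whole job under the same assumptions. It reuses the HDFS paths from the question and treats code and hrk as ordinary driver-side variables that Spark captures into the map closure; note that code is never defined in the question, so the value below is only illustrative.

# Minimal end-to-end sketch. Assumptions: the HDFS paths from the question,
# and that code and hrk are ordinary driver-side variables (code is never
# defined in the question, so its value here is illustrative only).
from pyspark import SparkContext

sc = SparkContext(appName="append-hrk")

code = "ID"  # assumed flag; not given in the question
hrk = "@"    # marker string from the question

def add_hrk_on_id(line):
    # Runs on the executors; code and hrk travel inside the closure.
    if code == "ID":
        return line.strip() + "|" + hrk
    return line

cat = sc.textFile("/user/root/parsed.txt")
cat.map(add_hrk_on_id).saveAsTextFile("/user/root/testsprk")

Also note that saveAsTextFile will not overwrite existing output: the directory /user/root/testsprk must not already exist when the job runs.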