After fetching a file from HDFS with PySpark, I tried to modify it and then save it back to HDFS. For that I wrote the code below.
Code:
import subprocess
from subprocess import Popen, PIPE
from pyspark import SparkContext

cat = sc.textFile("/user/root/parsed.txt")
hrk = "@"
for line in cat.collect():
    if (code == "ID"):
        line = line.strip() + "|" + hrk
        line.saveAsTextFile("/user/root/testsprk")
        print(line)

But when I run this code, I get an error. I know there is some problem with my line variable, but I am not able to fix it.
Answer 0 (score: 1)
Because you collect() all of the data, cat.collect() is not an RDD but a plain Python list, and each line is just a string, which has no saveAsTextFile method. You should not collect all of the data onto the driver. Instead, transform the RDD with RDD.map and then write it out with RDD.saveAsTextFile:
def add_hrk_on_id(line):
    # code and hrk are assumed to be defined on the driver, as in the question
    if code == "ID":
        return line.strip() + "|" + hrk
    else:
        return line

cat.map(add_hrk_on_id).saveAsTextFile(path)
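For completeness, here is a minimal self-contained sketch of the whole job under the same assumptions. It reuses the HDFS paths from the question and treats code and hrk as ordinary driver-side variables that Spark captures into the map closure; note that code is never defined in the question, so the value below is only illustrative.

# Minimal end-to-end sketch. Assumptions: the HDFS paths from the question,
# and that code and hrk are ordinary driver-side variables (code is never
# defined in the question, so its value here is illustrative only).
from pyspark import SparkContext

sc = SparkContext(appName="append-hrk")

code = "ID"  # assumed flag; not given in the question
hrk = "@"    # marker string from the question

def add_hrk_on_id(line):
    # Runs on the executors; code and hrk travel inside the closure.
    if code == "ID":
        return line.strip() + "|" + hrk
    return line

cat = sc.textFile("/user/root/parsed.txt")
cat.map(add_hrk_on_id).saveAsTextFile("/user/root/testsprk")

Also note that saveAsTextFile will not overwrite existing output: the directory /user/root/testsprk must not already exist when the job runs.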