Question

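I am trying to apply some transformations to my RDD, and for that I call a function using map. However, the function is not being called. Can someone tell me what I am doing wrong here?

I can see that the test function gets called, but store_past_info does not.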

def store_past_info(row):
    # Declare all globals once, before any of them is used.
    global prv_transaction_number, return_occured, group_id
    print("------------------- store_past_info  ------------------------------")

    if row["transactiontype"] == "Return":
        # Remember the return so that the purchase that follows can reuse its number.
        prv_transaction_number = row["transnumber"]
        return_occured = True
        group_id.append(row["transnumber"])

    if row["transactiontype"] == "Purchase":
        if return_occured:
            group_id.append(prv_transaction_number)
        else:
            group_id.append(row["transnumber"])

    print(group_id)


def test(rdd):
    print("------------------- test  ------------------------------")
    rdd.map(store_past_info).collect()
    print(group_id)

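This is how it works in the store:

When an item is purchased, an ID is generated.
If you want to return a few items from a purchase, two entries are generated:
1. A return entry with a new ID covering all the products, with org_id set to the ID of the purchase you are returning.
2. A new purchase entry for the items you are keeping, with the same ID as the original purchase.

Input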

Date        Type        Id      org_id
25-03-2018  Purchase    111 
25-03-2018  Purchase    112 
26-03-2018  Return      113     111    
26-03-2018  Purchase    111

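Output

I want to add a new group_id column that gives a return and the corresponding purchase that follows it (step 2 above) the same ID. The customer does not actually make this purchase; it is how the system records entries for every return.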

Date        Type        Id      org_id  group_id
25-03-2018  Purchase    111             111 
25-03-2018  Purchase    112             112
26-03-2018  Return      113     111     113
26-03-2018  Purchase    111             113

Answer 1

IIUC, I believe you can get the desired output using DataFrames, a pyspark.sql.Window function, and crossJoin().

First, convert your rdd to a DataFrame:

df = rdd.toDF()  # you may have to specify the column names
df.show()
#+----------+--------+---+------+
#|      Date|    Type| Id|org_id|
#+----------+--------+---+------+
#|25-03-2018|Purchase|111|  null|
#|25-03-2018|Purchase|112|  null|
#|26-03-2018|  Return|113|   111|
#|26-03-2018|Purchase|111|  null|
#+----------+--------+---+------+

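Next we need to add an Index column to keep track of the order of the rows. We can use pyspark.sql.functions.monotonically_increasing_id(). This guarantees that the values are increasing (so they can be ordered), but not that they are consecutive.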

import pyspark.sql.functions as f
df = df.withColumn('Index', f.monotonically_increasing_id())
df.show()
#+----------+--------+---+------+-----------+
#|      Date|    Type| Id|org_id|      Index|
#+----------+--------+---+------+-----------+
#|25-03-2018|Purchase|111|  null| 8589934592|
#|25-03-2018|Purchase|112|  null|17179869184|
#|26-03-2018|  Return|113|   111|34359738368|
#|26-03-2018|Purchase|111|  null|42949672960|
#+----------+--------+---+------+-----------+

The ordering is important because you want to look for rows that come after a return.
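The Index values in the output above are large and unevenly spaced for a reason. The Spark documentation describes the current implementation of monotonically_increasing_id() as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. As a rough illustration in plain Python (no Spark required; decode_index is a made-up helper for this sketch, not a PySpark function), the four values decode to the first record of four different partitions:

```python
# Sketch only: decode_index is a hypothetical helper, not part of PySpark.
# Assumes the documented layout: partition ID in the upper 31 bits,
# per-partition record number in the lower 33 bits.
RECORD_BITS = 33


def decode_index(index):
    partition_id = index >> RECORD_BITS
    record_number = index & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The Index values from the df.show() output above.
indexes = [8589934592, 17179869184, 34359738368, 42949672960]
decoded = [decode_index(i) for i in indexes]

for index, (partition_id, record_number) in zip(indexes, decoded):
    print(index, "-> partition", partition_id, "record", record_number)
```

This is why the values can be sorted but should not be treated as row numbers.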

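Next, use crossJoin to join the DataFrame to itself.

Since this returns the Cartesian product, we filter it down to the rows that meet either of the following conditions:

l.Index = r.Index (essentially joining a row to itself)

(l.Id = r.org_id) AND (l.Index > r.Index) (the Id equals the org_id of an earlier row; this is where the Index column is useful)

Then we add a column for group_id and set it to r.Id if the second condition is met. Otherwise, we set this column to None.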

df1 = df.alias('l').crossJoin(df.alias('r'))\
    .where('(l.Index = r.Index) OR ((l.Id = r.org_id) AND (l.Index > r.Index))')\
    .select(
        'l.Index',
        'l.Date',
        'l.Type',
        'l.Id',
        'l.org_id',
        f.when(
            (f.col('l.Id') == f.col('r.org_id')) & (f.col('l.Index') > f.col('r.Index')),
            f.col('r.Id')
        ).otherwise(f.lit(None)).alias('group_id')
    )
df1.show()
#+-----------+----------+--------+---+------+--------+
#|      Index|      Date|    Type| Id|org_id|group_id|
#+-----------+----------+--------+---+------+--------+
#| 8589934592|25-03-2018|Purchase|111|  null|    null|
#|17179869184|25-03-2018|Purchase|112|  null|    null|
#|34359738368|26-03-2018|  Return|113|   111|    null|
#|42949672960|26-03-2018|Purchase|111|  null|     113|
#|42949672960|26-03-2018|Purchase|111|  null|    null|
#+-----------+----------+--------+---+------+--------+
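To see why the filter keeps exactly five rows, here is a small plain-Python sketch of the same logic, run over the four sample rows. It is only an illustration of the two conditions, not how Spark executes the join; the function name cross_join_filter and the dict-based rows are assumptions made for this example.

```python
from itertools import product

# The four sample rows, with the Index values shown earlier.
rows = [
    {"Index": 8589934592, "Type": "Purchase", "Id": 111, "org_id": None},
    {"Index": 17179869184, "Type": "Purchase", "Id": 112, "org_id": None},
    {"Index": 34359738368, "Type": "Return", "Id": 113, "org_id": 111},
    {"Index": 42949672960, "Type": "Purchase", "Id": 111, "org_id": None},
]


def cross_join_filter(rows):
    """Hypothetical helper: pair every row with every row, keep the matches."""
    kept = []
    for l, r in product(rows, rows):
        # Condition 1: a row joined to itself.
        same_row = l["Index"] == r["Index"]
        # Condition 2: l comes after r and l.Id equals r.org_id.
        # A null org_id never matches, as in SQL.
        follows_return = (
            r["org_id"] is not None
            and l["Id"] == r["org_id"]
            and l["Index"] > r["Index"]
        )
        if same_row or follows_return:
            kept.append(dict(l, group_id=r["Id"] if follows_return else None))
    return kept


joined = cross_join_filter(rows)
for row in joined:
    print(row["Index"], row["Type"], row["Id"], row["org_id"], row["group_id"])
```

Only the last purchase matches the second condition, so it appears twice: once paired with the return (group_id 113) and once paired with itself (group_id None).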

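We are almost there, but as you can see, there are still two things to do.

We need to eliminate the duplicate row for Index = 42949672960.
We need to fill in the rows where group_id is null with the value from Id.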

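For the first step, we will use a Window function to create a temporary column called rowNum. This will be the pyspark.sql.functions.row_number() for each Index, ordered by the boolean condition group_id IS NULL.

For Index values that have multiple rows, the row where group_id has already been set sorts first. So we only need to select the rows where rowNum equals 1 (row_number() starts at 1, not 0).

Once that is done, the second step is simple: just replace the remaining null values with the value from Id.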

from pyspark.sql import Window
w = Window.partitionBy(f.col('Index')).orderBy(f.isnull('group_id'))
df2 = df1.withColumn('rowNum', f.row_number().over(w))\
    .where(f.col('rowNum')==1)\
    .sort('Index')\
    .select(
        'Date',
        'Type',
        'Id',
        'org_id',
        f.when(
            f.isnull('group_id'),
            f.col('Id')
        ).otherwise(f.col('group_id')).alias('group_id')
    )
df2.show()
#+----------+--------+---+------+--------+
#|      Date|    Type| Id|org_id|group_id|
#+----------+--------+---+------+--------+
#|25-03-2018|Purchase|111|  null|     111|
#|25-03-2018|Purchase|112|  null|     112|
#|26-03-2018|  Return|113|   111|     113|
#|26-03-2018|Purchase|111|  null|     113|
#+----------+--------+---+------+--------+
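The two cleanup steps can also be traced by hand. The following plain-Python sketch mimics them on the five joined rows: per Index it prefers the row whose group_id is set (the equivalent of rowNum equal to 1), then falls back to Id where group_id is still null. The function name dedupe_and_fill and the trimmed-down row dicts are assumptions for this illustration only.

```python
# The five rows produced by the filtered cross join (columns trimmed for brevity).
joined = [
    {"Index": 8589934592, "Type": "Purchase", "Id": 111, "group_id": None},
    {"Index": 17179869184, "Type": "Purchase", "Id": 112, "group_id": None},
    {"Index": 34359738368, "Type": "Return", "Id": 113, "group_id": None},
    {"Index": 42949672960, "Type": "Purchase", "Id": 111, "group_id": 113},
    {"Index": 42949672960, "Type": "Purchase", "Id": 111, "group_id": None},
]


def dedupe_and_fill(joined):
    """Hypothetical helper mirroring the row_number() and when/otherwise steps."""
    best = {}
    for row in joined:
        current = best.get(row["Index"])
        # Step 1: within each Index, a row with group_id set wins over a null one.
        if current is None or (
            current["group_id"] is None and row["group_id"] is not None
        ):
            best[row["Index"]] = row

    result = []
    for index in sorted(best):  # same effect as .sort('Index')
        row = best[index]
        # Step 2: fall back to Id where group_id is still null.
        group_id = row["group_id"] if row["group_id"] is not None else row["Id"]
        result.append(dict(row, group_id=group_id))
    return result


result = dedupe_and_fill(joined)
for row in result:
    print(row["Type"], row["Id"], row["group_id"])
```

The return (113) and the purchase that follows it end up sharing group_id 113, while the two earlier purchases keep their own Id.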
