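I can't reproduce this problem with a simple test set; it only happens on my dataset, so I can only describe the situation.
df
has many different (store_id, product_id) groups, and each group has many rows.
df1
has many different (store_id, product_id) groups, and each group contains only one row.
df is the order history table, from which I need the historical prices; the current prices come from df1. I union the two to build a complete price-change timeline.
But the strange thing is: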
from pyspark.sql.functions import col, lit
sid = '00fbb2a6-f2de-42f1-a07b-163e3a050ddb'
pid = '66e06f08-dec2-498d-883f-24771da18358'
filtersp = lambda df: df.filter(col('store_id')==sid).filter(col('product_id')==pid)
filtersp(df).show()
+----------------+--------+----------+-----------+---+
|store_product_id|store_id|product_id|price_guide| ds|
+----------------+--------+----------+-----------+---+
+----------------+--------+----------+-----------+---+
filtersp(df1).show()
+----------------+----------+--------+-----------+---+
|store_product_id|product_id|store_id|price_guide| ds|
+----------------+----------+--------+-----------+---+
+----------------+----------+--------+-----------+---+
filtersp(df1).union(filtersp(df)).show()
+----------------+----------+--------+-----------+---+
|store_product_id|product_id|store_id|price_guide| ds|
+----------------+----------+--------+-----------+---+
+----------------+----------+--------+-----------+---+
filtersp(df1.union(df)).show()
+----------------+----------+--------+-----------+---+
|store_product_id|product_id|store_id|price_guide| ds|
+----------------+----------+--------+-----------+---+
+----------------+----------+--------+-----------+---+
filtersp(df.union(df1)).show()
+--------------------+--------------------+--------------------+-----------+-------------------+
|    store_product_id|            store_id|          product_id|price_guide|                 ds|
+--------------------+--------------------+--------------------+-----------+-------------------+
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|
+--------------------+--------------------+--------------------+-----------+-------------------+
Then I added a new column to track where these rows come from:
df = df.withColumn('c', lit('df'))
df1 = df1.withColumn('c', lit('df1'))
filtersp(df.union(df1)).show()
+--------------------+--------------------+--------------------+-----------+-------------------+---+
|    store_product_id|            store_id|          product_id|price_guide|                 ds|  c|
+--------------------+--------------------+--------------------+-----------+-------------------+---+
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
|996864cf-8432-43d...|00fbb2a6-f2de-42f...|66e06f08-dec2-498...|        480|2019-08-06 09:00:00|df1|
+--------------------+--------------------+--------------------+-----------+-------------------+---+
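So the rows come from df1.
I don't understand under what circumstances filtersp(df.union(df1)).show()
could return rows when both filtered inputs are empty; it should be impossible.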
Answer 0 (score: 0)
I could kick myself. I had already found this answer, https://stackoverflow.com/a/55310670/1637673:
def unionByName(other: Dataset[T]): Dataset[T]
The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
df1.union(df2).show
// output:
// +----+----+----+
// |col0|col1|col2|
// +----+----+----+
// |   1|   2|   3|
// |   4|   5|   6|
// +----+----+----+
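To make the difference concrete without a Spark session, here is a minimal plain-Python sketch of the two semantics. The (columns, rows) "frame" and the helpers `union_by_position` and `union_by_name` are hypothetical stand-ins for illustration, not Spark APIs; in PySpark the corresponding calls are `DataFrame.union` and `DataFrame.unionByName`.

```python
# Plain-Python model of the two union semantics; no Spark required.
# A "frame" here is just (columns, rows), a hypothetical stand-in for a DataFrame.

def union_by_position(left, right):
    """Mimic DataFrame.union: keep left's column names, append right's rows as-is."""
    left_cols, left_rows = left
    _, right_rows = right
    return left_cols, left_rows + right_rows


def union_by_name(left, right):
    """Mimic DataFrame.unionByName: reorder right's values to match left's column names."""
    left_cols, left_rows = left
    right_cols, right_rows = right
    index = [right_cols.index(name) for name in left_cols]
    reordered = [tuple(row[i] for i in index) for row in right_rows]
    return left_cols, left_rows + reordered


# Two tiny frames whose columns have the same names in a different order.
df = (["store_id", "product_id"], [("s1", "p1")])
df1 = (["product_id", "store_id"], [("p2", "s2")])

by_position = union_by_position(df, df1)
by_name = union_by_name(df, df1)

print(by_position)  # df1's product_id value "p2" lands under store_id
print(by_name)      # df1's values line up with their column names
```

With the positional version, a filter on store_id sees df1's product_id values, which is the kind of mix-up that makes a filter match rows only after the union.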
But I didn't think I had this problem. After some effort I finally found that the column order was different.
df is
+----------------+--------+----------+-----------+---+---+
|store_product_id|store_id|product_id|price_guide| ds| c|
+----------------+--------+----------+-----------+---+---+
+----------------+--------+----------+-----------+---+---+
df1 is
+----------------+----------+--------+-----------+---+---+
|store_product_id|product_id|store_id|price_guide| ds| c|
+----------------+----------+--------+-----------+---+---+
+----------------+----------+--------+-----------+---+---+
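product_id and store_id
are in different positions. Because union matches columns by position, df1's product_id values end up under store_id in df.union(df1) (and vice versa), which is why the filter only matches rows from df1 after the union.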
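A quick way to catch this before calling union is to compare the two column lists position by position. The sketch below uses plain Python lists copied from the two schemas shown above; the helper name `column_order_mismatches` is made up for illustration.

```python
def column_order_mismatches(left_cols, right_cols):
    """Return (position, left_name, right_name) for each position where the names differ."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(left_cols, right_cols))
        if a != b
    ]


# Column lists copied from the two schemas shown above.
df_cols = ["store_product_id", "store_id", "product_id", "price_guide", "ds", "c"]
df1_cols = ["store_product_id", "product_id", "store_id", "price_guide", "ds", "c"]

mismatches = column_order_mismatches(df_cols, df1_cols)
print(mismatches)
# → [(1, 'store_id', 'product_id'), (2, 'product_id', 'store_id')]

# In PySpark the lists would come from df.columns and df1.columns. Either
# df.unionByName(df1) or df.union(df1.select(df.columns)) aligns the columns
# before the rows are combined.
```

An empty result means the positional union is safe as far as column names go; it says nothing about column types.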