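I have a DataFrame named stores_df that contains store information such as dates and sales. I have another DataFrame named avg_sales_store_by_month that contains the average sales for each store in each month. I want to take the average-sales column from it and append it to stores_df. The problem is that after the join, the order of stores_df has changed.
Here are the first few rows of stores_df.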
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
|Store| Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price| CPI|Unemployment|Month|Year|Day|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
| 1|2010-02-05| FALSE| 1| 24924| 42.31| 2.572|211.0963582| 8.106| 2|2010| 5|
| 1|2010-02-12| TRUE| 1| 46039| 38.51| 2.548|211.2421698| 8.106| 2|2010| 12|
| 1|2010-02-19| FALSE| 1| 41595| 39.93| 2.514|211.2891429| 8.106| 2|2010| 19|
| 1|2010-05-14| FALSE| 1| 18926| 74.78| 2.854|210.3374261| 7.808| 5|2010| 14|
+-----+----------+---------+----+------------+-----------+----------+-----------+------------+-----+----+---+
Below are the first few rows of avg_sales_store_by_month. I want to take the last column and append it to the end of stores_df.
+-----+-----+------------------+
|Store|Month|avg_sales_by_month|
+-----+-----+------------------+
| 39| 11| 23317.75|
| 43| 7| 13090.84|
| 10| 2| 28407.05|
| 23| 6| 21265.7|
| 4| 10| 28723.2|
| 9| 10| 8468.2|
+-----+-----+------------------+
My problem is that when I run my join:
stores_df = stores_df.join( avg_sales_store_by_month, Seq("Store", "Month"), "left" )
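the rows of stores_df are reordered. I want them to stay in the same order as before the join, but with the additional column. How can I achieve this?
Here is a snippet of the result after the join; the order is messed up.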
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
|Store|Month| Date|IsHoliday|Dept|Weekly_Sales|Temperature|Fuel_Price| CPI|Unemployment|Year|Day|avg_sales_by_month|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
| 39| 11|2010-11-05| FALSE| 1| 31729| 61.62| 2.689|210.7202444| 8.476|2010| 5| 23317.75|
| 39| 11|2010-11-12| FALSE| 1| 12324| 62.21| 2.728|210.7667944| 8.476|2010| 12| 23317.75|
| 39| 11|2010-11-19| FALSE| 1| 15137| 55.5| 2.771| 210.65429| 8.476|2010| 19| 23317.75|
| 39| 11|2011-11-11| FALSE| 2| 65758| 63.11| 3.297|216.7217373| 7.716|2011| 11| 23317.75|
| 39| 11|2011-11-18| FALSE| 2| 70050| 66.09| 3.308|216.9395861| 7.716|2011| 18| 23317.75|
+-----+-----+----------+---------+----+------------+-----------+----------+-----------+------------+----+---+------------------+
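Answer 0 (score: 1)
If you want to preserve the original column order, you can store the first DataFrame's columns together with the additional column in an array, and then select them after the join, as in the following example: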
val df1 = Seq(
(1, 25000, 3, 2010, 20),
(1, 30000, 3, 2010, 27),
(1, 20000, 4, 2010, 3),
(2, 40000, 3, 2010, 20),
(2, 35000, 3, 2010, 27),
(2, 35000, 4, 2010, 3)
).toDF("Store", "Wk_Sales", "Month", "year", "Day")
val df2 = Seq(
(1, 3, 100000),
(1, 4, 90000),
(2, 3, 140000),
(2, 4, 110000)
).toDF("Store", "Month", "Mo_Sales")
val joinedDF = df1.join(df2, Seq("Store", "Month"), "left")
// +-----+-----+--------+----+---+--------+
// |Store|Month|Wk_Sales|year|Day|Mo_Sales|
// +-----+-----+--------+----+---+--------+
// | 1| 3| 25000|2010| 20| 100000|
// | 1| 3| 30000|2010| 27| 100000|
// | 1| 4| 20000|2010| 3| 90000|
// | 2| 3| 40000|2010| 20| 140000|
// | 2| 3| 35000|2010| 27| 140000|
// | 2| 4| 35000|2010| 3| 110000|
// +-----+-----+--------+----+---+--------+
// Original column names of df1, followed by the column added by the join
val cols = df1.columns :+ "Mo_Sales"
joinedDF.select(cols.head, cols.tail: _*).show()
// +-----+--------+-----+----+---+--------+
// |Store|Wk_Sales|Month|year|Day|Mo_Sales|
// +-----+--------+-----+----+---+--------+
// | 1| 25000| 3|2010| 20| 100000|
// | 1| 30000| 3|2010| 27| 100000|
// | 1| 20000| 4|2010| 3| 90000|
// | 2| 40000| 3|2010| 20| 140000|
// | 2| 35000| 3|2010| 27| 140000|
// | 2| 35000| 4|2010| 3| 110000|
// +-----+--------+-----+----+---+--------+
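Note that the select above only restores the column order. A join does not guarantee any particular row order, so if the row order matters as well, one common approach is to tag each row with an increasing id before the join and sort by it afterwards. The following is a minimal sketch of that idea, not a definitive solution. It reuses the df1 and df2 sample data from above; the column name "row_id" and the local SparkSession setup are assumptions made for illustration. The ids produced by monotonically_increasing_id are increasing and unique but not consecutive, and they reflect whatever order the DataFrame has when the column is added.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("keep-row-order").getOrCreate()
import spark.implicits._

val df1 = Seq(
  (1, 25000, 3, 2010, 20),
  (1, 30000, 3, 2010, 27),
  (1, 20000, 4, 2010, 3),
  (2, 40000, 3, 2010, 20),
  (2, 35000, 3, 2010, 27),
  (2, 35000, 4, 2010, 3)
).toDF("Store", "Wk_Sales", "Month", "year", "Day")

val df2 = Seq(
  (1, 3, 100000),
  (1, 4, 90000),
  (2, 3, 140000),
  (2, 4, 110000)
).toDF("Store", "Month", "Mo_Sales")

// Tag every row of df1 with an increasing id ("row_id" is a hypothetical helper column).
val indexed = df1.withColumn("row_id", monotonically_increasing_id())

// Original column order of df1 plus the column coming from df2.
val cols = df1.columns :+ "Mo_Sales"

val result = indexed
  .join(df2, Seq("Store", "Month"), "left")
  .orderBy("row_id")                  // restore the pre-join row order
  .select(cols.head, cols.tail: _*)   // restore the column order and drop row_id

result.show()
```

The sort adds a shuffle, so on a large DataFrame it is worth doing only when the final row order really matters, for example just before displaying or writing the result.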