I have preprocessed data that looks like this:

+------+------+--------------------+
|    id|   ref|                text|
+------+------+--------------------+
| 8309| 3129|3 MO F/U HAIR LOS...|
| 8309| 3129| 4 MO SKIN CK|
| 8309| 3129| 4 MO F/U LM AG|
| 8309| 3129|HAIR LOSS AND SPO...|
| 8309| 3129| 2 MO F/U CONF KC|
| 8309| 3129|SSR AND DISCUSS H...|
| 4569| 1101|F/U LM TO CONFIRM...|
| 4569| 1101|F/U (LF) LM TO CO...|
| 4569| 1101| FU CONFIRMED|
| 4569| 1101|F/U MRI RESULTS ...|
| 4569| 1101|F/U AFTER MRI JC ...|
| 4569| 1101| FU|
| 4569| 1101|F/U AND NEW PROBL...|
| 4569| 1101| F/U|
| 4569| 1101| FU CONFIRMED|
| 4569| 1101|REVIEW MRI ...|
| 4569| 1101|REVIEW MRI RESULT...|
+------+------+--------------------+
I want to transform this DataFrame so that it looks like this:

+--------+------+--------------------+
|      id|   ref|                text|
+--------+------+--------------------+
| 8309 | 3129|3 MO F/U HAIR LOS...|
| 8309_1| 3129| 4 MO SKIN CK|
| 8309_2| 3129| 4 MO F/U LM AG|
| 8309_3| 3129|HAIR LOSS AND SPO...|
| 8309_4| 3129| 2 MO F/U CONF KC|
| 8309_5| 3129|SSR AND DISCUSS H...|
| 4569 | 1101|F/U LM TO CONFIRM...|
| 4569_1| 1101|F/U (LF) LM TO CO...|
| 4569_2| 1101| FU CONFIRMED|
| 4569_3| 1101|F/U MRI RESULTS ...|
+--------+------+--------------------+
I just want to make the duplicated IDs unique by appending a number; it does not have to be incremental.
Answer 0 (score: 1)
Use GroupBy.cumcount to build a per-group counter:
df['id'] = (df['id'].astype(str)
              .add(df.groupby('id')
                     .cumcount()           # 0, 1, 2, ... within each id group
                     .astype(str)
                     .radd('_')            # '_0', '_1', '_2', ...
                     .replace('_0', '')))  # first occurrence keeps the bare id

print(df)
id ref text
0 8309 3129 3 MO F/U HAIR LOS...
1 8309_1 3129 4 MO SKIN CK
2 8309_2 3129 4 MO F/U LM AG
3 8309_3 3129 HAIR LOSS AND SPO...
4 8309_4 3129 2 MO F/U CONF KC
5 8309_5 3129 SSR AND DISCUSS H...
6 4569 1101 F/U LM TO CONFIRM...
7 4569_1 1101 F/U (LF) LM TO CO...
8 4569_2 1101 FU CONFIRMED
9 4569_3 1101 F/U MRI RESULTS ...
10 4569_4 1101 F/U AFTER MRI JC ...
11 4569_5 1101 FU
12 4569_6 1101 F/U AND NEW PROBL...
13 4569_7 1101 F/U
14 4569_8 1101 FU CONFIRMED
15 4569_9 1101 REVIEW MRI...
16 4569_10 1101 REVIEW MRI RESULT...
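As a variant of the same cumcount idea (not from the original answer, just a sketch assuming the original df from the question), the counter can be kept numeric and the suffix applied only past the first occurrence with numpy.where:

import numpy as np

counter = df.groupby('id').cumcount()            # 0, 1, 2, ... per id group
df['id'] = np.where(counter.eq(0),
                    df['id'].astype(str),                              # first occurrence: bare id
                    df['id'].astype(str) + '_' + counter.astype(str))  # later ones: id_1, id_2, ...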
Answer 1 (score: 1)

You can use a combination of the row_number(), lag(), when and window functions to get the desired result:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
def windowSpec = Window.partitionBy("id").orderBy("ref")
df.withColumn("rank", lag(row_number().over(windowSpec), 1).over(windowSpec))
.withColumn("id", when($"rank".isNotNull, concat_ws("_", $"id", $"rank")).otherwise($"id"))
.drop("rank")
.show(false)
which should give you the final dataframe:
+-------+----+--------------------+
|id |ref |text |
+-------+----+--------------------+
|4569 |1101|F/U LM TO CONFIRM...|
|4569_1 |1101|F/U (LF) LM TO CO...|
|4569_2 |1101| FU CONFIRMED|
|4569_3 |1101|F/U MRI RESULTS ...|
|4569_4 |1101|F/U AFTER MRI JC ...|
|4569_5 |1101| FU|
|4569_6 |1101|F/U AND NEW PROBL...|
|4569_7 |1101| F/U|
|4569_8 |1101| FU CONFIRMED|
|4569_9 |1101|REVIEW MRI ...|
|4569_10|1101|REVIEW MRI RESULT...|
|8309 |3129|3 MO F/U HAIR LOS...|
|8309_1 |3129| 4 MO SKIN CK|
|8309_2 |3129| 4 MO F/U LM AG|
|8309_3 |3129|HAIR LOSS AND SPO...|
|8309_4 |3129| 2 MO F/U CONF KC|
|8309_5 |3129|SSR AND DISCUSS H...|
+-------+----+--------------------+
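For reference, a rough PySpark sketch of the same window approach (not part of the original answer; it uses row_number() directly instead of lag() over row_number(), and assumes the question's DataFrame is named df — note that ref is constant within each id, so the row order inside a group is arbitrary):

from pyspark.sql import Window, functions as F

w = Window.partitionBy("id").orderBy("ref")
seq = F.row_number().over(w) - 1                 # 0 for the first row of each id group

result = df.withColumn(
    "id",
    F.when(seq == 0, F.col("id"))                                   # first occurrence keeps the bare id
     .otherwise(F.concat_ws("_", F.col("id"), seq.cast("string")))  # later ones become id_1, id_2, ...
)
result.show(truncate=False)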