I have the following dataset:
import pyspark.sql.functions as f
df2 = df.withColumn('len', f.substring('Length', 15, 10))
df2.show(10, False)
+----------------------------------------------------+----+----------+
|Length |ID |len |
+----------------------------------------------------+----+----------+
|+++++++++++++++++++++++++XXXXX++++++++++++++XXXXXXXX|1.0 |++++++++++|
|XXXXXX++++++++++++XXXXXX+++++++++++++++XXXXXXXXXXXXX|2.0 |++++XXXXXX|
|++++++++++++++++++XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|3.0 |++++XXXXXX|
|XXXXXXXXXXXXXX++++++++++++++++++++XXXXXXXXXXXXXXXXXX|4.0 |++++++++++|
|+++++++++++++++++++++++++XXXXXXXXXXXXXXXXXXXXXXXXXXX|5.0 |++++++++++|
|+++++++++++++++++++++++++XXXXX++++++++++++++XXXXXXXX|6.0 |++++++++++|
|XXXXXX++++++++++++XXXXXX+++++++++++++++XXXXXXXXXXXXX|7.0 |++++XXXXXX|
|++++++++++++++++++XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|8.0 |++++XXXXXX|
|XXXXXXXXXXXXXX++++++++++++++++++++XXXXXXXXXXXXXXXXXX|9.0 |++++++++++|
|+++++++++++++++++++++++++XXXXXXXXXXXXXXXXXXXXXXXXXXX|10.0|++++++++++|
+----------------------------------------------------+----+----------+
df2.filter("len = 'XXXXXXXXXX'").show(10, False)
+------+---+---+
|Length|ID |len|
+------+---+---+
+------+---+---+
I want to combine the x, y and z rows into another DataFrame. The data looks like this:
A B C D E F
154.6175111 148.0112337 155.7859835 1 1 x
255 253.960131 242.5382584 1 1 x
251.9665958 235.1105659 185.9121703 1 1 x
137.9974994 225.3985177 254.4420772 1 1 x
85.74722877 116.7060415 158.4608395 1 1 x
123.6969939 140.0524405 132.6798037 1 1 x
133.3251695 80.08976196 38.81201612 1 1 y
118.0718812 243.5927927 255 1 1 y
189.5557302 139.9046713 91.90519519 1 1 y
172.3117291 188.000268 129.8155501 1 1 y
48.07634611 21.9183119 25.99669279 1 1 y
23.40525987 8.395857933 25.62371342 1 1 y
228.753009 164.0697727 172.6624107 1 1 z
203.3405006 173.9368303 189.8103708 1 1 z
184.9801932 117.1591341 87.94739034 1 1 z
29.55251224 46.03945452 70.7433477 1 1 z
143.6159623 120.6170926 155.0736604 1 1 z
142.5421179 128.8916843 169.6013111 1 1 z
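I want one DataFrame per position within the x, y and z groups: one holding the first x, y and z rows, one holding the second, one holding the third, and so on.
How can I select and combine them?
Desired output: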
A B C D E F
154.6175111 148.0112337 155.7859835 1 1 x ->first x value
133.3251695 80.08976196 38.81201612 1 1 y ->first y value
228.753009 164.0697727 172.6624107 1 1 z ->first z value
Answer 0 (score: 1)
Use GroupBy.cumcount as a counter, then loop over a second groupby object built from that counter:
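# Number each row within its F group: 0 for the first x/y/z row, 1 for the second, ...
counter = df.groupby('F').cumcount()
# Grouping by that counter puts the i-th row of every F value together
for i, group in df.groupby(counter):
    print(group)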
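If the pieces are needed as separate DataFrames rather than printed, one option is to collect them in a dict keyed by position. This is a minimal sketch, assuming a small sample built from the first two x, y and z rows of the question; the names `counter`, `parts` and `first` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data: the first two x, y and z rows from the question
df = pd.DataFrame({
    "A": [154.6175111, 255.0, 133.3251695, 118.0718812, 228.753009, 203.3405006],
    "B": [148.0112337, 253.960131, 80.08976196, 243.5927927, 164.0697727, 173.9368303],
    "C": [155.7859835, 242.5382584, 38.81201612, 255.0, 172.6624107, 189.8103708],
    "D": [1, 1, 1, 1, 1, 1],
    "E": [1, 1, 1, 1, 1, 1],
    "F": ["x", "x", "y", "y", "z", "z"],
})

# Position of each row within its F group (0 = first, 1 = second, ...)
counter = df.groupby("F").cumcount()

# One DataFrame per position: parts[0] holds the first x, y and z rows
parts = {position: part for position, part in df.groupby(counter)}

first = parts[0]
print(first)
```

Here `parts[0]` corresponds to the desired output in the question, and `parts[1]` holds the second x, y and z rows.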