For example, I have measurements taken on a production line:
import numpy as np
import pandas as pd

np.random.seed(2)  # seed NumPy's generator, since the values come from np.random.choice
df = pd.DataFrame()
df['col1_date'] = pd.date_range('2018-09-01', periods=40, freq='D')
df['col2_id'] = ['A', 'B', 'C', 'D'] * 10
df['MeasuredValues'] = np.random.choice(10, 40)
df
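Basically, items such as ['A', 'B', 'C', 'D'] have certain parameters measured as they move down the line. I am trying to add a column to the table that numbers the measurements for each item in col2_id: the first measurement of an item gets 1st measurement, the second gets 2nd measurement, and so on. I could do this by hand in Excel by sorting on col2_id, then on col1_date, and building 100 small tables. Obviously, that makes little sense for thousands of rows. With that column in place I could compare the different measurements across all items in col2_id. I don't really know how to do this in pandas or Python.
Can anyone give me some advice?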
Answer 0 (score: 3)
If the dates are already sorted, you can use cumcount:
df['measurement_number'] = 'measurement ' + (df.groupby('col2_id').cumcount() + 1).astype(str)
Or:
df['measurement_number'] = (df.groupby('col2_id').cumcount() + 1).map(lambda x: f'measurement {x}')
If your dates are not sorted, sort the frame by col1_date first (for example with sort_values), then apply either line as above.
Both give you:
>>> df
     col1_date  col2_id  MeasuredValues  measurement_number
0   2018-09-01        A               8       measurement 1
1   2018-09-02        B               8       measurement 1
2   2018-09-03        C               6       measurement 1
3   2018-09-04        D               2       measurement 1
4   2018-09-05        A               8       measurement 2
5   2018-09-06        B               7       measurement 2
6   2018-09-07        C               2       measurement 2
7   2018-09-08        D               1       measurement 2
8   2018-09-09        A               5       measurement 3
9   2018-09-10        B               4       measurement 3
10  2018-09-11        C               4       measurement 3
11  2018-09-12        D               5       measurement 3
12  2018-09-13        A               7       measurement 4
13  2018-09-14        B               3       measurement 4
14  2018-09-15        C               6       measurement 4
15  2018-09-16        D               4       measurement 4
16  2018-09-17        A               3       measurement 5
17  2018-09-18        B               7       measurement 5
18  2018-09-19        C               6       measurement 5
19  2018-09-20        D               1       measurement 5
20  2018-09-21        A               3       measurement 6
21  2018-09-22        B               5       measurement 6
22  2018-09-23        C               8       measurement 6
23  2018-09-24        D               4       measurement 6
24  2018-09-25        A               6       measurement 7
25  2018-09-26        B               3       measurement 7
26  2018-09-27        C               9       measurement 7
27  2018-09-28        D               2       measurement 7
28  2018-09-29        A               0       measurement 8
29  2018-09-30        B               4       measurement 8
30  2018-10-01        C               2       measurement 8
31  2018-10-02        D               4       measurement 8
32  2018-10-03        A               1       measurement 9
33  2018-10-04        B               7       measurement 9
34  2018-10-05        C               8       measurement 9
35  2018-10-06        D               2       measurement 9
36  2018-10-07        A               9      measurement 10
37  2018-10-08        B               8      measurement 10
38  2018-10-09        C               7      measurement 10
39  2018-10-10        D               1      measurement 10
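As a minimal sketch of the unsorted case, the example below sorts by date before numbering and then reshapes the result so the items can be compared side by side. The small frame, its values, and the helper column n are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical frame with the rows deliberately out of date order.
df = pd.DataFrame({
    'col1_date': pd.to_datetime(
        ['2018-09-05', '2018-09-01', '2018-09-02', '2018-09-06']),
    'col2_id': ['A', 'A', 'B', 'B'],
    'MeasuredValues': [8, 3, 7, 5],
})

# Sort by date so that cumcount numbers each item's rows chronologically.
df = df.sort_values('col1_date').reset_index(drop=True)

# cumcount is zero-based within each group, so add 1.
df['n'] = df.groupby('col2_id').cumcount() + 1
df['measurement_number'] = 'measurement ' + df['n'].astype(str)

# One row per measurement number and one column per item,
# which makes it easy to compare the items against each other.
wide = df.pivot(index='n', columns='col2_id', values='MeasuredValues')
print(wide)
```

Keeping the integer counter n next to the label is a design choice: string labels sort lexically ('measurement 10' before 'measurement 2'), while the integer column sorts as expected.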
Answer 1 (score: 2)
groupby and rank
df.assign(
    meas_num=
        df.groupby('col2_id').col1_date.rank().apply('{:.0f}'.format).radd('meas ')
)
     col1_date  col2_id  MeasuredValues  meas_num
0   2018-09-01        A               1    meas 1
1   2018-09-02        B               3    meas 1
2   2018-09-03        C               9    meas 1
3   2018-09-04        D               7    meas 1
4   2018-09-05        A               3    meas 2
5   2018-09-06        B               5    meas 2
6   2018-09-07        C               5    meas 2
7   2018-09-08        D               2    meas 2
8   2018-09-09        A               4    meas 3
9   2018-09-10        B               0    meas 3
10  2018-09-11        C               2    meas 3
11  2018-09-12        D               0    meas 3
12  2018-09-13        A               8    meas 4
13  2018-09-14        B               5    meas 4
14  2018-09-15        C               2    meas 4
15  2018-09-16        D               0    meas 4
16  2018-09-17        A               7    meas 5
17  2018-09-18        B               1    meas 5
18  2018-09-19        C               3    meas 5
19  2018-09-20        D               7    meas 5
20  2018-09-21        A               9    meas 6
21  2018-09-22        B               5    meas 6
22  2018-09-23        C               7    meas 6
23  2018-09-24        D               7    meas 6
24  2018-09-25        A               9    meas 7
25  2018-09-26        B               3    meas 7
26  2018-09-27        C               2    meas 7
27  2018-09-28        D               9    meas 7
28  2018-09-29        A               1    meas 8
29  2018-09-30        B               8    meas 8
30  2018-10-01        C               3    meas 8
31  2018-10-02        D               8    meas 8
32  2018-10-03        A               6    meas 9
33  2018-10-04        B               4    meas 9
34  2018-10-05        C               1    meas 9
35  2018-10-06        D               2    meas 9
36  2018-10-07        A               5   meas 10
37  2018-10-08        B               9   meas 10
38  2018-10-09        C               9   meas 10
39  2018-10-10        D               1   meas 10
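One thing to watch with rank: by default it uses method='average', so two rows of the same item with the same date share a fractional rank. The sketch below, on a hypothetical frame with one duplicated date, shows the difference when passing method='first'. This variation is an assumption about what you might want and is not part of the answer above.

```python
import pandas as pd

# Hypothetical frame in which item A has two rows on the same date (a tie).
df = pd.DataFrame({
    'col1_date': pd.to_datetime(
        ['2018-09-01', '2018-09-01', '2018-09-05', '2018-09-02']),
    'col2_id': ['A', 'A', 'A', 'B'],
})

by_item = df.groupby('col2_id')['col1_date']

# Default method='average': tied rows share the mean of their ranks.
avg_rank = by_item.rank()

# method='first': ties are broken by order of appearance,
# so each item gets consecutive whole numbers.
first_rank = by_item.rank(method='first').astype(int)

df['meas_num'] = 'meas ' + first_rank.astype(str)
print(df.assign(avg_rank=avg_rank))
```

Unlike cumcount, rank does not need the frame to be sorted beforehand, because it orders by the date values themselves.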