I have a DataFrame like the one below, and I want to create a new column with the total number of steps. You can see that ID 1 has 5 rows of steps.
+----+--------------------------------------------------------+
| ID | Steps |
+----+--------------------------------------------------------+
| 1 | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> |
| 2 | <DIV><P>Another step</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
| | <DIV><P>Something</P></DIV> |
+----+--------------------------------------------------------+
I want to count the total number of steps for each ID using "DIV", and add a new column holding that total.
+----+--------------------------------------------------------+-------------+
| ID | Steps | Total_Steps |
+----+--------------------------------------------------------+-------------+
| 1 | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | 10 |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| | <DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV> | |
| 2 | <DIV><P>Another step</P></DIV> | 5 |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| | <DIV><P>Something</P></DIV> | |
| 3 | <DIV><P>Just a step</P></DIV> | 4 |
| | <DIV><P>Just a step</P></DIV> | |
| | <DIV><P>Just a step</P></DIV> | |
| | <DIV><P>Just a step</P></DIV> | |
+----+--------------------------------------------------------+-------------+
Answer 0 (score: 1)
Use Series.str.count together with GroupBy.transform and sum:
df['Total_Steps'] = df['Steps'].str.count('<DIV>').groupby(df['ID'].ffill()).transform('sum')
print (df)
ID Steps Total_Steps
0 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
1 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
2 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
3 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
4 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10
5 2 <DIV><P>Another step</P></DIV> 5
6 2 <DIV><P>Something</P></DIV> 5
7 2 <DIV><P>Something</P></DIV> 5
8 2 <DIV><P>Something</P></DIV> 5
9 2 <DIV><P>Something</P></DIV> 5
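To try this end to end, here is a minimal, self-contained sketch. The DataFrame is a hypothetical reconstruction of the question's table, assuming the blank ID cells are stored as NaN; the names `two_steps`, `per_row`, and `group_ids` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's table; blank ID cells become NaN.
two_steps = "<DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV>"
df = pd.DataFrame({
    "ID": [1] + [np.nan] * 4 + [2] + [np.nan] * 4,
    "Steps": [two_steps] * 5
             + ["<DIV><P>Another step</P></DIV>"]
             + ["<DIV><P>Something</P></DIV>"] * 4,
})

# Count "<DIV>" occurrences in each row (the pattern has no regex metacharacters).
per_row = df["Steps"].str.count("<DIV>")

# Forward-fill the IDs so every row belongs to a group,
# then broadcast each group's sum back onto its rows.
group_ids = df["ID"].ffill()
df["Total_Steps"] = per_row.groupby(group_ids).transform("sum")

print(df[["ID", "Total_Steps"]])
```

Because `transform` returns a result aligned with the original index, the assignment works without a merge.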
If you only need the value on the first row of each group, add numpy.where with Series.duplicated:
import numpy as np

ids = df['ID'].ffill()  # fill blank IDs first so duplicated() sees the repeats
s = df['Steps'].str.count('<DIV>').groupby(ids).transform('sum')
df['Total_Steps'] = np.where(ids.duplicated(), np.nan, s)
# mixed values are also possible (numbers with empty strings), but then some functions may fail
#df['Total_Steps'] = np.where(ids.duplicated(), '', s)
print (df)
ID Steps Total_Steps
0 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... 10.0
1 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
2 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
3 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
4 1 <DIV><P>Another step</P></DIV><DIV><P>A step</... NaN
5 2 <DIV><P>Another step</P></DIV> 5.0
6 2 <DIV><P>Something</P></DIV> NaN
7 2 <DIV><P>Something</P></DIV> NaN
8 2 <DIV><P>Something</P></DIV> NaN
9 2 <DIV><P>Something</P></DIV> NaN
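Filling the repeated rows with NaN forces the column to float, which is why the totals print as 10.0 and 5.0. If integer totals are preferred, one possible variation (a sketch, not part of the answer above) is pandas' nullable `Int64` dtype combined with `Series.mask`. The small DataFrame and the name `is_repeat` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data with the IDs already filled in on every row.
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    "Steps": ["<DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV>"] * 5
             + ["<DIV><P>Something</P></DIV>"] * 5,
})

# Per-ID totals, broadcast to every row.
s = df["Steps"].str.count("<DIV>").groupby(df["ID"]).transform("sum")

# True for every row after the first occurrence of each ID.
is_repeat = df["ID"].duplicated()

# Convert to nullable integers, then blank out the repeated rows with <NA>.
df["Total_Steps"] = s.astype("Int64").mask(is_repeat)

print(df[["ID", "Total_Steps"]])
```

The column keeps integer values on the first row of each group and shows `<NA>` elsewhere, so numeric operations still work without the mixed-type problem of empty strings.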
Answer 1 (score: 0)
Why not this:
df['Total_Steps'] = df['Steps'].str.count('<DIV><P>').groupby(df['ID'].ffill()).transform('sum')
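When choosing between the string methods, note that `Series.str.contains` and `Series.str.count` answer different questions: the first yields one boolean per row, while the second yields the number of matches inside each row. Pattern matching is also case-sensitive by default. The sketch below illustrates this on a small hypothetical Series; the variable names are illustrative.

```python
import pandas as pd

steps = pd.Series([
    "<DIV><P>Another step</P></DIV><DIV><P>A step</P></DIV>",
    "<DIV><P>Something</P></DIV>",
])

# str.contains gives one True/False per row; Series.count() then counts
# the non-null rows, not the matches inside them.
rows_checked = steps.str.contains("<DIV><P>").count()

# str.count gives the number of matches found in each row.
matches_per_row = steps.str.count("<DIV><P>")

# Matching is case-sensitive by default, so "<Div>" does not match "<DIV>".
case_mismatch = steps.str.contains("<Div><P>").any()

print(rows_checked)
print(matches_per_row.tolist())
print(case_mismatch)
```

Here `rows_checked` reflects only how many rows were tested, so it cannot serve as a step total; summing `matches_per_row` per ID is what produces the per-group figure.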