Counting rows between changes in a pandas row value?

Time: 2018-04-19 09:53:53

Tags: python pandas

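I am trying to count how many rows it takes for a row value, 'neg', to change from its default of 0 to 1, and to store that count in a new column named 'dsf' on the row where neg = 1. I tried the snippet below. I am not sure why, but it sets every 'dsf' value to 0.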

Why is this wrong?


full_data['dsf'] = 0
counter = 0
for i,r in full_data.iterrows():
    if r['neg'] == 0:
        counter+=1
        r['dsf'] = 0
    else:
        r['dsf'] = counter
        counter = 0
full_data

Current output:

    datehour            pft     rev         mgn        neg  dsf
0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0   0
1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0   0
2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1   0
3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0   0
4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1   0

Desired output:

    datehour            pft     rev         mgn         neg dsf
0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0   0
1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0   0
2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1   3
3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0   0
4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1   2

3 answers:

Answer 0 (score: 1)

You should initialize the counter outside the for loop. Here is an example:

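import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1]})

df['dsf'] = 0
counter = 1  # starts at 1 so the row holding the 1 is counted too

for i, j in df.iterrows():
    if j['neg'] == 0:
        counter += 1
    else:
        # write to df itself; j may be a copy of the row
        df.loc[i, 'dsf'] = counter
        counter = 1

df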

Output:

   neg dsf
0   0   0
1   0   0
2   1   3
3   0   0
4   1   2

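Note that this result matches your desired output exactly. If you only want to count the zero rows that precede each 1, initialize the counter to 0 outside the for loop and reset it to 0 (instead of 1) at the end of each run. The result should then look like this: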

   neg dsf
0   0   0
1   0   0
2   1   2
3   0   0
4   1   1
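
The answer does not show code for this zero-counting variant. A minimal sketch, assuming the same toy `neg` column and writing through `df.at` so that the DataFrame itself is updated, might look like this:

```python
import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1]})
df['dsf'] = 0
counter = 0  # number of zero rows seen since the last 1

for i, row in df.iterrows():
    if row['neg'] == 0:
        counter += 1
    else:
        df.at[i, 'dsf'] = counter  # write to df, not to the row object
        counter = 0

print(df['dsf'].tolist())
```

The only differences from the first variant are the starting value and the reset value of `counter`.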

Answer 1 (score: 0)

From the iterrows docs:

  

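You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.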

So in your case, the for loop does not modify the original DataFrame, because iterrows returns a copy of each row. For more details on views versus copies, see http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
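
To make the copy behaviour concrete, here is a small sketch. The two-row frame and the value 99 are made up for illustration; the columns have mixed dtypes, as in the question, so each row produced by `iterrows` is a freshly built Series.

```python
import pandas as pd

df = pd.DataFrame({'datehour': ['2018-04-01 00:00:00', '2018-04-01 00:30:00'],
                   'neg': [0, 1]})
df['dsf'] = 0

for i, r in df.iterrows():
    r['dsf'] = 99  # changes only the temporary row Series

unchanged = df['dsf'].tolist()  # df still holds the original zeros

for i, r in df.iterrows():
    df.loc[i, 'dsf'] = 99  # addresses the DataFrame directly

changed = df['dsf'].tolist()
print(unchanged, changed)
```

The first loop leaves `df` untouched, while the second one updates it because the assignment targets `df` rather than the row.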

Here is a fixed version of your code:

import pandas as pd

df = pd.DataFrame([
    ['2018-04-01 00:00:00', 53.1783, 110.8514, 0.479726, 0], 
    ['2018-04-01 00:30:00', 51.1496, 105.9060, 0.482972, 0], 
    ['2018-04-01 01:00:00', 42.9360, 120.7555, 0.355561, 1], 
    ['2018-04-01 01:30:00', 37.8455, 114.5514, 0.330380, 0], 
    ['2018-04-01 02:00:00', 43.9254, 99.1340,  0.443091, 1]], 
    columns=['datehour', 'pft', 'rev', 'mgn', 'neg'])

df['dsf'] = 0
counter = 0

for i,r in df.iterrows():
    counter += 1
    if r['neg'] != 0:
        df.loc[i, 'dsf'] = counter
        counter = 0

print(df)
#                datehour     pft      rev         mgn   neg      dsf
# 0   2018-04-01 00:00:00 53.1783 110.8514    0.479726    0         0
# 1   2018-04-01 00:30:00 51.1496 105.9060    0.482972    0         0
# 2   2018-04-01 01:00:00 42.9360 120.7555    0.355561    1         3
# 3   2018-04-01 01:30:00 37.8455 114.5514    0.330380    0         0
# 4   2018-04-01 02:00:00 43.9254 99.1340     0.443091    1         2
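
Another way to avoid writing to the frame during iteration is to collect the results in a plain list and assign the column once at the end. This is only a sketch of that idea, not part of the answer above; it assumes the toy `neg` column and uses the same counting rule as the fixed loop.

```python
import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1]})

dsf = []     # one entry per row, in order
counter = 0
for neg in df['neg']:
    counter += 1
    if neg != 0:
        dsf.append(counter)  # rows since the previous 1, inclusive
        counter = 0
    else:
        dsf.append(0)

df['dsf'] = dsf  # single assignment after the loop
print(df['dsf'].tolist())
```

Iterating over one column is also cheaper than `iterrows`, which builds a Series for every row.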

Answer 2 (score: 0)

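Here is a different solution to your problem, which should be much faster than using iterrows. With pandas you should always try to vectorize as much as possible.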

import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1, 0, 0, 1]})
indexes = df[df['neg'] == 1].index
positions = indexes.to_series()
# distance to the previous 1; -1 stands in for "before the first row"
values = positions - positions.shift().fillna(-1)
df.assign(dfs=pd.Series(values, index=indexes)).fillna(0)

    neg dfs
0   0   0.0
1   0   0.0
2   1   3.0
3   0   0.0
4   1   2.0
5   0   0.0
6   0   0.0
7   1   3.0

If you like, you can convert the dfs column to int yourself.
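
A sketch of that conversion, assuming the default integer index (0, 1, 2, ...) and filling the first shifted position with -1 so that the first count includes row 0. The variable names `positions` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'neg': [0, 0, 1, 0, 1, 0, 0, 1]})

# positions of the rows where neg == 1
positions = df[df['neg'] == 1].index.to_series()
# distance to the previous 1 (float, because shift() introduces NaN)
values = positions - positions.shift().fillna(-1)

# align on the index, fill the other rows with 0, then cast to int
result = df.assign(dfs=values).fillna(0).astype({'dfs': int})
print(result['dfs'].tolist())
```

The cast has to come after `fillna(0)`, since a column that still contains NaN cannot be converted to a plain integer dtype.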