I am trying to process bug status data from a Bugzilla database. Here is my df.head():
bug_id creation_ts added bug_when
0 194402 2006-06-07 15:40:13 ASSIGNED 2006-07-29 09:34:04
1 194402 2006-06-07 15:40:13 NEEDINFO 2007-05-30 17:28:46
2 194402 2006-06-07 15:40:13 ASSIGNED 2007-05-31 09:20:40
3 194402 2006-06-07 15:40:13 CLOSED 2012-03-28 10:54:12
4 200247 2006-07-26 10:40:03 CLOSED 2006-08-14 12:05:47
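This lists the bug status activity for bugs 194402 and 200247. Bugzilla does not record an activity entry when a bug is created. Is there a simple pandas way to add such a record by copying information from another row? I would like to use creation_ts as bug_when and set added to NEW. That would produce the following result: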
bug_id creation_ts added bug_when
0 194402 2006-06-07 15:40:13 NEW 2006-06-07 15:40:13
1 194402 2006-06-07 15:40:13 ASSIGNED 2006-07-29 09:34:04
2 194402 2006-06-07 15:40:13 NEEDINFO 2007-05-30 17:28:46
3 194402 2006-06-07 15:40:13 ASSIGNED 2007-05-31 09:20:40
4 194402 2006-06-07 15:40:13 CLOSED 2012-03-28 10:54:12
5 200247 2006-07-26 10:40:03 NEW 2006-07-26 10:40:03
6 200247 2006-07-26 10:40:03 CLOSED 2006-08-14 12:05:47
Or do I need to create a sub-DataFrame for each bug and add the record there to solve this?
I have tried the following:
df = DataFrame(data=list(activities), columns=activities.keys())
# set up an empty dataframe to store processed rows
xf = DataFrame(columns=['bug_id', 'added', 'bug_when'])
# set bug_id and creation_ts as index
df = df.set_index(['bug_id', 'creation_ts'])
# loop through the unique index values
with Timer() as t:
    for index in set(df.index):
        bug_id, creation_ts = index
        # set up the new row
        new_row = dict(bug_id=bug_id, bug_when=creation_ts, added='NEW')
        # convert the row to a dataframe and append it
        xf = xf.append(DataFrame([new_row]), ignore_index=True)
        # get the bug activities from the dataframe by index
        bug_activities = df.ix[index]
        # add a 'bug_id' column, since the index is ignored on append
        bug_activities['bug_id'] = bug_id
        # append bug_activities
        xf = xf.append(bug_activities, ignore_index=True)
logging.info("pandas done in %s" % t.interval)
But running it on 100, 1,000 and 10,000 records takes 0.75, 8.29 and 146.58 seconds respectively, which is not good.
Thank you very much for your help.
Answer 0 (score: 1)
I would group by the 'bug_id' column, take the first entry of each group, and append the result to your DataFrame:
In [67]:
# groupby bug_id, take first of each group and reset the index
first = df.groupby('bug_id').first().reset_index()
# now assign the timestamp and set the added column to 'NEW'
first['bug_when'], first['added'] = first['creation_ts'], 'NEW'
first
Out[67]:
bug_id creation_ts added bug_when
0 194402 2006-06-07 15:40:13 NEW 2006-06-07 15:40:13
1 200247 2006-07-26 10:40:03 NEW 2006-07-26 10:40:03
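For anyone who wants to reproduce this step outside the session, here is a minimal, self-contained sketch. It rebuilds the five sample rows from the question and derives the per-bug NEW rows with `drop_duplicates` instead of `groupby(...).first()`; that swap is my own variation, not what the answer uses, and the name `new_rows` is only illustrative. It assumes every row of a given bug carries the same `creation_ts`.

```python
import pandas as pd

# Sample data copied from the question's df.head()
df = pd.DataFrame({
    "bug_id": [194402, 194402, 194402, 194402, 200247],
    "creation_ts": pd.to_datetime([
        "2006-06-07 15:40:13", "2006-06-07 15:40:13",
        "2006-06-07 15:40:13", "2006-06-07 15:40:13",
        "2006-07-26 10:40:03",
    ]),
    "added": ["ASSIGNED", "NEEDINFO", "ASSIGNED", "CLOSED", "CLOSED"],
    "bug_when": pd.to_datetime([
        "2006-07-29 09:34:04", "2007-05-30 17:28:46",
        "2007-05-31 09:20:40", "2012-03-28 10:54:12",
        "2006-08-14 12:05:47",
    ]),
})

# Keep one (bug_id, creation_ts) pair per bug.
# Assumption: creation_ts is constant within a bug.
new_rows = df[["bug_id", "creation_ts"]].drop_duplicates().copy()

# The synthetic creation event: status NEW at the creation timestamp.
new_rows["added"] = "NEW"
new_rows["bug_when"] = new_rows["creation_ts"]

# Match the column order of the original frame and renumber the rows.
new_rows = new_rows[list(df.columns)].reset_index(drop=True)
print(new_rows)
```

Either approach should give one NEW row per bug; `drop_duplicates` simply avoids picking "first" values for columns that are overwritten anyway.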
In [68]:
# append back to dataframe and ignore the index so it is unique
df.append(first, ignore_index=True)
Out[68]:
bug_id creation_ts added bug_when
0 194402 2006-06-07 15:40:13 ASSIGNED 2006-07-29 09:34:04
1 194402 2006-06-07 15:40:13 NEEDINFO 2007-05-30 17:28:46
2 194402 2006-06-07 15:40:13 ASSIGNED 2007-05-31 09:20:40
3 194402 2006-06-07 15:40:13 CLOSED 2012-03-28 10:54:12
4 200247 2006-07-26 10:40:03 CLOSED 2006-08-14 12:05:47
5 194402 2006-06-07 15:40:13 NEW 2006-06-07 15:40:13
6 200247 2006-07-26 10:40:03 NEW 2006-07-26 10:40:03
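Note that the appended NEW rows end up at the bottom, whereas the question asked for them at the start of each bug's history. Also, `DataFrame.append` was removed in pandas 2.0, so on current versions the same step would go through `pd.concat`. A rough sketch of both adjustments follows, using the sample rows from the question; the variable name `result` is mine, and the sort assumes that a bug's creation timestamp is strictly earlier than any of its activity timestamps.

```python
import pandas as pd

# Sample data copied from the question's df.head()
df = pd.DataFrame({
    "bug_id": [194402, 194402, 194402, 194402, 200247],
    "creation_ts": pd.to_datetime([
        "2006-06-07 15:40:13", "2006-06-07 15:40:13",
        "2006-06-07 15:40:13", "2006-06-07 15:40:13",
        "2006-07-26 10:40:03",
    ]),
    "added": ["ASSIGNED", "NEEDINFO", "ASSIGNED", "CLOSED", "CLOSED"],
    "bug_when": pd.to_datetime([
        "2006-07-29 09:34:04", "2007-05-30 17:28:46",
        "2007-05-31 09:20:40", "2012-03-28 10:54:12",
        "2006-08-14 12:05:47",
    ]),
})

# Same idea as the answer: one row per bug, rewritten as a NEW event.
first = df.groupby("bug_id", as_index=False).first()
first["bug_when"] = first["creation_ts"]
first["added"] = "NEW"

# pd.concat replaces the removed DataFrame.append; sorting by bug and
# event time moves each NEW row to the start of its bug's history.
# Assumption: no activity shares the exact creation timestamp, so ties
# between NEW and another status do not need special handling.
result = (
    pd.concat([first, df], ignore_index=True)
      .sort_values(["bug_id", "bug_when"])
      .reset_index(drop=True)
)
print(result)
```

Because everything is done in a handful of vectorized calls rather than one append per bug, this should scale far better than the loop in the question, though I have not timed it on the original data.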