Add records to a pandas DataFrame by copying part of an existing record

Time: 2014-07-29 02:59:00

Tags: pandas

I am trying to process bug status data from a Bugzilla database. This is my df.head():

   bug_id         creation_ts     added            bug_when
0  194402 2006-06-07 15:40:13  ASSIGNED 2006-07-29 09:34:04
1  194402 2006-06-07 15:40:13  NEEDINFO 2007-05-30 17:28:46
2  194402 2006-06-07 15:40:13  ASSIGNED 2007-05-31 09:20:40
3  194402 2006-06-07 15:40:13  CLOSED   2012-03-28 10:54:12
4  200247 2006-07-26 10:40:03  CLOSED   2006-08-14 12:05:47

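This lists the bug status activity for bugs 194402 and 200247. Bugzilla does not store an activity record for the moment a bug is created. Is there a simple pandas way to add such a record by copying information from another row? I would like to use creation_ts as the bug_when value of the added record. That would produce the following result: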

   bug_id         creation_ts     added            bug_when
0  194402 2006-06-07 15:40:13  NEW      2006-06-07 15:40:13
1  194402 2006-06-07 15:40:13  ASSIGNED 2006-07-29 09:34:04
2  194402 2006-06-07 15:40:13  NEEDINFO 2007-05-30 17:28:46
3  194402 2006-06-07 15:40:13  ASSIGNED 2007-05-31 09:20:40
4  194402 2006-06-07 15:40:13  CLOSED   2012-03-28 10:54:12
5  200247 2006-07-26 10:40:03  NEW      2006-07-26 10:40:03
6  200247 2006-07-26 10:40:03  CLOSED   2006-08-14 12:05:47

Or do I need to create a sub-dataframe for each bug and add the record there to solve this?

I have tried the following:

df = DataFrame(data=list(activities), columns=activities.keys())
# setup empty dataframe to store processed rows
xf = DataFrame(columns=['bug_id', 'added', 'bug_when'])
# set bug_id and creation_ts as index
df = df.set_index(['bug_id','creation_ts'])
# loop through indexes
with Timer() as t:
    for index in set(df.index):
        bug_id, creation_ts = index
        # setup new row
        new_row = dict(bug_id=bug_id, bug_when=creation_ts, added='NEW')
        # convert row to dataframe and append
        xf = xf.append( DataFrame([new_row]), ignore_index=True)
        # get bug activities from dataframe by index
        bug_activities = df.ix[index]
        # add a 'bug_id' column, because the index is dropped on append
        bug_activities['bug_id'] = bug_id
        # append bug_activities
        xf = xf.append( bug_activities, ignore_index=True)
logging.info("pandas done in %s" % t.interval)

However, running it on 100, 1,000 and 10,000 records takes 0.75, 8.29 and 146.58 seconds, which is not good.

Thanks very much for your help.

1 Answer:

Answer 0: (score: 1)

I would group on the 'bug_id' column, take the first entry of each group, and append it back to your dataframe:

In [67]:
# groupby bug_id, take first of each group and reset the index
first = df.groupby('bug_id').first().reset_index()
# now assign the timestamp and set the added column to 'NEW'
first['bug_when'], first['added'] = first['creation_ts'], 'NEW'
first

Out[67]:
   bug_id         creation_ts added            bug_when
0  194402 2006-06-07 15:40:13   NEW 2006-06-07 15:40:13
1  200247 2006-07-26 10:40:03   NEW 2006-07-26 10:40:03
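
To try this step without a Bugzilla database, the sample rows can be rebuilt by hand. The following is only a sketch under assumptions: the frame is typed in from the question's df.head(), and drop_duplicates is swapped in for groupby().first() as one alternative that also keeps a single row per bug_id. It is not the method used in this answer.

```python
import pandas as pd

# Sample data typed in from the question's df.head(); not a real Bugzilla export.
df = pd.DataFrame({
    "bug_id": [194402, 194402, 194402, 194402, 200247],
    "creation_ts": pd.to_datetime(
        ["2006-06-07 15:40:13"] * 4 + ["2006-07-26 10:40:03"]
    ),
    "added": ["ASSIGNED", "NEEDINFO", "ASSIGNED", "CLOSED", "CLOSED"],
    "bug_when": pd.to_datetime([
        "2006-07-29 09:34:04",
        "2007-05-30 17:28:46",
        "2007-05-31 09:20:40",
        "2012-03-28 10:54:12",
        "2006-08-14 12:05:47",
    ]),
})

# Keep the first row seen for each bug_id (alternative to groupby().first()).
first = df.drop_duplicates(subset="bug_id").copy()
# Turn each kept row into the synthetic creation record.
first["bug_when"] = first["creation_ts"]
first["added"] = "NEW"
first = first.reset_index(drop=True)
print(first)
```

Because every row of a bug carries the same creation_ts, it does not matter which row of the group is kept here.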

In [68]:
# append back to dataframe and ignore the index so it is unique
df.append(first, ignore_index=True)

Out[68]:
   bug_id         creation_ts     added            bug_when
0  194402 2006-06-07 15:40:13  ASSIGNED 2006-07-29 09:34:04
1  194402 2006-06-07 15:40:13  NEEDINFO 2007-05-30 17:28:46
2  194402 2006-06-07 15:40:13  ASSIGNED 2007-05-31 09:20:40
3  194402 2006-06-07 15:40:13    CLOSED 2012-03-28 10:54:12
4  200247 2006-07-26 10:40:03    CLOSED 2006-08-14 12:05:47
5  194402 2006-06-07 15:40:13       NEW 2006-06-07 15:40:13
6  200247 2006-07-26 10:40:03       NEW 2006-07-26 10:40:03
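
Two caveats about the result above. The NEW rows land at the end, whereas the question wanted each one ahead of its bug's other activity. Also, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the call fails on current versions. The sketch below uses pd.concat instead and then sorts by bug_id and bug_when. It assumes that every activity timestamp is later than the bug's creation_ts, and the sample frame is typed in from the question rather than read from Bugzilla.

```python
import pandas as pd

# Sample data typed in from the question's df.head(); not a real Bugzilla export.
df = pd.DataFrame({
    "bug_id": [194402, 194402, 194402, 194402, 200247],
    "creation_ts": pd.to_datetime(
        ["2006-06-07 15:40:13"] * 4 + ["2006-07-26 10:40:03"]
    ),
    "added": ["ASSIGNED", "NEEDINFO", "ASSIGNED", "CLOSED", "CLOSED"],
    "bug_when": pd.to_datetime([
        "2006-07-29 09:34:04",
        "2007-05-30 17:28:46",
        "2007-05-31 09:20:40",
        "2012-03-28 10:54:12",
        "2006-08-14 12:05:47",
    ]),
})

# One row per bug, as in the answer, rewritten into a creation record.
first = df.groupby("bug_id").first().reset_index()
first["bug_when"] = first["creation_ts"]
first["added"] = "NEW"

# Combine in a single concat, then order each bug's rows chronologically.
# Assumption: no activity predates creation_ts, so NEW sorts first per bug.
result = (
    pd.concat([first, df], ignore_index=True)
    .sort_values(["bug_id", "bug_when"])
    .reset_index(drop=True)
)
print(result)
```

Since the whole frame is combined in one concat rather than grown row by row inside a loop, this avoids the repeated copying that made the looped version in the question slow.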