I have a dataset:

    import numpy as np
    import pandas as pd

    dat = {
        'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559',
                  'blk_-1187723472581877455', 'blk_-1385756122847916710',
                  'blk_-1470784088028862059'],
        'Seq': ['13 13 13 15', ' 15 13 13', '13 13 15', '13 13 15 13', '13'],
        'Time': ['1257712532.0 1257712532.0 1257712532.0 1257712532.0',
                 '1257712533.0 1257712534.0 1257712534.0',
                 '1257712533.0 1257712533.0 1257712533.0',
                 '1257712532.0 1257712532.0 1257712532.0 1257712534.0',
                 '1257712535.0'],
    }
    df = pd.DataFrame(data=dat)
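Block is an ID. Seq is an ID. Time is a time in Unix format. I want to modify the columns or create new ones.
1) I need to join the Seq and Time columns element by element, matching them by position (index).
2) After that, I want the deltas of the Time column (next element minus previous one), with the first element set to zero.
Finally, I want to write to a file the rows that come from different blocks but have the same Seq id. I would like to solve this with pandas methods.
I tried to solve it with dictionaries, but that approach is complicated.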
    dict_block = dict((key, []) for key in np.unique(df.Block))
    for idx, row in enumerate(df.Seq):
        block = df.Block[idx]
        # split() without arguments ignores the leading space in ' 15 13 13'
        seq_ids = row.split()
        times = df.Time[idx].split()
        dict_seq = dict((key, []) for key in np.unique(seq_ids))
        for idy, key in enumerate(seq_ids):
            # pair each Seq id with the Time at the same position
            dict_seq[key].append(times[idy])
        dict_block[block].append(dict_seq)
1) For example:
    blk_-105450231192318816 :
        13: 1257712532.0, 1257712532.0, 1257712532.0
        15: 1257712532.0
2) For example:
    blk_-105450231192318816 :
        13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
        15: 0
Output of the dictionary attempt:
    {'blk_-105450231192318816':
        [{'13': ['1257712532.0', '1257712532.0', '1257712532.0'],
          '15': ['1257712532.0']}],
     'blk_-1076549517733373559':
        [{'13': ['1257712534.0', '1257712534.0'],
          '15': ['1257712533.0']}],
     'blk_-1187723472581877455':
        [{'13': ['1257712533.0', '1257712533.0'],
          '15': ['1257712533.0']}],
     'blk_-1385756122847916710':
        [{'13': ['1257712532.0', '1257712532.0', '1257712534.0'],
          '15': ['1257712532.0']}],
     'blk_-1470784088028862059':
        [{'13': ['1257712535.0']}]}
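Summary:
I want to solve the following points with pandas and numpy methods:
1) Group the columns
2) Get the time deltas (t1 - t0)
Looking forward to your comments :)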
Answer 0 (score: 1)
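If you prefer to work with dictionaries, you can use apply with custom functions that build them. Here df is the sample dataframe you provided. I show two approaches below; I hope the code is clear enough to follow.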
    def grouping(x):
        """Make a dictionary combining 'Seq' and 'Time' columns.
        'Seq' elements are the keys, 'Time' are the values. 'Time' elements
        corresponding to the same key are stored in a list.
        """
        # split the strings and make the elements numeric
        keys = list(map(int, x['Seq'].split()))
        times = list(map(float, x['Time'].split()))
        # build the result dictionary
        res = {}
        for i, k in enumerate(keys):
            try:
                res[k].append(times[i])
            except KeyError:
                res[k] = [times[i]]
        return res

    def timediffs(x):
        """Make a dictionary starting from 'GroupedSeq' column, which can
        be created with the grouping function.
        It contains the difference between the times of each key.
        """
        ddt = x['GroupedSeq']
        res = {}
        # iterate over the dictionary to calculate the differences
        for k, v in ddt.items():
            res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
        return res

    df['GroupedSeq'] = df.apply(grouping, axis=1)
    df['difftimes'] = df.apply(timediffs, axis=1)
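Called with axis=1, apply passes each row to the function, and the results are stored in a new column of the dataframe. df now contains two new columns, and you can drop the original 'Seq' and 'Time' columns with:

    df.drop(['Seq', 'Time'], axis=1, inplace=True)

In the end, df looks like this: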
       Block                     GroupedSeq                                         difftimes
    0  blk_-105450231192318816   {13: [1257712532.0, 1257712532.0, 1257712532.0...  {13: [0.0, 0.0, 0.0], 15: [0.0]}
    1  blk_-1076549517733373559  {15: [1257712533.0], 13: [1257712534.0, 125771...  {15: [0.0], 13: [0.0, 0.0]}
    2  blk_-1187723472581877455  {13: [1257712533.0, 1257712533.0], 15: [125771...  {13: [0.0, 0.0], 15: [0.0]}
    3  blk_-1385756122847916710  {13: [1257712532.0, 1257712532.0, 1257712534.0...  {13: [0.0, 0.0, 2.0], 15: [0.0]}
    4  blk_-1470784088028862059  {13: [1257712535.0]}                               {13: [0.0]}
As you can see, pandas itself is only used here to apply the custom functions; inside those functions, plain Python code does the actual work.
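To make the row-wise behaviour of apply concrete, here is a minimal sketch on a toy frame. The frame, its column name vals and the helper total are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy frame (hypothetical data): each cell holds space-separated numbers.
toy = pd.DataFrame({"vals": ["1 2", "3"]})


def total(row):
    # With axis=1, `row` is a Series indexed by the column names,
    # and everything inside this function is ordinary Python.
    return sum(map(float, row["vals"].split()))


# One call to `total` per row; the results become a new column.
toy["total"] = toy.apply(total, axis=1)
print(toy)
```

The function is called once per row, so this is convenient but is essentially a Python loop rather than a vectorised pandas operation.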
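If you store lists or dictionaries inside a dataframe, pandas itself is not very useful. So I also propose a solution without dictionaries. It combines groupby with apply to operate on selected rows according to their values.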
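groupby selects subsets of the dataframe according to the values of one or more columns: all rows that share the same values in those columns are grouped together, and a method or operation is then performed on each subset.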
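A small sketch of that idea, on a toy frame whose column names key and value are invented for illustration: rows with the same key form a group, and both an aggregation and a within-group difference are computed per group.

```python
import pandas as pd

# Toy frame (hypothetical data), two groups: "a" and "b".
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "value": [1.0, 4.0, 2.0, 2.0, 7.0],
})

# Rows sharing the same "key" form one group; sum() runs once per group.
group_sums = toy.groupby("key")["value"].sum()

# diff() is computed inside each group, so the first row of every group
# has no predecessor and is NaN; fillna(0.0) turns that into zero.
toy["delta"] = toy.groupby("key")["value"].diff().fillna(0.0)

print(group_sums)
print(toy)
```

The within-group diff followed by fillna(0.0) is the same pattern the question asks for: next element minus previous, with the first element set to zero.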
Again, df is the sample dataframe you provided.
    df1 = df.copy()  # working on a copy, not really needed but I wanted to preserve the original

    # split the strings and make them numeric lists using apply
    df1['Seq'] = df1['Seq'].apply(lambda x: list(map(int, x.split())))
    df1['Time'] = df1['Time'].apply(lambda x: list(map(float, x.split())))

    # for each index in 'Block', unnest the list in 'Seq', making it a secondary index
    df2 = df1.groupby('Block').apply(
        lambda x: pd.DataFrame([[e] for e in x['Time'].iloc[0]],
                               index=x['Seq'].tolist()))

    # reset the index and rename the column names created by pandas
    df2 = df2.reset_index().rename(columns={'level_1': 'Seq', 0: 'Time'})

    # custom method to store the differences between times
    def timediffs(x):
        x['tdiff'] = x['Time'].diff().fillna(0.0)
        return x

    df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)
The final df3 is:
        Block                     Seq  Time          tdiff
    0   blk_-105450231192318816   13   1.257713e+09  0.0
    1   blk_-105450231192318816   13   1.257713e+09  0.0
    2   blk_-105450231192318816   13   1.257713e+09  0.0
    3   blk_-105450231192318816   15   1.257713e+09  0.0
    4   blk_-1076549517733373559  15   1.257713e+09  0.0
    5   blk_-1076549517733373559  13   1.257713e+09  0.0
    6   blk_-1076549517733373559  13   1.257713e+09  0.0
    7   blk_-1187723472581877455  13   1.257713e+09  0.0
    8   blk_-1187723472581877455  13   1.257713e+09  0.0
    9   blk_-1187723472581877455  15   1.257713e+09  0.0
    10  blk_-1385756122847916710  13   1.257713e+09  0.0
    11  blk_-1385756122847916710  13   1.257713e+09  0.0
    12  blk_-1385756122847916710  15   1.257713e+09  0.0
    13  blk_-1385756122847916710  13   1.257713e+09  2.0
    14  blk_-1470784088028862059  13   1.257713e+09  0.0
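As you can see, there are no dictionaries inside the dataframe. The 'Block' and 'Seq' columns contain repeated values, but that is unavoidable in this long format.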
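For reference, the same long table can probably be built more directly. The sketch below is an alternative, not the code used above, and it assumes pandas 1.3 or newer, where DataFrame.explode accepts several columns at once. It uses only two rows of the sample data to stay short. The helper write_by_seq and its file naming are hypothetical; they are one possible reading of the last request in the question (one file per Seq id, mixing rows from different blocks).

```python
import os

import pandas as pd

# Two rows taken from the question's sample data.
df = pd.DataFrame({
    "Block": ["blk_-105450231192318816", "blk_-1385756122847916710"],
    "Seq": ["13 13 13 15", "13 13 15 13"],
    "Time": [
        "1257712532.0 1257712532.0 1257712532.0 1257712532.0",
        "1257712532.0 1257712532.0 1257712532.0 1257712534.0",
    ],
})

# Split the strings into lists, then unnest both columns in parallel
# so that each (Seq, Time) pair at the same position ends up on one row.
long_df = df.assign(
    Seq=df["Seq"].str.split(),
    Time=df["Time"].str.split(),
).explode(["Seq", "Time"], ignore_index=True)
long_df = long_df.astype({"Seq": int, "Time": float})

# Difference to the previous time within each (Block, Seq) pair;
# the first element of every pair has no predecessor and is set to 0.0.
long_df["tdiff"] = (
    long_df.groupby(["Block", "Seq"])["Time"].diff().fillna(0.0)
)


def write_by_seq(frame, out_dir):
    """Hypothetical helper: write one CSV per Seq id, across all blocks."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for seq_id, group in frame.groupby("Seq"):
        path = os.path.join(out_dir, f"seq_{seq_id}.csv")
        group.to_csv(path, index=False)
        paths.append(path)
    return paths
```

Calling write_by_seq(long_df, "out") would create out/seq_13.csv and out/seq_15.csv, each containing the rows from every block that share that Seq id.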