Reshape a pandas DataFrame and work with its columns

Time: 2019-08-06 13:41:16

Tags: pandas numpy

I have a dataset:

import numpy as np
import pandas as pd

dat = {
    'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559',
              'blk_-1187723472581877455', 'blk_-1385756122847916710',
              'blk_-1470784088028862059'],
    'Seq': ['13 13 13 15', ' 15 13 13', '13 13 15', '13 13 15 13', '13'],
    'Time': ['1257712532.0 1257712532.0 1257712532.0 1257712532.0',
             '1257712533.0 1257712534.0 1257712534.0',
             '1257712533.0 1257712533.0 1257712533.0',
             '1257712532.0 1257712532.0 1257712532.0 1257712534.0',
             '1257712535.0'],
}

df = pd.DataFrame(data=dat)

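Block is an ID. Seq is an ID. Time is a timestamp in Unix format. I want to change the columns or create new ones.

1) I need to combine the Seq and Time columns by pairing their elements by position (index).

2) After that, I want to get the deltas of the Time column (next element minus previous element), with the first element set to zero.

Finally, I want to write to a file the rows that come from different blocks but have the same Seq ID. I would like to solve this with pandas methods.

I tried to solve it with dictionaries, but that approach is complicated.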

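# one entry per block; each entry holds a list with that block's Seq -> times dict
dict_block = dict((key, []) for key in np.unique(df.Block))
for idx, row in enumerate(df.Seq):
    block = df.Block[idx]
    # split() with no argument also ignores the leading space in ' 15 13 13'
    seq_ids = row.split()
    times = df.Time[idx].split()
    dict_seq = dict((key, []) for key in np.unique(seq_ids))
    for idy, key in enumerate(seq_ids):
        dict_seq[key].append(times[idy])
    dict_block[block].append(dict_seq)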

1) For example:

blk_-105450231192318816 : 
    13: 1257712532.0, 1257712532.0, 1257712532.0
    15: 1257712532.0

2) For example:

blk_-105450231192318816 : 
    13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
    15: 0

Output of the dictionary attempt:

{'blk_-105450231192318816': 
[{'13': ['1257712532.0', '1257712532.0','1257712532.0'],
'15': ['1257712532.0']}],
'blk_-1076549517733373559': 
[{'13': ['1257712534.0', '1257712534.0'],
'15': ['1257712533.0']}],
'blk_-1187723472581877455': 
[{'13': ['1257712533.0', '1257712533.0'],
'15': ['1257712533.0']}],
'blk_-1385756122847916710': 
[{'13': ['1257712532.0',
'1257712532.0',
'1257712534.0'],
'15': ['1257712532.0']}],
'blk_-1470784088028862059': 
[{'13': ['1257712535.0']}]}

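Summary:

I want to solve the following points with pandas/numpy methods:

1) Group the columns

2) Get the time deltas (t1 - t0)

Looking forward to your comments :)

1 answer:

Answer 0: (score: 1)

Solution 1: using dictionaries

If you prefer to work with dictionaries, you can use apply with custom functions that build them.

df is the sample dataframe you provided. Here I wrote two functions. I hope the code is clear enough to understand.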

def grouping(x):
    """Make a dictionary combining 'Seq' and 'Time' columns.

    'Seq' elements are the keys, 'Time' are the values. 'Time' elements
    corresponding to the same key are stored in a list.
    """
    #splitting the string and make it numeric
    keys = list(map(int, x['Seq'].split()))
    times = list(map(float, x['Time'].split()))

    #building the result dictionary.
    res = {}
    for i, k in enumerate(keys):
        try:
            res[k].append(times[i])
        except KeyError:
            res[k] = [times[i]]

    return res    


def timediffs(x):
    """Make a dictionary starting from 'GroupedSeq' column, which can
    be created with the grouping function.

    It contains the difference between the times of each key.
    """
    ddt = x['GroupedSeq']
    res = {}
    #iterating over the dictionary to calculate the differences.
    for k, v in ddt.items():
        res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
    return res  

df['GroupedSeq'] = df.apply(grouping, axis=1)
df['difftimes'] = df.apply(timediffs, axis=1)

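With axis=1, apply calls the function on each row, and the result is stored in a new column of the dataframe. df now contains two new columns, and you can drop the original 'Seq' and 'Time' columns with df.drop(['Seq', 'Time'], axis=1, inplace=True). In the end, df looks like this: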

                      Block                                         GroupedSeq                         difftimes
0   blk_-105450231192318816  {13: [1257712532.0, 1257712532.0, 1257712532.0...  {13: [0.0, 0.0, 0.0], 15: [0.0]}
1  blk_-1076549517733373559  {15: [1257712533.0], 13: [1257712534.0, 125771...       {15: [0.0], 13: [0.0, 0.0]}
2  blk_-1187723472581877455  {13: [1257712533.0, 1257712533.0], 15: [125771...       {13: [0.0, 0.0], 15: [0.0]}
3  blk_-1385756122847916710  {13: [1257712532.0, 1257712532.0, 1257712534.0...  {13: [0.0, 0.0, 2.0], 15: [0.0]}
4  blk_-1470784088028862059                               {13: [1257712535.0]}                       {13: [0.0]}

As you can see, pandas itself is used here only to apply the custom functions; inside those functions, plain Python code does the work.
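For reference, the same per-row logic can be written a bit more compactly. This is only a sketch, not a replacement for the functions above: the helper names group_times and time_deltas are made up for illustration, and it assumes NumPy 1.16 or later for the prepend argument of numpy.diff.

```python
from collections import defaultdict

import numpy as np


def group_times(seq, time):
    """Map each Seq id to the list of its times, keeping the original order."""
    grouped = defaultdict(list)
    # split() pairs the i-th Seq id with the i-th Time value
    for key, t in zip(seq.split(), time.split()):
        grouped[int(key)].append(float(t))
    return dict(grouped)


def time_deltas(grouped):
    """Per key: 0.0 for the first time, then the successive differences."""
    # prepend=v[0] makes the first difference v[0] - v[0] = 0.0
    return {k: np.diff(v, prepend=v[0]).tolist() for k, v in grouped.items()}


# Seq and Time strings of the fourth sample row (blk_-1385756122847916710)
grouped = group_times(
    '13 13 15 13',
    '1257712532.0 1257712532.0 1257712532.0 1257712534.0',
)
deltas = time_deltas(grouped)
print(grouped)
print(deltas)
```

The result for this row agrees with row 3 of the table above: key 13 gets deltas 0.0, 0.0, 2.0 and key 15 gets a single 0.0.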


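Solution 2: no dictionaries, more pandas

pandas itself is not very useful if you store lists or dictionaries inside a dataframe, so I propose a solution without dictionaries. I use groupby together with apply to operate on selected rows according to their values.
groupby selects subsets of the dataframe according to the values of one or more columns: all rows that share the same values in those columns are grouped together, and a method or operation is then applied to each subset.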
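To make the grouping idea concrete, here is a small sketch on made-up data (the toy frame and its values are purely illustrative and are not taken from the question). Rows with the same 'Block' value form one group, and diff() is computed inside each group separately.

```python
import pandas as pd

# Hypothetical toy data: two blocks with a few timestamps each
toy = pd.DataFrame({
    'Block': ['a', 'a', 'b', 'b', 'b'],
    'Time': [1.0, 3.0, 10.0, 10.5, 12.0],
})

# diff() runs within each group, so the first row of every group has no
# predecessor and yields NaN, which is replaced by 0.0 here.
toy['tdiff'] = toy.groupby('Block')['Time'].diff().fillna(0.0)

# Number of rows that ended up in each group
sizes = toy.groupby('Block').size()

print(toy)
print(sizes)
```

Note that the delta for the first 'b' row is 0.0 rather than 10.0 - 3.0, because the difference never crosses a group boundary.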

Again, df is the sample dataframe you provided.

df1 = df.copy() #working on a copy, not really needed but I wanted to preserve the original

# split each string and convert it to a list of numbers using apply
df1['Seq'] = df1['Seq'].apply(lambda x : list(map(int, x.split())))
df1['Time'] = df1['Time'].apply(lambda x : list(map(float, x.split())))

# for each 'Block', unnest the list in 'Seq', making it a secondary index level
df2 = df1.groupby('Block').apply(lambda x : pd.DataFrame([[e] for e in x['Time'].iloc[0]], index=x['Seq'].tolist()))
#resetting index and renaming column names created by pandas
df2 = df2.reset_index().rename(columns={'level_1':'Seq', 0:'Time'})

#custom method to store the differences between times.
def timediffs(x):
    x['tdiff'] = x['Time'].diff().fillna(0.0)
    return x

df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)

The final df3 is:

                       Block      Seq          Time  tdiff
0    blk_-105450231192318816       13  1.257713e+09    0.0
1    blk_-105450231192318816       13  1.257713e+09    0.0
2    blk_-105450231192318816       13  1.257713e+09    0.0
3    blk_-105450231192318816       15  1.257713e+09    0.0
4   blk_-1076549517733373559       15  1.257713e+09    0.0
5   blk_-1076549517733373559       13  1.257713e+09    0.0
6   blk_-1076549517733373559       13  1.257713e+09    0.0
7   blk_-1187723472581877455       13  1.257713e+09    0.0
8   blk_-1187723472581877455       13  1.257713e+09    0.0
9   blk_-1187723472581877455       15  1.257713e+09    0.0
10  blk_-1385756122847916710       13  1.257713e+09    0.0
11  blk_-1385756122847916710       13  1.257713e+09    0.0
12  blk_-1385756122847916710       15  1.257713e+09    0.0
13  blk_-1385756122847916710       13  1.257713e+09    2.0
14  blk_-1470784088028862059       13  1.257713e+09    0.0

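As you can see, there are no dictionaries inside the dataframe. You have repeated values in the 'Block' and 'Seq' columns, but that is unavoidable.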
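On newer pandas versions, the same long-format table can probably be reached without groupby().apply(). The sketch below is an alternative technique, not the code used above: it assumes pandas 1.3 or later (needed for exploding two columns at once), and the name flat is just an illustrative choice.

```python
import pandas as pd

dat = {
    'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559',
              'blk_-1187723472581877455', 'blk_-1385756122847916710',
              'blk_-1470784088028862059'],
    'Seq': ['13 13 13 15', ' 15 13 13', '13 13 15', '13 13 15 13', '13'],
    'Time': ['1257712532.0 1257712532.0 1257712532.0 1257712532.0',
             '1257712533.0 1257712534.0 1257712534.0',
             '1257712533.0 1257712533.0 1257712533.0',
             '1257712532.0 1257712532.0 1257712532.0 1257712534.0',
             '1257712535.0'],
}
df = pd.DataFrame(dat)

# Split the space-separated strings into lists, then give every list element
# its own row. Exploding two columns together assumes pandas >= 1.3.
flat = (
    df.assign(Seq=df['Seq'].str.split(), Time=df['Time'].str.split())
      .explode(['Seq', 'Time'])
      .astype({'Seq': int, 'Time': float})
      .reset_index(drop=True)
)

# Difference to the previous time within each (Block, Seq) pair; the first
# element of each pair has no predecessor, so its NaN becomes 0.0.
flat['tdiff'] = flat.groupby(['Block', 'Seq'])['Time'].diff().fillna(0.0)

print(flat)
```

From here, iterating over flat.groupby('Seq') gives one sub-frame per Seq ID across all blocks, and each of those could be written out with to_csv if a file per Seq ID is what you need.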