Question

I have a dataset:

import pandas as pd

dat = {
    'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559',
              'blk_-1187723472581877455', 'blk_-1385756122847916710',
              'blk_-1470784088028862059'],
    'Seq': ['13 13 13 15', ' 15 13 13', '13 13 15', '13 13 15 13', '13'],
    'Time': ['1257712532.0 1257712532.0 1257712532.0 1257712532.0',
             '1257712533.0 1257712534.0 1257712534.0',
             '1257712533.0 1257712533.0 1257712533.0',
             '1257712532.0 1257712532.0 1257712532.0 1257712534.0',
             '1257712535.0'],
}

df = pd.DataFrame(data=dat)

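Block is an ID. Seq is an ID. Time is a Unix timestamp. I want to modify the columns or create new ones.

1) I need to combine the Seq and Time columns by matching their elements by index.

2) After that, I want the deltas of the Time column (next element minus previous element), with the first element set to zero.

Finally, I want to write to a file the rows that come from different blocks but have the same Seq id. I would like to solve this with pandas methods.

I tried to solve it with dictionaries, but that approach is complicated.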

import numpy as np

dict_block = dict((key, []) for key in np.unique(df.Block))
for idx, row in enumerate(df.Seq):
    block = df.Block[idx]
    # split() without arguments ignores leading and repeated whitespace
    dict_seq = dict((key, []) for key in np.unique(row.split()))
    for idy, key in enumerate(row.split()):
        item = df.Time[idx].split()[idy]
        dict_seq[key].append(item)
    dict_block[block].append(dict_seq)

1) For example:

blk_-105450231192318816 : 
    13: 1257712532.0, 1257712532.0, 1257712532.0
    15: 1257712532.0

2) For example:

blk_-105450231192318816 : 
    13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
    15: 0

Output of the dictionary attempt:

{'blk_-105450231192318816': 
[{'13': ['1257712532.0', '1257712532.0','1257712532.0'],
'15': ['1257712532.0']}],
'blk_-1076549517733373559': 
[{'13': ['1257712534.0', '1257712534.0'],
'15': ['1257712533.0']}],
'blk_-1187723472581877455': 
[{'13': ['1257712533.0', '1257712533.0'],
'15': ['1257712533.0']}],
'blk_-1385756122847916710': 
[{'13': ['1257712532.0',
'1257712532.0',
'1257712534.0'],
'15': ['1257712532.0']}],
'blk_-1470784088028862059': 
[{'13': ['1257712535.0']}]}

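Summary:

I want to solve the following points with pandas/numpy methods:

1) Group the columns

2) Get the time deltas (t1 - t0)

Looking forward to your comments :)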

Answer 1

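Solution 1: using dictionaries

If you prefer working with dictionaries, you can use apply with custom functions that build them.

df is the sample dataframe you provided. I wrote two functions here. I hope the code is clear enough to follow.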

def grouping(x):
    """Make a dictionary combining 'Seq' and 'Time' columns.

    'Seq' elements are the keys, 'Time' are the values. 'Time' elements
    corresponding to the same key are stored in a list.
    """
    # split the strings and convert the elements to numbers
    keys = list(map(int, x['Seq'].split()))
    times = list(map(float, x['Time'].split()))

    #building the result dictionary.
    res = {}
    for i, k in enumerate(keys):
        try:
            res[k].append(times[i])
        except KeyError:
            res[k] = [times[i]]

    return res    


def timediffs(x):
    """Make a dictionary starting from 'GroupedSeq' column, which can
    be created with the grouping function.

    It contains the difference between the times of each key.
    """
    ddt = x['GroupedSeq']
    res = {}
    #iterating over the dictionary to calculate the differences.
    for k, v in ddt.items():
        res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
    return res  

df['GroupedSeq'] = df.apply(grouping, axis=1)
df['difftimes'] = df.apply(timediffs, axis=1)

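With axis=1, apply calls the function on each row. The result is stored in a new column of the dataframe. df now contains two new columns, and you can drop the original 'Seq' and 'Time' columns with df.drop(['Seq', 'Time'], axis=1, inplace=True). In the end, df looks like this: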

                      Block                                         GroupedSeq                         difftimes
0   blk_-105450231192318816  {13: [1257712532.0, 1257712532.0, 1257712532.0...  {13: [0.0, 0.0, 0.0], 15: [0.0]}
1  blk_-1076549517733373559  {15: [1257712533.0], 13: [1257712534.0, 125771...       {15: [0.0], 13: [0.0, 0.0]}
2  blk_-1187723472581877455  {13: [1257712533.0, 1257712533.0], 15: [125771...       {13: [0.0, 0.0], 15: [0.0]}
3  blk_-1385756122847916710  {13: [1257712532.0, 1257712532.0, 1257712534.0...  {13: [0.0, 0.0, 2.0], 15: [0.0]}
4  blk_-1470784088028862059                               {13: [1257712535.0]}                       {13: [0.0]}

As you can see, pandas itself is only used here to apply the custom functions; inside those functions, plain Python code does the work.
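
To make that point concrete, here is a minimal pure-Python sketch of the same two steps on a single row, with no pandas involved. The helper names group_times and deltas are hypothetical, introduced only for this illustration; the input strings are the 'Seq' and 'Time' values of the fourth block in the question.

```python
from collections import defaultdict


def group_times(seq, time):
    """Pair each Seq id with its timestamps, keeping the original order."""
    grouped = defaultdict(list)
    for key, t in zip(seq.split(), time.split()):
        grouped[int(key)].append(float(t))
    return dict(grouped)


def deltas(grouped):
    """Per key: 0.0 for the first element, then next minus previous."""
    return {k: [0.0] + [b - a for a, b in zip(v, v[1:])]
            for k, v in grouped.items()}


row_seq = '13 13 15 13'
row_time = '1257712532.0 1257712532.0 1257712532.0 1257712534.0'

grouped = group_times(row_seq, row_time)
diffs = deltas(grouped)
print(grouped)
print(diffs)
```

Pairing the two lists with zip avoids the index bookkeeping of enumerate, and it stops at the shorter list if a row happens to have mismatched lengths.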

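Solution 2: no dictionaries, more pandas

pandas itself is not very useful if you store lists or dictionaries inside a dataframe, so I propose a solution without dictionaries. I use groupby together with apply to operate on rows selected by their values.
groupby selects subsets of the dataframe based on the values of one or more columns: all rows that share the same values in those columns are grouped together, and a method or operation is then applied to each subset.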
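
As a small illustration of that split-then-apply behaviour, the sketch below uses a toy dataframe that is not part of the question's data; the names toy, key, and value are made up for the example.

```python
import pandas as pd

# Toy data, invented for this illustration only
toy = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b'],
    'value': [1.0, 10.0, 4.0, 15.0],
})

# Iterating over a groupby yields (group label, sub-dataframe) pairs
groups = {label: sub['value'].tolist() for label, sub in toy.groupby('key')}
print(groups)

# An aggregation is computed once per group
totals = toy.groupby('key')['value'].sum()
print(totals)
```

Each sub-dataframe keeps the original row order, which is what makes a per-group difference of consecutive times meaningful.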

Again, df is the sample dataframe you provided.

df1 = df.copy() #working on a copy, not really needed but I wanted to preserve the original

# split each string and convert it to a numeric list using apply
df1['Seq'] = df1['Seq'].apply(lambda x : list(map(int, x.split())))
df1['Time'] = df1['Time'].apply(lambda x : list(map(float, x.split())))

# for each value in 'Block', unnest the list in 'Seq', making it a secondary index
df2 = df1.groupby('Block').apply(lambda x : pd.DataFrame([[e] for e in x['Time'].iloc[0]], index=x['Seq'].tolist()))
# reset the index and rename the columns that pandas created
df2 = df2.reset_index().rename(columns={'level_1':'Seq', 0:'Time'})

#custom method to store the differences between times.
def timediffs(x):
    x['tdiff'] = x['Time'].diff().fillna(0.0)
    return x

df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)

The final df3 is:

                       Block      Seq          Time  tdiff
0    blk_-105450231192318816       13  1.257713e+09    0.0
1    blk_-105450231192318816       13  1.257713e+09    0.0
2    blk_-105450231192318816       13  1.257713e+09    0.0
3    blk_-105450231192318816       15  1.257713e+09    0.0
4   blk_-1076549517733373559       15  1.257713e+09    0.0
5   blk_-1076549517733373559       13  1.257713e+09    0.0
6   blk_-1076549517733373559       13  1.257713e+09    0.0
7   blk_-1187723472581877455       13  1.257713e+09    0.0
8   blk_-1187723472581877455       13  1.257713e+09    0.0
9   blk_-1187723472581877455       15  1.257713e+09    0.0
10  blk_-1385756122847916710       13  1.257713e+09    0.0
11  blk_-1385756122847916710       13  1.257713e+09    0.0
12  blk_-1385756122847916710       15  1.257713e+09    0.0
13  blk_-1385756122847916710       13  1.257713e+09    2.0
14  blk_-1470784088028862059       13  1.257713e+09    0.0

As you can see, there are no dictionaries inside the dataframe. The 'Block' and 'Seq' columns contain repeated values, but that is unavoidable.
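
The same long table can also be sketched without groupby(...).apply. This is an alternative, not the method used above, and it assumes pandas 1.3 or newer, where DataFrame.explode accepts a list of columns. Only two of the question's blocks are used to keep it short, and the variable name long_df is made up for the example.

```python
import pandas as pd

# Two of the blocks from the question, to keep the sketch short
dat = {
    'Block': ['blk_-1076549517733373559', 'blk_-1385756122847916710'],
    'Seq': [' 15 13 13', '13 13 15 13'],
    'Time': ['1257712533.0 1257712534.0 1257712534.0',
             '1257712532.0 1257712532.0 1257712532.0 1257712534.0'],
}
df = pd.DataFrame(dat)

# Split both strings into lists, then unnest them in parallel
# (multi-column explode is assumed available, i.e. pandas >= 1.3)
long_df = df.assign(
    Seq=df['Seq'].str.split(),
    Time=df['Time'].str.split(),
).explode(['Seq', 'Time'], ignore_index=True)

long_df['Seq'] = long_df['Seq'].astype(int)
long_df['Time'] = long_df['Time'].astype(float)

# Difference between consecutive times within each (Block, Seq) pair;
# the first element of every group has no predecessor, so it becomes 0.0
long_df['tdiff'] = (
    long_df.groupby(['Block', 'Seq'])['Time'].diff().fillna(0.0)
)
print(long_df)
```

In this layout, rows that share a Seq id across different blocks can be selected with an ordinary boolean mask such as long_df[long_df['Seq'] == 13].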
