
Question

Is the drop function buggy? I need expert guidance on how to solve this:

Rows for a given date may appear in more than one file.
I drop the rows with a duplicated date,time from cs...
...then compare one column (val2), keeping only the rows from the file whose first row has the highest val2.

Code:

cs = pd.concat([pd.read_csv(f) for f in fnames])
dp = cs[cs.duplicated(['date','time'],keep=False)]
dp = dp.sort_values(['date','time'],ascending=True)

i=0
while len(dp)>0:
    if dp.values[i][3] > dp.values[i+1][3]:
        if dp.index[i] > dp.index[i+1]:
            cs.drop(cs[(cs.date==dp.values[i][0]) & (cs.index < dp.index[i])].index, inplace=True)
            dp = cs[cs.duplicated(['date','time'],keep=False)]
            dp = dp.sort_values(['date','time'],ascending=True)

Sample data:

file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230

Expected output:

date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230

Answer 1

Edited answer:

After the discussion in the comments, I think this is what you need (I added some code to reproduce the problem):

import pandas as pd
from io import StringIO

input_string = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""

buf = StringIO(input_string)
cs = pd.read_csv(buf)

def pick_file(df):
    first = df.groupby('file').first()
    file = first['val2'].idxmax()
    return df[df['file'] == file]

result = cs.groupby(['date']).apply(pick_file)

result = result.reset_index(level=0, drop=True)

The result is:

  file   date   time  val1  val2
2   f2  20jun  01:00    10   320
3   f2  20jun  02:00    10    50
4   f2  21jun  01:00    10   130
5   f2  21jun  02:00    10   230

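This is how it works: a groupby inside a groupby.

The outer groupby groups by date, because those are the groups within which we want to search for a file.

The inner groupby searches each group for the right file and keeps only the rows that come from that file.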
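As an alternative way to express the same selection without `apply`, here is a minimal sketch. It assumes the sample data from the question and that "first row" means the first row of each file within a date, in file order; the variable names `first_rows` and `winners` are only illustrative.

```python
import pandas as pd
from io import StringIO

data = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""
cs = pd.read_csv(StringIO(data))

# First row of every (date, file) pair, in original row order.
first_rows = cs.drop_duplicates(subset=['date', 'file'], keep='first')

# For each date, pick the file whose first row has the highest val2.
winners = first_rows.loc[
    first_rows.groupby('date')['val2'].idxmax(), ['date', 'file']
]

# Keep only the rows that belong to the winning file of each date.
result = cs.merge(winners, on=['date', 'file'], how='inner')
print(result)
```

The merge yields a fresh 0-based index, so the original row labels are not preserved; if they matter, the `apply` version above may be the better fit.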

Original answer:

Instead of building a DataFrame of the duplicates and iterating over it, you can just use groupby:

cs = pd.concat([pd.read_csv(f) for f in fnames])
result = cs.groupby(['date', 'time'])\
    .apply(lambda x: x[x['val2']==x['val2'].max()])

It groups all rows that have the same values in the date and time columns and then, for each group, keeps only the rows with the highest val2.

The result is:

              file   date   time  val1  val2
date  time                                  
20jun 01:00 2   f2  20jun  01:00    10   320
      02:00 3   f2  20jun  02:00    10   220
21jun 01:00 4   f2  21jun  01:00    10   130
      02:00 5   f2  21jun  02:00    10   230
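If the nested index produced by `apply` is not wanted, a possible variant is to ask each group for the label of its best row with `idxmax` and select those labels. This is only a sketch on the question's sample data, and it assumes the index labels are unique (true here because the frame is read from a single buffer).

```python
import pandas as pd
from io import StringIO

data = """file,date,time,val1,val2
f1,20jun,01:00,10,210
f1,20jun,02:00,10,110
f2,20jun,01:00,10,320
f2,20jun,02:00,10,50
f2,21jun,01:00,10,130
f2,21jun,02:00,10,230"""
cs = pd.read_csv(StringIO(data))  # default RangeIndex, so labels are unique

# For every (date, time) group, the index label of the row with the largest val2.
best_labels = cs.groupby(['date', 'time'])['val2'].idxmax()

# Select those rows and put them back in their original order.
result = cs.loc[best_labels].sort_index()
print(result)
```

Unlike the boolean filter, `idxmax` keeps exactly one row per group even when several rows tie for the maximum.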

Answer 2

To drop the rows that:

have a duplicated date and time,
while keeping the last row of each set of duplicates,

use:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

There is no need to sort the source rows first.
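Note that `keep='last'` chooses by position only. A small sketch with made-up rows (not the question's files) shows that the surviving row is simply the last one in each duplicate set, whatever its val2:

```python
import pandas as pd

# Hypothetical rows: the first and third share the same date and time.
cs = pd.DataFrame({
    'date': ['20jun', '20jun', '20jun'],
    'time': ['01:00', '02:00', '01:00'],
    'val2': [320, 110, 210],
})

# Row 0 is dropped because row 2 is the last occurrence of (20jun, 01:00),
# even though row 0 has the higher val2.
deduped = cs.drop_duplicates(subset=['date', 'time'], keep='last')
print(deduped)
```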

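Edit

As you wrote, you want to keep, from each group of duplicate rows, the row with the highest val2:

Add ignore_index=True to pd.concat. This gives you an "ordered" index, which is needed to restore the initial row order (the last step).

Then sort the rows: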

cs.sort_values(['date','time','val2'], inplace=True)

so that in every group of duplicates (by date and time), the row with the highest val2 is in the last position.

The third step is:

cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

just as in my first proposal.

The last step, to restore the original row order, is to sort the rows again, this time by index (in place):

cs.sort_index(inplace=True)
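Put together, the four steps might look like the sketch below. The two in-memory strings stand in for the files in fnames and are an assumption made only so the example runs on its own; their contents follow the sample data from the question.

```python
import pandas as pd
from io import StringIO

# Stand-ins for two input files (assumed contents, taken from the sample data).
f1 = """date,time,val1,val2
20jun,01:00,10,210
20jun,02:00,10,110"""
f2 = """date,time,val1,val2
20jun,01:00,10,320
20jun,02:00,10,50
21jun,01:00,10,130
21jun,02:00,10,230"""

# Step 1: concatenate with a fresh, unique 0..n-1 index.
cs = pd.concat([pd.read_csv(StringIO(s)) for s in (f1, f2)], ignore_index=True)

# Step 2: sort so the highest val2 comes last within each (date, time) group.
cs.sort_values(['date', 'time', 'val2'], inplace=True)

# Step 3: keep only the last row of each duplicate group.
cs.drop_duplicates(subset=['date', 'time'], keep='last', inplace=True)

# Step 4: restore the original row order.
cs.sort_index(inplace=True)
print(cs)
```

This keeps the single best row per (date, time) slot, so rows from different files can end up mixed within one date.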

Answer 3

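The "invalid" rows were being dropped because of the index. After pd.concat, the index needs to be reset with cs.reset_index(inplace=True, drop=True). Without the reset, the index of each file starts again from 0. Because some index values are then repeated, the drop function removes every row that carries such a value.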
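A minimal sketch of that effect, using two small made-up frames in place of the files: `drop` works on index labels, so a label shared by rows from different files removes all of them.

```python
import pandas as pd

# Hypothetical stand-ins for two files; each has its own 0-based index.
f1 = pd.DataFrame({'date': ['20jun', '20jun'], 'val2': [210, 110]})
f2 = pd.DataFrame({'date': ['20jun', '21jun'], 'val2': [320, 130]})

cs = pd.concat([f1, f2])
print(cs.index.tolist())  # labels repeat: one run of 0, 1 per file

# Dropping label 0 removes the first row of BOTH files.
after_drop = cs.drop([0])

# With a unique index, the same call removes a single row.
cs_unique = pd.concat([f1, f2], ignore_index=True)
after_drop_unique = cs_unique.drop([0])
print(len(after_drop), len(after_drop_unique))
```

Passing ignore_index=True to pd.concat has the same effect as the reset_index call and avoids the extra step.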

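Although I can now get the correct result, filtering on the date column alone still does not actually work (cs.drop(cs[(cs.date == dp.values[0][0])])). It should work without my having to "reset" the index. Or am I using it wrong?

Thanks for your help. If you have a better, more elegant way to get the expected output, I would appreciate it.

Best regards.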

cs = pd.concat([pd.read_csv(f) for f in fnames])
cs.reset_index(inplace=True,drop=True)
dp = cs[cs.duplicated(['date','time'],keep=False)]
dp = dp.sort_values(['date','time'],ascending=True)

while len(dp)>0:
    if dp.values[0][3] > dp.values[1][3]:
        if dp.index[0] > dp.index[1]:
            cs.drop(cs[(cs.date==dp.values[0][0]) & (cs.index < dp.index[0])].index, inplace=True)
            dp.drop(dp[(dp.date==dp.values[0][0])].index, inplace=True)

Pandas drop function for duplicates is dropping invalid rows

3 answers:
