Question

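I have a pandas DataFrame in which I am trying to replace duplicate values with 0 within a window of a few days (I do not want to delete the values).

So, in the example given below, I want to replace the duplicate values in every column with 0 within a window of 3 days (the number can change). The desired result is also given below.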

So the output should look like:

              A   B  C

01-01-2011   2   10  0
01-02-2011   2   12  2
01-03-2011   2   10  0
01-04-2011   3   11  3
01-05-2011   5   15  0
01-06-2011   5   23  1
01-07-2011   4   21  4
01-08-2011   2   21  5
01-09-2011   1   11  0

Any help would be greatly appreciated.

Answer 1

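For this you can use df.shift() to look at a value from the previous or next row (or several rows away, given by the number x in .shift(x)).

You can combine it with .loc to select all rows that have the same value as one of the 2 rows above, and then replace them with 0.

Something like this should work (the code was edited so that it handles any number of columns and a configurable number of days):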

numberOfDays = 3  # number of days to compare

for col in df.columns:
    # compare each row with the rows 1 .. numberOfDays-1 positions above it
    for x in range(1, numberOfDays):
        df.loc[df[col] == df[col].shift(x), col] = 0

print(df)

This gives me the output:

            A   B  C
date
01-01-2011  2  10  0
01-02-2011  0  12  2
01-03-2011  0   0  0
01-04-2011  3  11  3
01-05-2011  5  15  0
01-06-2011  0  23  1
01-07-2011  4  21  4
01-08-2011  2   0  5
01-09-2011  1  11  0
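
As a self-contained sketch of the same shift idea: the data below is copied from the table in the question and treated as the input, and the dates are taken to be nine consecutive days; both are assumptions. `zero_repeats_within` is a hypothetical helper name. Unlike the loop above, it builds the mask from the original values and returns a copy, so zeros written for an earlier shift cannot affect later comparisons. Note that shift works on row positions, so this only matches "days" if there is exactly one row per day.

```python
import pandas as pd


def zero_repeats_within(df, window=3):
    """Return a copy where a value equal to one of the previous window-1 rows is set to 0."""
    out = df.copy()
    for col in df.columns:
        # True where the value repeats one of the preceding window-1 rows
        mask = pd.Series(False, index=df.index)
        for x in range(1, window):
            mask |= df[col].eq(df[col].shift(x))
        out[col] = df[col].where(~mask, 0)
    return out


# Assumed input: values from the question's table, one row per day
idx = pd.date_range("2011-01-01", periods=9, freq="D", name="date")
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 3, 5, 5, 4, 2, 1],
        "B": [10, 12, 10, 11, 15, 23, 21, 21, 11],
        "C": [0, 2, 0, 3, 0, 1, 4, 5, 0],
    },
    index=idx,
)

result = zero_repeats_within(df, window=3)
print(result)
```

On this data the A, B and C columns of result are the same as in the output shown above, and df itself is left unchanged.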

Answer 2

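I found nothing better than looping over all the columns, because each column leads to a different grouping.
First define a function that does what you want at the group level, i.e. sets every entry except the first to zero: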

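import pandas as pd

def set_zeros(g):
    # keep the first entry of the group and zero out the rest
    g = g.copy()
    g.iloc[1:] = 0
    return g

# pd.Grouper(freq=...) without a key requires df to have a DatetimeIndex
for c in df.columns:
    df[c] = df.groupby([c, pd.Grouper(freq='3D')], as_index=False)[c].transform(set_zeros)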

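This custom function is applied to each group, where a group is defined by a time window (freq='3D') and by equal values of the column within that window. Because the columns generally have their equal values in different rows, this has to be done for each column in a loop.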

For other window lengths, change freq to 5D, 10D or 20D.
For a detailed description of how to define time periods, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
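
One thing that may be worth checking before choosing this approach: pd.Grouper(freq='3D') cuts the index into fixed, non-overlapping 3-day bins, which is not the same as looking back over the previous rows. The sketch below uses a small made-up column (the values and dates are assumptions chosen only to show the effect) and cumcount instead of a custom transform function to mark everything after the first entry of each (value, bin) group.

```python
import pandas as pd

# Made-up data: six consecutive days, so the 3-day bins are Jan 1-3 and Jan 4-6
idx = pd.date_range("2011-01-01", periods=6, freq="D")
df = pd.DataFrame({"A": [1, 2, 3, 3, 4, 4]}, index=idx)

# Number of rows that fall into each 3-day bin
bin_sizes = df["A"].resample("3D").count()

# Position of each row inside its (value, 3-day bin) group; 0 marks the first entry
pos = df.groupby(["A", pd.Grouper(freq="3D")]).cumcount()
binned = df["A"].where(pos == 0, 0)

# For comparison: look back over the previous two rows instead of using fixed bins
prev_equal = df["A"].eq(df["A"].shift(1)) | df["A"].eq(df["A"].shift(2))
rolling = df["A"].where(~prev_equal, 0)

print(bin_sizes)
print(pd.DataFrame({"A": df["A"], "binned": binned, "rolling": rolling}))
```

Here the two 3s sit on Jan 3 and Jan 4, on either side of a bin boundary, so the binned version keeps both while the look-back version zeroes the second one. Which behaviour is wanted depends on how "within 3 days" is meant.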

Pandas: replace/change duplicate values within a time frame

2 answers: