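Possible optimization of a loop over a pandas DataFrame

Asked: 2018-08-09 21:17:01

Tags: python pandas dataframe

I am trying to find a way to optimize a loop over a pandas DataFrame. The dataset has about 450k rows and about 20 columns. The DataFrame uses three location variables as a MultiIndex. For each group, I want to drop its rows if any column is entirely NaN within the group; otherwise I want to fill the NaN values with the group mean.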

LOC = ['market_id', 'midmarket_id', 'submarket_id']

# Assign -1000 to NaN values in the MultiIndex columns
df = df.fillna({c: -1000 for c in LOC})
df = df.set_index(LOC).sort_index(level=LOC)

# Loop over each subset with the same (market, midmarket, submarket)
for k, v in df.copy().groupby(level=LOC):

    # If any column in the group is entirely NaN, drop the group's rows from df
    if v.isnull().all().any():
        df = df.drop(index=k)  # drop() returns a new frame, so assign the result

    # Otherwise fill the remaining NaN values with the group's column means
    else:
        rows = df.index.isin([k])
        df.loc[rows] = df.loc[rows].fillna(v.mean(numeric_only=True))

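So a DataFrame like this (before) should be transformed like this (after), with the rows of groups that have an all-NaN column removed.

Sorry if this is redundant or does not meet the Stack Overflow question guidelines. If anyone has a better solution for this, I would greatly appreciate it.

Thanks.

1 answer:

Answer 0: (score: 0)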

There is no need to copy the entire DataFrame, nor to iterate manually over the GroupBy elements. Here is an alternative solution:
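A minimal sketch of one vectorized approach follows. It is an illustration of the idea, not necessarily the answerer's exact code. It assumes the three location variables are still ordinary columns (before `set_index`), that every other column is numeric, and it keeps the question's -1000 sentinel for missing location keys. The function name `clean_groups` and the `example` frame are hypothetical.

```python
import numpy as np
import pandas as pd

LOC = ['market_id', 'midmarket_id', 'submarket_id']


def clean_groups(df: pd.DataFrame) -> pd.DataFrame:
    """Drop groups that have an all-NaN column; fill other NaNs with group means."""
    # Work on a unique positional index so the row alignment below is unambiguous
    df = df.reset_index(drop=True)
    value_cols = [c for c in df.columns if c not in LOC]

    # Same sentinel as in the question, so rows with missing keys still form groups
    df = df.fillna({c: -1000 for c in LOC})
    grouped = df.groupby(LOC)[value_cols]

    # count() ignores NaN, so a zero marks a column that is all NaN within its group
    non_null = grouped.transform('count')
    keep = (non_null > 0).all(axis=1)

    # Broadcast each group's column means back onto the original rows
    means = grouped.transform('mean')

    out = df.loc[keep].copy()
    out[value_cols] = out[value_cols].fillna(means.loc[keep])
    return out.set_index(LOC).sort_index()


# Hypothetical toy data: group (1, 1, 2) has an all-NaN column 'x' and is dropped
example = pd.DataFrame({
    'market_id':    [1, 1, 1, 1, 2, 2],
    'midmarket_id': [1, 1, 1, 1, 1, 1],
    'submarket_id': [1, 1, 2, 2, 1, 1],
    'x': [1.0, np.nan, np.nan, np.nan, 4.0, 6.0],
    'y': [2.0, 4.0, 1.0, 1.0, np.nan, 3.0],
})
result = clean_groups(example)
```

Both `transform` calls run once over the whole frame, so the per-group Python loop and the repeated `drop` and `loc` assignments go away. If some value columns are not numeric, the `mean` step would need to be restricted to the numeric ones.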