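I'm trying to find a way to optimize a loop over a pandas DataFrame. The dataset has about 450k rows and about 20 columns. The DataFrame uses 3 location variables as a MultiIndex. For each group, I want to drop the group's rows if any column in that group is entirely NaN, and otherwise fill the NaN values with the group mean.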
LOC = ['market_id', 'midmarket_id', 'submarket_id']
# Assign -1000 to multiindex nan values
df = df.fillna({c: -1000 for c in LOC})
df = df.set_index(LOC).sort_index(level=list(range(len(LOC))))
# Looping through subset with same (market, midmarket, submarket)
for k, v in df.copy().groupby(level=list(range(len(LOC)))):
    # If any column in the group is all NaN, drop the group's rows from df
    if v.isnull().all().any():
        # drop() returns a new frame, so the result must be assigned back
        df = df.drop(v.index.values)
    # If there is at least one non-NaN value, fillna with mean
    else:
        df.loc[v.index.values] = df.loc[v.index.values].fillna(v.mean())
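So given a DataFrame like this (before), it should be transformed like this, with the rows of any group that has an all-NaN column removed (after).
Sorry if this is redundant or doesn't follow the Stack Overflow question guidelines. If anyone has a better solution, I'd greatly appreciate it.
Thanks.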
Answer 0 (score: 0)
There is no need to copy the entire DataFrame, nor to iterate manually over the GroupBy elements. Here is an alternative solution:
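The answer's original code is not preserved here, so what follows is only a minimal sketch of one loop-free way to do it, not necessarily the approach the answerer had in mind. It assumes the location columns are still regular columns (before `set_index`), that all other columns are numeric, and that the row index is unique. The function name `drop_or_fill` and the sample data are made up for illustration. The idea is that the per-row group mean from `groupby(...).transform('mean')` is NaN exactly when that column is all NaN within the group, so the same frame of means serves both to decide which rows to drop and to fill the gaps.

```python
import numpy as np
import pandas as pd

LOC = ['market_id', 'midmarket_id', 'submarket_id']


def drop_or_fill(df, loc=LOC):
    """Drop groups that have an all-NaN column; fill other NaN with group means."""
    # Same placeholder as in the question, so NaN location keys still form groups.
    df = df.fillna({c: -1000 for c in loc})
    value_cols = [c for c in df.columns if c not in loc]
    # Group mean of every value column, broadcast back to each row.
    # A mean is NaN only when the whole group is NaN in that column.
    means = df.groupby(loc)[value_cols].transform('mean')
    # Keep rows whose group has a valid mean in every column.
    keep = means.notna().all(axis=1)
    out = df.loc[keep].copy()
    # fillna with a DataFrame aligns on index and columns (index assumed unique).
    out[value_cols] = out[value_cols].fillna(means.loc[keep])
    return out.set_index(loc).sort_index()


# Hypothetical sample data: group (1, 1, 2) has column 'a' entirely NaN.
sample = pd.DataFrame({
    'market_id':    [1, 1, 1, 1, 2, 2],
    'midmarket_id': [1, 1, 1, 1, 1, 1],
    'submarket_id': [1, 1, 2, 2, 1, 1],
    'a': [1.0, np.nan, np.nan, np.nan, 4.0, 6.0],
    'b': [2.0, 4.0, 1.0, 2.0, np.nan, 3.0],
})
result = drop_or_fill(sample)
print(result)
```

Because the work is done by a single `transform` and one boolean mask, there is no Python-level loop over groups and no copy of the frame per iteration, which is where most of the time goes with 450k rows.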