我对以下代码有疑问, 我有一个数据集和一个列表,我想比较我的数据集的每个数据值有两个条件,如果条件为真,那么保持数据帧的先前值,否则使其为无,我的代码适用于小数据集但是我的大数据集需要花费太多时间而没有任何值。有更好的解决方案吗?
new_data=data
for col in df.columns:
for i in range(len(df)):
if (df.iloc[i][col] >list_min[i] ) & (df.iloc[i][col]<list_max[i]):
new_data.set_value(i,col,df.iloc[i][col])
else:
new_data.set_value(i,col,None)
感谢您的评论或其他解决方案。
这是我的代码不起作用:
data = pd.read_csv('./dataset/w.csv')
i=0
data = data.applymap(np.log)
data = data.drop('time', axis=1)
q75_list = []
q25_list = []
iqr_list = []
min_list = []
max_list = []
new_data=data
for col in data.columns.values:
q75_list.append(np.nanpercentile(data[col], 75))
q25_list.append(np.nanpercentile(data[col], 25))
iqr_list = np.array(q75_list) - np.array(q25_list)
min_list = np.array(q25_list) - (np.array(iqr_list * 1.5))
max_list = np.array(q75_list) + (np.array(iqr_list * 1.5))
print("Max :\n",max_list,"\n Min :\n",min_list)
for col in data.columns:
for (i, j) in [(i, j) for i in range(len(data)) for j in range(len(min_list))]:
if (data.iloc[i][col] >min_list[j] ) & (data.iloc[i][col]<max_list[j]):
new_data.set_value(i,col,data.iloc[i][col])
else:
new_data.set_value(i,col,None)
new_data.to_csv('./dataset/result.csv',index=False)
答案 0 :(得分:1)
使用pandas.DataFrame.loc考虑 if / then / else 习语。下面假设 list_min 和 list_max 是 data 中行数相等的列表。
for col in data.columns:
new_data.loc[(data[col] > pd.Series(list_min)) &
(data[col] < pd.Series(list_max)), col] = data[col]
new_data.loc[(data[col] < pd.Series(list_min)) |
(data[col] > pd.Series(list_max)), col] = None
用示例10个cols和50行的随机数据进行演示(播种再现):
数据强>
import pandas as pd
import numpy as np
pd.set_option('display.width', 1000)
np.random.seed(107)
data = pd.DataFrame([[np.random.randint(50) for _ in range(50)] for _ in range(10)]).T
print(data.head())
# 0 1 2 3 4 5 6 7 8 9
# 0 48 17 37 22 1 0 6 14 33 10
# 1 25 38 28 4 36 22 4 24 28 49
# 2 6 5 22 35 14 14 40 41 38 26
# 3 14 43 5 31 38 45 40 5 32 1
# 4 11 30 35 32 20 37 26 39 34 5
list_min = [np.random.randint(50) for _ in range(50)]
print(list_min[:10])
# [37, 17, 33, 24, 0, 46, 11, 4, 25, 41]
list_max = [np.random.randint(50) for _ in range(50)]
print(list_max[:10])
# [45, 37, 49, 38, 31, 9, 20, 39, 7, 36]
<强>操作强>
new_data = data.loc[:,]
for col in data.columns:
new_data.loc[(data[col] > pd.Series(list_min)) &
(data[col] < pd.Series(list_max)), col] = data[col]
new_data.loc[(data[col] < pd.Series(list_min)) |
(data[col] > pd.Series(list_max)), col] = None
print(new_data.head())
# 0 1 2 3 4 5 6 7 8 9
# 0 NaN NaN 37.0 NaN NaN NaN NaN NaN NaN NaN
# 1 25.0 NaN 28.0 NaN 36.0 22.0 NaN 24.0 28.0 NaN
# 2 NaN NaN NaN 35.0 NaN NaN 40.0 41.0 38.0 NaN
# 3 NaN NaN NaN 31.0 38.0 NaN NaN NaN 32.0 NaN
# 4 11.0 30.0 NaN NaN 20.0 NaN 26.0 NaN NaN 5.0
答案 1 :(得分:1)
如果我正确地理解你在做什么,有几个地方你可以尝试对事物进行矢量化。看看这是否加快了速度:
q75s = data.quantile(.75)
q25s = data.quantile(.25)
mins = 2.5*q25s - 1.5*q75s
maxs = 2.5*q75s - 1.5*q25s
newdata = data.copy()
newdata[(data < mins) | (data > maxs)] = None