Pandas DataFrame

Time: 2017-09-29 21:17:20

Tags: python pandas dataframe

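I have a question about the following code. I have a dataset and a list, and I want to check every value in the dataset against two conditions. If both conditions hold, the value should be kept; otherwise it should be set to None. My code works on a small dataset, but on my large dataset it takes far too long and never produces a result. Is there a better solution?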

new_data=data
for col in df.columns:
    for i in range(len(df)):
        if (df.iloc[i][col] >list_min[i] ) & (df.iloc[i][col]<list_max[i]):
            new_data.set_value(i,col,df.iloc[i][col])
        else:
            new_data.set_value(i,col,None)

Thanks for any comments or alternative solutions.

Here is my code, which does not work:

data = pd.read_csv('./dataset/w.csv')
i=0
data = data.applymap(np.log)
data = data.drop('time', axis=1)
q75_list = []
q25_list = []
iqr_list = []
min_list = []
max_list = []
new_data=data
for col in data.columns.values:
    q75_list.append(np.nanpercentile(data[col], 75))
    q25_list.append(np.nanpercentile(data[col], 25))

    iqr_list = np.array(q75_list) - np.array(q25_list)
    min_list = np.array(q25_list) - (np.array(iqr_list * 1.5))
    max_list = np.array(q75_list) + (np.array(iqr_list * 1.5))

print("Max :\n",max_list,"\n Min :\n",min_list)

for col in data.columns:
    for (i, j) in [(i, j) for i in range(len(data)) for j in range(len(min_list))]:

        if (data.iloc[i][col] >min_list[j] ) & (data.iloc[i][col]<max_list[j]):
            new_data.set_value(i,col,data.iloc[i][col])
        else:
            new_data.set_value(i,col,None)

new_data.to_csv('./dataset/result.csv',index=False)

2 Answers:

Answer 0 (score: 1)

Consider the if/then/else idiom using pandas.DataFrame.loc. The code below assumes that list_min and list_max are lists whose length equals the number of rows in data.

for col in data.columns:
    new_data.loc[(data[col] > pd.Series(list_min)) & 
                 (data[col] < pd.Series(list_max)), col] = data[col] 

    new_data.loc[(data[col] < pd.Series(list_min)) |
                 (data[col] > pd.Series(list_max)), col] = None

To demonstrate with random example data of 10 columns and 50 rows (seeded for reproducibility):

Data

import pandas as pd
import numpy as np

pd.set_option('display.width', 1000)

np.random.seed(107)
data = pd.DataFrame([[np.random.randint(50) for _ in range(50)] for _ in range(10)]).T
print(data.head())
#     0   1   2   3   4   5   6   7   8   9
# 0  48  17  37  22   1   0   6  14  33  10
# 1  25  38  28   4  36  22   4  24  28  49
# 2   6   5  22  35  14  14  40  41  38  26
# 3  14  43   5  31  38  45  40   5  32   1
# 4  11  30  35  32  20  37  26  39  34   5

list_min = [np.random.randint(50) for _ in range(50)]
print(list_min[:10])
# [37, 17, 33, 24, 0, 46, 11, 4, 25, 41]
list_max = [np.random.randint(50) for _ in range(50)]
print(list_max[:10])
# [45, 37, 49, 38, 31, 9, 20, 39, 7, 36]

Operation

new_data = data.loc[:,]

for col in data.columns:
        new_data.loc[(data[col] > pd.Series(list_min)) & 
                     (data[col] < pd.Series(list_max)), col] = data[col] 

        new_data.loc[(data[col] < pd.Series(list_min)) |
                     (data[col] > pd.Series(list_max)), col] = None

print(new_data.head())
#       0     1     2     3     4     5     6     7     8    9
# 0   NaN   NaN  37.0   NaN   NaN   NaN   NaN   NaN   NaN  NaN
# 1  25.0   NaN  28.0   NaN  36.0  22.0   NaN  24.0  28.0  NaN
# 2   NaN   NaN   NaN  35.0   NaN   NaN  40.0  41.0  38.0  NaN
# 3   NaN   NaN   NaN  31.0  38.0   NaN   NaN   NaN  32.0  NaN
# 4  11.0  30.0   NaN   NaN  20.0   NaN  26.0   NaN   NaN  5.0
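The per-column loop above could probably be dropped altogether by comparing the whole frame against the row-wise bounds at once. The following is only a minimal sketch under stated assumptions: the tiny frame and the bounds are made-up illustration values, and DataFrame.gt / DataFrame.lt with axis=0 are used so that each bound lines up with its row. Unlike the two .loc assignments above, a value exactly equal to a bound becomes NaN here, which matches the else branch in the question.

```python
import pandas as pd

# Hypothetical example: 4 rows, 2 columns, one (min, max) pair per row.
data = pd.DataFrame({"a": [5, 12, 7, 30], "b": [8, 3, 9, 25]})
list_min = [4, 10, 8, 20]
list_max = [9, 15, 12, 28]

# Give the bounds the same index as the rows so they align row by row.
lower = pd.Series(list_min, index=data.index)
upper = pd.Series(list_max, index=data.index)

# True where a value lies strictly between the bounds of its row.
in_range = data.gt(lower, axis=0) & data.lt(upper, axis=0)

# Keep in-range values; everything else becomes NaN. data itself is untouched.
new_data = data.where(in_range)
print(new_data)
```

Because DataFrame.where returns a new frame, the original data stays available for comparison, which is not the case when new_data is just another name for data.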

Answer 1 (score: 1)

If I understand correctly what you are doing, there are a few places where you can vectorize things. See whether this speeds it up:

q75s = data.quantile(.75)
q25s = data.quantile(.25)
mins = 2.5*q25s - 1.5*q75s
maxs = 2.5*q75s - 1.5*q25s

newdata = data.copy()
newdata[(data < mins) | (data > maxs)] = None
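To see why the coefficients 2.5 and 1.5 appear, it may help to write the bounds through the interquartile range: q25 - 1.5*(q75 - q25) simplifies to 2.5*q25 - 1.5*q75, and likewise for the upper bound. The sketch below is a self-contained check on made-up data (the column names and values are assumptions for illustration, not taken from the question's CSV); it uses DataFrame.mask as an alternative to boolean assignment.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column contains one obvious outlier.
data = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, 11.0, 12.0, 13.0, -50.0],
})

# Per-column quartiles; each result is a Series indexed by column name.
q25s = data.quantile(0.25)
q75s = data.quantile(0.75)
iqr = q75s - q25s

# Bounds at 1.5 * IQR outside the quartiles, one pair per column.
mins = q25s - 1.5 * iqr
maxs = q75s + 1.5 * iqr

# Comparing a DataFrame with a Series matches the Series labels to columns.
outliers = (data < mins) | (data > maxs)

# Replace flagged values with NaN in a new frame.
newdata = data.mask(outliers)
print(newdata)
```

The comparisons run over whole columns at once, so no Python-level loop over rows or cells is needed.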