How to remove the values next to 'NaN' values in a time-efficient way?

Time: 2017-03-31 05:44:11

Tags: python performance for-loop time

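I'm trying to remove wrong values from my data (a series of 15 million values, 700 MB). The values to be removed are the ones next to 'nan' values, e.g.:

Series: /1/,nan,/2/,3,/4/,nan,nan,nan,/8/,9. The numbers surrounded by slashes, i.e. /1/, /2/, /4/, /8/, are the values that should be removed.

The problem is that computing this with the following code takes far too long: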

%%time

import numpy as np
import pandas as pd

# sample data
speed = np.random.uniform(0,25,15000000)
next_speed = speed[1:]

# create a dataframe
data_dict = {'speed': speed[:-1],
            'next_speed': next_speed}

df = pd.DataFrame(data_dict)


# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences

# add 'nan' to data in form of a string. 

for i in range(len(df.difference)):
    # arbitrary condition
    if df.difference[i] < -2:
        df.difference[i] = 'nan'

#########################################
# THE TIME-INEFFICIENT LOOP

# remove wrong values before and after 'nan'.
for i in  range(len(df)):

    # check if the value is a number to skip computations of the following "if" cases
    if not(isinstance(df.difference[i], str)):
        continue

    # case 1: where there's only one 'nan' surrounded by values. 
    # Without this case the algo will miss some wrong values because 'nan' will be removed
    # Example of a series: /1/,nan,/2/,3,4,nan,nan,nan,8,9
    # A number surrounded by slashes e.g. /1/ is a value to be removed
    if df.difference[i] == 'nan' and df.difference[i-1] != 'nan' and df.difference[i+1] != 'nan':
        df.difference[i-1]= 'wrong'
        df.difference[i+1]= 'wrong'

    # case 2: where the following values are 'nan': /1/, nan, nan, 4
    # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,8,9
    elif df.difference[i] == 'nan' and df.difference[i+1] == 'nan':
        df.difference[i-1]= 'wrong'

    # case 3: where next value is NOT 'nan'  wrong, nan,nan,4 
        # E.g.: /1/, nan,/2/,3,/4/,nan,nan,nan,/8/,9
    elif df.difference[i] == 'nan' and df.difference[i+1] != 'nan':
        df.difference[i+1]= 'wrong'

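How can I make it more time-efficient?

2 answers:

Answer 0: (score: 1)

This is still a work in progress for me. I cut your dummy data size down by a factor of 100 to get to something I could stand to wait for.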

I also added this code at the top of my version:

import time

current_milli_time = lambda: int(round(time.time() * 1000))

def mark(s):
    print("[{}] {}".format(current_milli_time()/1000, s))

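This just prints a string with a timestamp on it, to show which step is taking so long.

With that in place, you can replace the manual list building in the 'difference' column computation with a vector operation. This code: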

df = pd.DataFrame(data_dict)

mark("Got DataFrame")

# calculate difference between the current speed and the next speed
list_of_differences = []

for i in df.index:
    difference = df.next_speed[i]-df.speed[i]
    list_of_differences.append(difference)

df['difference'] = list_of_differences
mark("difference 1")

df['difference2'] = df['next_speed'] - df['speed']
mark('difference 2')

print(df[:10])

produces this output:

[1490943913.921] Got DataFrame
[1490943922.094] difference 1
[1490943922.096] difference 2
   next_speed      speed  difference  difference2
0   18.008314  20.182982   -2.174669    -2.174669
1   14.736095  18.008314   -3.272219    -3.272219
2    5.352993  14.736095   -9.383102    -9.383102
3    5.854199   5.352993    0.501206     0.501206
4    2.003826   5.854199   -3.850373    -3.850373
5   12.736061   2.003826   10.732236    10.732236
6    2.512623  12.736061  -10.223438   -10.223438
7   18.224716   2.512623   15.712093    15.712093
8   14.023848  18.224716   -4.200868    -4.200868
9   15.991590  14.023848    1.967741     1.967741

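Note that the two difference columns are identical, but by the timestamps the loop took about 8 seconds while the vectorized version took about 2 milliseconds. (Presumably the loop would need around 800 seconds with 100 times as much data.)

I did the same thing in the 'nanify' step: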

df.loc[df.difference2 < -2, 'difference2'] = np.nan

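The idea here is that many of the binary operators, applied to a column, actually produce a Series or vector rather than a single value. That result can be used as an index, so that df.difference2 < -2 becomes (in essence) a list of the places where the condition is true, and you can then index either df (the whole table) or any of the columns of df, like df.difference2, using that index. It's a fast shorthand for the otherwise slow Python for loop.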
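To make the mask idea concrete, here is a minimal sketch on a tiny made-up frame (the five values are invented for illustration; only the column name difference2 comes from the answer). It assumes a reasonably recent pandas and uses .loc for the assignment.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the answer's difference2 column.
df = pd.DataFrame({"difference2": [0.5, -3.2, 1.9, -2.1, -1.0]})

# The comparison runs element-wise and yields a boolean Series.
mask = df["difference2"] < -2
print(mask.tolist())  # [False, True, False, True, False]

# Using the mask with .loc assigns only in the rows where it is True.
df.loc[mask, "difference2"] = np.nan
print(df["difference2"].isna().tolist())  # [False, True, False, True, False]
```

The same mask can select rows of the whole table (df[mask]) or of a single column, which is what replaces the explicit for loop.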

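Update

Okay, finally, here is a vectorized version of the "Time-inefficient Loop". I'm just pasting the whole thing in at the bottom, for copying.

The premise is that the Series.isnull() method returns a boolean Series (column) that is true wherever the contents are "missing" or "invalid" or "bogus." Generally this means NaN, but it also recognizes Python None, etc.

The tricky part, in pandas, is shifting that column up or down by one to capture the "around" relationship.

That is, I want another boolean column where col[n-1] is true if col[n] is null. That is my "before a nan" column. Likewise, I want another column where col[n+1] is true if col[n] is null. That is my "after a nan" column.

It turns out I had to take the damn thing apart! I had to reach in and extract the underlying numpy array using the Series.values attribute, so that the pandas index would be discarded. Then a new index gets created, starting at 0, and everything works again. (If you don't strip the index, the columns "remember" what their row numbers are supposed to be. So even if you delete column[0], the column doesn't shift down. Instead it knows "I'm missing my [0] value, but everyone else is still in the right place!")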
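The alignment behaviour described above can be seen on a four-row toy Series (the data is made up for illustration). The sketch also shows Series.shift with fill_value, which is an alternative I am suggesting rather than what the answer used; it moves the values while leaving the index alone, so no detour through .values is needed.

```python
import pandas as pd

missing = pd.Series([False, True, False, False])

# Slicing keeps the original labels: this piece is still labelled 1, 2, 3.
tail = missing[1:]
print(tail.index.tolist())  # [1, 2, 3]

# Assigning it to a frame aligns on labels, so nothing moves up;
# row 0 simply has no matching label and ends up as NaN.
frame = pd.DataFrame({"is_nan": missing})
frame["aligned"] = tail
print(frame["aligned"].isna().tolist())  # [True, False, False, False]

# shift(-1) moves every value up one row; the vacated last row gets False.
before_nan = missing.shift(-1, fill_value=False)
print(before_nan.tolist())  # [True, False, False, False]
```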

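Anyway, with that figured out, I was able to build three columns (needlessly, since they could probably be parts of a single expression) and then merge them into a fourth column that indicates what you want: the column is True when the row is before, on, or after a nan value.

Here is the whole thing: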

missing = df.difference2.isnull()
df['is_nan'] = missing
df['before_nan'] = np.append(missing[1:].values, False)
df['after_nan'] = np.insert(missing[:-1].values, 0, False)
df['around_nan'] = df.is_nan | df.before_nan | df.after_nan
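As a check, here is a small sketch that runs the same idea on the ten-value series from the question and then drops the flagged rows. The shift-based columns are my own alternative spelling, not the answer's; the sketch compares them with the np.append / np.insert construction above. The final filtering step is an assumption about what "remove" should mean here.

```python
import numpy as np
import pandas as pd

# The series from the question: 1, nan, 2, 3, 4, nan, nan, nan, 8, 9
s = pd.Series([1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9])
missing = s.isnull()

# The answer's construction: shift the raw boolean array by hand.
before_np = np.append(missing[1:].values, False)
after_np = np.insert(missing[:-1].values, 0, False)

# Alternative spelling with Series.shift (keeps the index intact).
before_nan = missing.shift(-1, fill_value=False)  # the next row is NaN
after_nan = missing.shift(1, fill_value=False)    # the previous row is NaN

# Both constructions flag the same rows.
print((before_nan.values == before_np).all(), (after_nan.values == after_np).all())

around_nan = missing | before_nan | after_nan

# Keep only the rows that are neither NaN nor next to a NaN.
cleaned = s[~around_nan]
print(cleaned.tolist())  # [3.0, 9.0]
```

Only 3 and 9 survive, which matches the question's example where /1/, /2/, /4/ and /8/ are the values to remove.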

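Answer 1: (score: 0)

I'm assuming you want to keep neither the 'nan' values nor the wrong values, and that the nan values are few compared to the size of the data. Please try this: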

nan_idx = df[df['difference']=='nan'].index.tolist()

from copy import deepcopy
drop_list = deepcopy(nan_idx)


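for i in nan_idx:
    # collect the neighbours in drop_list (the original appended to an undefined name, mm)
    if (i+1) not in(drop_list) and (i+1) < len(df):
        drop_list.append(i+1)
    # the row before must not fall off the start of the frame
    if (i-1) not in(drop_list) and (i-1) >= 0:
        drop_list.append(i-1)

# drop() returns a new frame, so keep the result
df = df.drop(df.index[drop_list])

If nan is not a string but the missing value NaN, then use this to get its indices:

nan_idx = df[pd.isnull(df['difference'])].index.tolist()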
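If the list of NaN positions is large, the Python loop over it can itself become slow. A possible loop-free variant of the same index-based idea is sketched below; it is my own illustration, not part of this answer, and it assumes real NaN values and the question's ten-value example as data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"difference": [1, np.nan, 2, 3, 4, np.nan, np.nan, np.nan, 8, 9]})

# Integer positions of the NaN rows.
nan_pos = np.flatnonzero(df["difference"].isnull().to_numpy())

# Each NaN position plus its two neighbours, de-duplicated and sorted.
drop_pos = np.unique(np.concatenate([nan_pos - 1, nan_pos, nan_pos + 1]))

# Discard neighbours that fall outside the frame.
drop_pos = drop_pos[(drop_pos >= 0) & (drop_pos < len(df))]

result = df.drop(df.index[drop_pos])
print(result["difference"].tolist())  # [3.0, 9.0]
```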