Python csv: iterate over cells to find larger values and delete the matching rows

Time: 2017-04-06 07:00:36

Tags: python pandas dataframe comparison where

What I am trying to do is similar to this MySQL DELETE statement:

    DELETE FROM ABCD WHERE val_2001>val_2000*1.5 OR val_2001>val_1999*POW(1.5,2);

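Here val_2001, val_2000 and val_1999 are all column names. The query therefore performs the following 3 operations:

1. Comparing val_2001 with val_2000 * 1.5
2. ORing that with a comparison of val_2001 against val_1999 * 1.5^2 (val_1999 is a constant)
3. Deleting the whole row from the table if the condition is satisfied.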
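The three steps above can be sketched in pandas as a boolean mask followed by boolean indexing. This is a minimal illustration, not the asker's code: the sample values are made up, and only the three column names come from the query.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "val_1999": [100.0, 100.0, 100.0],
    "val_2000": [100.0, 100.0, 200.0],
    "val_2001": [112.0, 160.0, 230.0],
})

# Rows matching the WHERE clause of the DELETE statement.
to_delete = (df["val_2001"] > df["val_2000"] * 1.5) | (
    df["val_2001"] > df["val_1999"] * 1.5 ** 2
)

# DELETE removes the matching rows, so keep the rows that do NOT match.
kept = df[~to_delete]
print(kept)
```

The second row matches the first comparison and the third row matches only the second one, so a single row remains.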

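I want to write this in Python rather than MySQL, because the data is a CSV file and I want to avoid uploading it to a database. My current code is as follows:

    import pandas as pd

    df = pd.read_csv("singleDataFile.csv")
    for values in range(2000, 2016):
        val2 = values + 1
        # keep a value only if it is a large jump or drop versus the previous year
        df['val_'+str(val2)] = df['val_'+str(val2)].where((df['val_'+str(val2)]>df['val_'+str(values)]*1.5) | (df['val_'+str(val2)]<df['val_'+str(values)]*0.75))

    print(df)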

I also tried an alternative approach:

    df = pd.read_csv("singleDataFile.csv")
    cols = [ 'val_{}'.format(c) for c in range(2000, 2018)]
    df = pd.DataFrame(df, columns = cols)
    df[(df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)] = 'NULL'

In both cases it fails with:

    TypeError: can't multiply sequence by non-int of type 'float'
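This message is what Python raises when a string is multiplied by a float, which suggests that at least one of the value columns was read as text rather than numbers. The sketch below reproduces that situation with a made-up CSV (the `unknown` cell is an assumption, not taken from the asker's file) and converts the columns with `pd.to_numeric`.

```python
import io
import pandas as pd

# Hypothetical CSV: one non-numeric cell is enough to make the column text.
raw = io.StringIO("val_2000,val_2001\n100,112.05\n100,unknown\n")
df = pd.read_csv(raw)

# The column is not numeric, so its cells are Python strings.
cell = df["val_2001"].iloc[0]
try:
    cell * 1.5
except TypeError as exc:
    message = str(exc)
    print(message)

# Coerce every column to numbers; unparseable cells become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
scaled = numeric["val_2001"] * 1.5
print(scaled)
```

After the conversion the arithmetic works, and the bad cell shows up as NaN, which compares as False in the later `>` and `<` checks.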

In addition, neither approach even attempts to delete the whole row. How can I achieve this?

CSV TABLE SNIPPET:

val_2000   val_2001        val_2002            val_2003
100     112.058663384525    119.070787312921    117.033250060214
100     118.300395256917    124.655238202362    128.723125524235
100     109.333236619151    116.785836024946    117.390803371386
100     120.954175930764    126.099776250454    124.491022271481
100     107.776153227575    105.560100052722    108.07490649383
100     151.596517146962    306.608812920781    124.610273175528

Note: the columns before val_2000 hold an index and some names, and they should not be included in the iteration.
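One way to respect that note is to compute the condition only on the `val_` columns and then apply the resulting row mask to the full frame. The sketch below assumes a single label column called `name` (a hypothetical stand-in for the real index and name columns) and checks the direction used in the SQL statement, where a year is compared against 1.5 times the previous year.

```python
import pandas as pd

# Hypothetical frame; "name" stands in for the non-value columns.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "val_2000": [100.0, 100.0, 100.0],
    "val_2001": [112.0, 151.6, 107.8],
    "val_2002": [119.1, 306.6, 105.6],
})

# Select only columns named val_<four digits>; label columns are left out.
vals = df.filter(regex=r"^val_\d{4}$")

# Align each year with the previous year's value; the first year gets NaN.
prev = vals.shift(axis=1)

# Flag a row when any year exceeds 1.5 times the previous year.
# Comparisons against NaN are False, so the first year never triggers.
jump = (vals > prev * 1.5).any(axis=1)

# Filter the full frame so the label columns are preserved.
result = df[~jump]
print(result)
```

Because the mask is a row-indexed boolean Series, it can filter the original frame even though it was built from a subset of its columns.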

1 answer:

Answer 0 (score: 1)

It seems you need any to check for at least one True per row, then invert the result with ~ and filter by boolean indexing:

    # convert all values to float
    df = df.astype(float)

    # if there are bad values (like strings in numeric columns), replace them with NaN
    # df = df.apply(pd.to_numeric, errors='coerce')

    print((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75))
       val_2000  val_2001  val_2002  val_2003
    0     False     False     False     False
    1     False     False     False     False
    2     False     False     False     False
    3     False     False     False     False
    4     False     False     False     False
    5     False      True      True      True

    # any(axis=1) is True for rows with at least one True; ~ inverts it
    print(~((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75)).any(axis=1))
    0     True
    1     True
    2     True
    3     True
    4     True
    5    False
    dtype: bool

    # keep only the rows where no cell met the condition
    df = df[~((df.shift(axis=1) > df * 1.5) | (df.shift(axis=1) < df * 0.75)).any(axis=1)]
    print(df)
       val_2000    val_2001    val_2002    val_2003
    0     100.0  112.058663  119.070787  117.033250
    1     100.0  118.300395  124.655238  128.723126
    2     100.0  109.333237  116.785836  117.390803
    3     100.0  120.954176  126.099776  124.491022
    4     100.0  107.776153  105.560100  108.074906

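If I understand correctly (IIUC), you need:

    # names of the constant columns for the years 1995 to 1999
    const = ['val_' + str(x) for x in range(1995, 2000)]
    print(const)
    ['val_1995', 'val_1996', 'val_1997', 'val_1998', 'val_1999']

    # set every constant column to 1
    for x in const:
        df[x] = 1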