What I am trying to do is similar to this MySQL DELETE statement:
DELETE FROM ABCD WHERE val_2001>val_2000*1.5 OR val_2001>val_1999*POW(1.5,2);
Here val_2001, val_2000, and val_1999 are all column names. So the query performs the following 3 operations:
1. Comparing col-b (val_2001) with col-a (val_2000).
2. OR-ing that with a comparison of col-b against val_1999 (a constant).
3. Deleting the whole row from the table if the condition is satisfied.
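I am writing this in Python (rather than MySQL, because the data is a CSV and I want to avoid uploading it to a database). My current code is as follows: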
import pandas as pd

df = pd.read_csv("singleDataFile.csv")
for values in xrange(2000,2016):
    val2 = values+1
    df['val_'+str(val2)] = df['val_'+str(val2)].where((df['val_'+str(val2)]>df['val_'+str(values)]*1.5) | (df['val_'+str(val2)]<df['val_'+str(values)]*0.75))
print(df)
I also tried an alternative approach:
df = pd.read_csv("singleDataFile.csv")
cols = [ 'val_{}'.format(c) for c in range(2000, 2018)]
df = pd.DataFrame(df, columns = cols)
df[(df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)] = 'NULL'
In both cases it ends with an error:
TypeError: can't multiply sequence by non-int of type 'float'
Moreover, neither approach even attempts to delete the whole row. How can I achieve this?
CSV TABLE SNIPPET:
val_2000  val_2001          val_2002          val_2003
100       112.058663384525  119.070787312921  117.033250060214
100       118.300395256917  124.655238202362  128.723125524235
100       109.333236619151  116.785836024946  117.390803371386
100       120.954175930764  126.099776250454  124.491022271481
100       107.776153227575  105.560100052722  108.07490649383
100       151.596517146962  306.608812920781  124.610273175528
Note: the columns before val_2000 contain an index and some name columns, which should not be included in the iteration.
Answer 0 (score: 1)
It seems you need any to check whether each row has at least one True, then invert the result with ~ and filter by boolean indexing:
#convert all values to float
df = df.astype(float)
#if there are bad values (such as strings in numeric columns), replace them with NaN instead
#df = df.apply(pd.to_numeric, errors='coerce')
print ((df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75))
   val_2000  val_2001  val_2002  val_2003
0     False     False     False     False
1     False     False     False     False
2     False     False     False     False
3     False     False     False     False
4     False     False     False     False
5     False      True      True      True
print (~((df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)).any(axis=1))
0 True
1 True
2 True
3 True
4 True
5 False
dtype: bool
df = df[~((df.shift(axis = 1) > df * 1.5) | (df.shift(axis = 1) < df * 0.75)).any(axis=1)]
print (df)
   val_2000    val_2001    val_2002    val_2003
0     100.0  112.058663  119.070787  117.033250
1     100.0  118.300395  124.655238  128.723126
2     100.0  109.333237  116.785836  117.390803
3     100.0  120.954176  126.099776  124.491022
4     100.0  107.776153  105.560100  108.074906
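The frame above holds only the val_ columns. The question notes that the CSV also has index and name columns before val_2000, so the following is a minimal sketch of applying the same filter to the year columns only while still dropping whole rows. The sample data and the `name` column are made up for illustration. The values are deliberately stored as strings, since columns read as strings are one likely cause of the TypeError in the question (an assumption, as the real file is not shown).

```python
import pandas as pd

# Hypothetical sample: a 'name' column followed by year columns stored as strings
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "val_2000": ["100", "100", "100"],
    "val_2001": ["112.06", "107.78", "151.60"],
    "val_2002": ["119.07", "105.56", "306.61"],
    "val_2003": ["117.03", "108.07", "124.61"],
})

# Pick only the year columns; other columns stay out of the comparison
val_cols = [c for c in df.columns if c.startswith("val_")]

# Coerce to numbers; anything that cannot be parsed becomes NaN
vals = df[val_cols].apply(pd.to_numeric, errors="coerce")

# Previous year's value, aligned under the current year's column
prev = vals.shift(axis=1)

# Same condition as above: True where a cell is flagged
flagged = (prev > vals * 1.5) | (prev < vals * 0.75)

# Keep rows without any flagged cell; the mask indexes the original frame,
# so the 'name' column is retained in the result
result = df[~flagged.any(axis=1)]
print(result)
```

Comparisons against NaN evaluate to False, so the first year column (which has no previous year) never flags a row on its own.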
If I understand correctly, you need:
const = ['val_'+ str(x) for x in range(1995,2000)]
print (const)
['val_1995', 'val_1996', 'val_1997', 'val_1998', 'val_1999']
for x in const:
    df[x] = 1
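For comparison, here is a minimal sketch of a direct translation of the DELETE statement from the question into a boolean mask. The sample values are made up, and val_1999 is given directly in the sample frame rather than filled in as a constant, purely for illustration.

```python
import pandas as pd

# Hypothetical sample data; val_1999 is included directly for illustration
df = pd.DataFrame({
    "val_1999": [100.0, 100.0, 100.0],
    "val_2000": [100.0, 100.0, 160.0],
    "val_2001": [112.0, 151.6, 230.0],
})

# WHERE val_2001 > val_2000 * 1.5 OR val_2001 > val_1999 * POW(1.5, 2)
to_delete = (df["val_2001"] > df["val_2000"] * 1.5) | (
    df["val_2001"] > df["val_1999"] * 1.5 ** 2
)

# DELETE removes the matching rows, so keep the rows where the condition is False
df = df[~to_delete]
print(df)
```

Here the second row matches the first comparison and the third row matches only the second one, so just the first row remains.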