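I am reading two CSV files, selecting data from specific columns, and dropping NA/null values. I then want to use the rows in one file that meet some condition to print the associated data from the other file: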
import pandas

data1 = pandas.read_csv(filename1, usecols=['X', 'Y', 'Z']).dropna()
data2 = pandas.read_csv(filename2, usecols=['X', 'Y', 'Z']).dropna()
i = 0
for item in data1['Y']:
    if item > -20:
        # label-based lookup: this is the line that raises the KeyError below
        print(data2['X'][i])
    i += 1
But this gives me an error:
File "hashtable.pyx", line 381, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:7035)
File "hashtable.pyx", line 387, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:6976)
KeyError: 6L
When I print data2['X'], I see that a number is missing from the row index:
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
7 -1.928778
...
How can I fix this and renumber the index values? Or is there a better way?
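Answer 0 (score: 1)
Found the solution in another question: Reindexing dataframes
.reset_index(drop=True)
does the trick! After calling it on the result of dropna(), the index is contiguous again: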
0 -1.953779
1 -2.010039
2 -2.562191
3 -2.723993
4 -2.302720
5 -2.356181
6 -1.928778
7 -1.925359
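To illustrate why this helps, here is a minimal sketch. The CSV contents are made up for the example and are not the original files. dropna() removes rows but keeps their original labels, so the index has a gap; reset_index(drop=True) renumbers the remaining rows from 0 and discards the old labels.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for the real file;
# the third data row (label 2) has a missing Y value.
csv_text = "X,Y,Z\n1.0,10.0,0.1\n2.0,20.0,0.2\n3.0,,0.3\n4.0,40.0,0.4\n"

raw = pd.read_csv(io.StringIO(csv_text), usecols=["X", "Y", "Z"])

# dropna() removes the row with the NaN but keeps the original labels,
# so label 2 is now missing from the index.
dropped = raw.dropna()
gap_labels = list(dropped.index)

# reset_index(drop=True) renumbers the rows 0..n-1 and throws away
# the old labels instead of keeping them as a new column.
renumbered = dropped.reset_index(drop=True)
new_labels = list(renumbered.index)

print(gap_labels)   # [0, 1, 3]
print(new_labels)   # [0, 1, 2]
```

Note that renumbering each frame separately only makes the labels contiguous. If dropna() removed different rows from the two files, row i of one frame may no longer correspond to row i of the other.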
Answer 1 (score: 1)
Are your two files/DataFrames the same length? If so, you can take advantage of a boolean mask and do this instead (and avoid the for loop):
data2['X'][data1['Y'] > -20]
Edit: in reply to the comment
Here is what happens in between:
In [16]: df1
Out[16]:
   X  Y
0  0  0
1  1  2
2  2  4
3  3  6
4  4  8
In [17]: df2
Out[17]:
    Y   X
0  64  75
1  65  73
2  36  44
3  13  58
4  92  54
# creates a pandas Series object of True/False, which you can then use as a "mask"
In [18]: df2['Y'] > 50
Out[18]:
0 True
1 True
2 False
3 False
4 True
Name: Y, dtype: bool
# mask is applied element-wise to (in this case) the column of your DataFrame you want to filter
In [19]: df1['X'][ df2['Y'] > 50 ]
Out[19]:
0 0
1 1
4 4
Name: X, dtype: int64
# same as doing this (where the mask is applied to the whole DataFrame, and then you grab your column)
In [20]: df1[ df2['Y'] > 50 ]['X']
Out[20]:
0 0
1 1
4 4
Name: X, dtype: int64
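The same selection can also be written with .loc, which picks rows and column in one step rather than chaining two lookups. The sketch below reuses the illustrative df1 and df2 from the session above; the variable names mask, selected and positional are introduced here only for the example. It assumes both frames have the same length, and the positional variant is one possible way to handle frames whose index labels differ.

```python
import pandas as pd

# Same illustrative frames as in the session above.
df1 = pd.DataFrame({"X": [0, 1, 2, 3, 4], "Y": [0, 2, 4, 6, 8]})
df2 = pd.DataFrame({"Y": [64, 65, 36, 13, 92], "X": [75, 73, 44, 58, 54]})

# Boolean Series: True where the condition holds.
mask = df2["Y"] > 50

# .loc takes the row mask and the column label together.
# The mask is matched to df1 by index label, so the labels must line up.
selected = df1.loc[mask, "X"]
print(selected.tolist())  # [0, 1, 4]

# If the two frames have the same length but different index labels,
# a plain boolean array applies the mask by position instead of by label.
positional = df1["X"][mask.values]
print(positional.tolist())  # [0, 1, 4]
```

Matching by label is usually the safer default, since it makes a mismatch between the two frames visible instead of silently pairing unrelated rows.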