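I have applied PCA to an array of about 1000 observations, but I only want to keep an observation in the new array if a particular feature in the original data equals a given value.
I have a numpy array df2 and a DataFrame df. I want to find all rows of df2 for which df.Position is CDM.
My actual data: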
df2
[[ -6.00987823e+00 4.46585005e+00]
[ -7.09055159e+00 1.89437600e+00]
[ -5.91044431e+00 -1.97888707e+00]
[ -4.85698965e+00 -1.09936724e+00]
[ -4.01780368e-01 -2.57178392e+00]
[ -2.97351215e+00 -3.15940358e+00]
[ -4.27973589e+00 2.82707326e+00]
[ 3.95086576e+00 1.08281922e+00]
[ -2.94075361e+00 -1.95544661e+00]
[ -4.83788056e+00 2.32369496e+00]
[ -5.00473716e+00 -3.37680552e-01]
[ -4.88905829e+00 -1.55527476e+00]
[ -3.38202709e+00 -1.04402867e+00]
[ -2.14261510e+00 -5.30757477e-01]
[ 3.00813803e-01 -2.11010985e+00]
[ -2.67824986e+00 -1.83303905e+00]
[ -1.64547049e+00 -2.48056250e+00]
[ -2.92550543e+00 -3.02363170e+00]
[ -4.01116933e+00 2.90363840e+00]
[ -1.04571206e+00 7.58064433e-01]
[ 2.34068739e-01 -2.33981296e+00]
[ 3.15597517e+00 1.09429188e+00]
[ -3.83828970e+00 1.14195305e-01]
[ -7.33794066e-01 -3.70152816e+00]
[ 8.21789967e-01 -4.77818413e-01]
[ -3.29257688e+00 -1.61887349e+00]
[ -4.24297171e+00 2.27187714e+00]
[ 1.45714199e+00 -3.56024788e+00]
[ 1.79855738e+00 -3.71818328e-01]
[ 3.68171085e-01 -3.52961707e+00]
[ 3.77585412e+00 -3.01627595e-01]
[ -4.21740128e+00 -1.30913719e+00]
[ -3.85041585e+00 -1.05515969e+00]
[ -5.01752378e+00 4.67348167e-01]
[ 3.65943448e+00 9.21016483e-01]
[ 3.12159896e+00 -1.25707872e-01]
[ -4.50219722e+00 -4.06752784e+00]
[ -3.92172250e+00 -2.88567430e+00]
[ -2.68908475e-01 -2.17506629e+00]
[ -1.13728112e+00 -2.66843007e+00]
[ -8.73467957e-01 -1.24389494e+00]
[ 3.21966300e+00 -1.35271239e-01]
[ -4.31060796e+00 -1.90505910e+00]
[ 3.73904981e+00 7.70228802e-01]
[ 1.02646986e+00 -5.91828676e-01]
[ 8.43840480e-01 -1.49636218e+00]
[ 1.54065978e+00 -1.65086030e+00]
[ 2.96602068e+00 -7.41024474e-01]
[ 6.53636345e-01 3.04647288e-01]
[ 2.59236989e+00 -6.70435261e-02]
[ 2.00184665e-01 -1.55230314e+00]
[ -7.29533092e-01 -2.73390749e+00]
[ -2.93578745e+00 -2.18118257e+00]
[ -4.37481195e+00 1.02701222e+00]
[ 1.00713302e+00 -1.39943282e+00]
...]
df
(just positions in football/soccer: FB, CB, CDM, CM, AM, FW)
Position
FW
FW
FW
FW
FB
AM
FW
CB
AM
FW
AM
FW
AM
CM
FB
AM
CM
CM
FW
CM
CDM
CB
AM
FB
CDM
FW
FW
CDM
FB
CDM
CB
AM
...
AM
When filtering, I get this output (along with a FutureWarning):
Where am I going wrong, and how do I filter the data correctly?
Answer 0 (score: 1)
The FutureWarning is probably caused by outdated versions of numpy and pandas. You can upgrade them with:
pip install --upgrade numpy pandas
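Before upgrading, it can help to check which versions are currently installed. A minimal sketch, where the helper name installed_versions is made up for illustration:

```python
import numpy as np
import pandas as pd


def installed_versions():
    """Return the version strings of the installed numpy and pandas."""
    return {"numpy": np.__version__, "pandas": pd.__version__}


print(installed_versions())
```

Running it again after the upgrade should show whether the new versions were actually picked up by the interpreter you are using.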
As for filtering, there are several options. I show two of them here on dummy data.
Setup
df
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
df2
0 1
0 0.122 0.222
1 0.343 0.345
2 0.345 0.563
Option 1: boolean indexing
df2[df.colour == 'red']
Out[726]:
0 1
0 0.122 0.222
1 0.343 0.345
Option 2: df.eval
df2[df.eval('colour == "red"')]
Out[732]:
0 1
0 0.122 0.222
1 0.343 0.345
Note that both options also work when df2 is a numpy array of the form:
array([[ 0.122, 0.222],
[ 0.343, 0.345],
[ 0.345, 0.563]])
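When df2 is a plain numpy array, the mask can also be converted to a numpy array explicitly, so that pandas index alignment plays no role. The following is only a sketch on the dummy data above, and it assumes the rows of df2 are in the same order as the rows of df:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"colour": ["red", "red", "blue"]})
df2 = np.array([[0.122, 0.222],
                [0.343, 0.345],
                [0.345, 0.563]])

# Boolean Series -> plain boolean numpy array, one entry per row of df
mask = (df["colour"] == "red").to_numpy()

# The mask needs exactly one entry per row of df2
assert mask.shape[0] == df2.shape[0]

# Keep only the rows of df2 whose matching row in df is 'red'
red_rows = df2[mask]
print(red_rows)
```

The length check is worth keeping in real code: if rows were dropped from df before the PCA step, the mask and the array no longer line up.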
For your actual data, you do the same thing:
df2
array([[-6.01 , 4.466],
[-7.091, 1.894],
[-5.91 , -1.979],
[-4.857, -1.099],
[-0.402, -2.572],
[-2.974, -3.159],
[-4.28 , 2.827],
[ 3.951, 1.083],
[-2.941, -1.955],
[-4.838, 2.324],
[-5.005, -0.338],
[-4.889, -1.555],
[-3.382, -1.044],
[-2.143, -0.531],
[ 0.301, -2.11 ],
[-2.678, -1.833],
[-1.645, -2.481],
[-2.926, -3.024],
[-4.011, 2.904],
[-1.046, 0.758],
[ 0.234, -2.34 ],
[ 3.156, 1.094],
[-3.838, 0.114],
[-0.734, -3.702],
[ 0.822, -0.478],
[-3.293, -1.619],
[-4.243, 2.272],
[ 1.457, -3.56 ],
[ 1.799, -0.372],
[ 0.368, -3.53 ],
[ 3.776, -0.302],
[-4.217, -1.309]])
df
Position
0 FW
1 FW
2 FW
3 FW
4 FB
5 AM
6 FW
7 CB
8 AM
9 FW
10 AM
11 FW
12 AM
13 CM
14 FB
15 AM
16 CM
17 CM
18 FW
19 CM
20 CDM
21 CB
22 AM
23 FB
24 CDM
25 FW
26 FW
27 CDM
28 FB
29 CDM
30 CB
31 AM
df2[df.Position == 'CDM']
array([[ 0.234, -2.34 ],
[ 0.822, -0.478],
[ 1.457, -3.56 ],
[ 0.368, -3.53 ]])
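If you also want to know which observations were kept, the mask can be turned into row numbers. This is a sketch using the 32 positions listed above; the array df2 here is a placeholder with made-up values, not the PCA output from the question:

```python
import numpy as np
import pandas as pd

positions = ["FW", "FW", "FW", "FW", "FB", "AM", "FW", "CB",
             "AM", "FW", "AM", "FW", "AM", "CM", "FB", "AM",
             "CM", "CM", "FW", "CM", "CDM", "CB", "AM", "FB",
             "CDM", "FW", "FW", "CDM", "FB", "CDM", "CB", "AM"]
df = pd.DataFrame({"Position": positions})

# Placeholder for the PCA output: 32 rows, 2 columns (values are made up)
df2 = np.arange(64, dtype=float).reshape(32, 2)

mask = (df["Position"] == "CDM").to_numpy()

# Row numbers of the observations that satisfy the condition
cdm_rows = np.flatnonzero(mask)
print(cdm_rows)  # [20 24 27 29]

# Indexing by row number selects the same rows as indexing by mask
same = np.array_equal(df2[cdm_rows], df2[mask])
print(same)  # True
```

The row numbers 20, 24, 27 and 29 correspond to the four rows returned by df2[df.Position == 'CDM'] above.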
Answer 1 (score: 1)
I think you need boolean indexing:
from sklearn.decomposition import PCA
import pandas as pd
d = {'d': [4, 5, 5],
'a': [1, 2, 1],
'name': ['john', 'james', 'jane'],
'e': [5, 6, 7],
'f': [6, 7, 8], 'c': [3, 4, 3],
'b': [2, 3, 2],
'colour': ['red', 'red', 'blue']}
cols = ['name', 'colour', 'a', 'b', 'c', 'd', 'e', 'f']
df = pd.DataFrame(d, columns = cols)
print (df)
name colour a b c d e f
0 john red 1 2 3 4 5 6
1 james red 2 3 4 5 6 7
2 jane blue 1 2 3 5 7 8
#create mask by condition
mask = df['colour'] == 'red'
#for multiple values
#mask = df['colour'].isin(['red', 'green', 'blue'])
print (mask)
0 True
1 True
2 False
Name: colour, dtype: bool
#filter only numeric values and convert to numpy array
arr = df.drop(['name','colour'], axis=1).values
print (arr)
[[1 2 3 4 5 6]
[2 3 4 5 6 7]
[1 2 3 5 7 8]]
#n_components must not exceed min(n_samples, n_features), which is 3 here
pca = PCA(n_components=3)
pca.fit(arr)
print (pca.components_ )
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]
[ 0.62531284 -0.13184966 -0.136648 -0.71363037 0.17840759 0.17840759]]
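#filter by condition
#note: components_ holds the principal axes, not the observations;
#to filter the observations themselves, index pca.transform(arr) the same way
arr1 = pca.components_[mask]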
print (arr1)
[[-0.0463861 -0.0463861 -0.0463861 -0.35279184 -0.65919758 -0.65919758]
[ 0.55515147 0.55515147 0.55515147 0.21897879 -0.11719389 -0.11719389]]
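Since the question is about keeping observations, the mask is more naturally applied to the transformed data than to pca.components_. A minimal sketch on the same dummy data, assuming two components are wanted (the choice of n_components=2 is an assumption for illustration):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.DataFrame({
    "name": ["john", "james", "jane"],
    "colour": ["red", "red", "blue"],
    "a": [1, 2, 1], "b": [2, 3, 2], "c": [3, 4, 3],
    "d": [4, 5, 5], "e": [5, 6, 7], "f": [6, 7, 8],
})

# Boolean mask built from the original DataFrame, one entry per observation
mask = (df["colour"] == "red").to_numpy()

# Numeric columns only, as a numpy array
arr = df.drop(columns=["name", "colour"]).to_numpy()

# fit_transform returns one row per observation, one column per component
pca = PCA(n_components=2)
scores = pca.fit_transform(arr)

# Keep only the projected observations whose original row is 'red'
red_scores = scores[mask]
print(red_scores.shape)
```

Here scores has one row per observation, so the mask from df lines up with it row by row, which is the situation described in the question.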