sklearn / PCA - error when trying to transform high-dimensional data

Time: 2020-10-19 12:32:05

Tags: python pandas scikit-learn pca

I get a data error when trying to use PCA to reduce high-dimensional vectors to 2 dimensions.

Here is my input data; each row has 300 dimensions:

                                                  vector
0      [0.01053525, -0.007869658, 0.0024931028, -0.04...
1      [-0.024436072, -0.016484523, 0.03859031, 0.000...
2      [0.015011676, -0.020465894, 0.004854744, -0.00...
3      [-0.010836455, -0.006562917, 0.00265073, 0.022...
4      [-0.018123362, -0.026007563, 0.04781856, -0.03...
...                                                  ...
45124  [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125  [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126  [-0.021875003, -0.005635035, 0.0076896898, -0....
45127  [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128  [0.007794927, 0.0019561667, 0.15995999, -0.054...

[45129 rows x 1 columns]

My code:

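import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_parquet('1.parquet', engine='fastparquet')
pca = PCA(n_components=2)  # target of 2 dimensions, as stated above
reduced = pca.fit_transform(data)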

Error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.

Edit

>>> data.shape
(45129, 1)
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   vector  45129 non-null  object
dtypes: object(1)
memory usage: 352.7+ KB


1 answer:

Answer 0 (score: 1)

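Scikit-learn does not know how to handle a column whose cells are arrays (lists), so you need to expand that column into one numeric column per element. Because every row holds an array of the same size, and there are only about 45,000 rows, this is fairly easy to do. Once the data is expanded, PCA works.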

import pandas as pd
from sklearn.decomposition import PCA

# Toy frame with one column of equal-length lists, like the "vector" column in the question
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})

# Expand the list column into one numeric column per element
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
      0     1     2
0  0.01  0.02  0.03
1  0.04  0.40  0.10

pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01,  1.43048962e-17],
       [-1.93778224e-01,  1.43048962e-17]])
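Applied to a frame shaped like the one in the question, the same idea might look like the sketch below. It is only an illustration under stated assumptions: the data is synthetic (random 300-element lists in a column named "vector", with 1,000 rows instead of 45,129) rather than read from the parquet file, and the length check is an extra safeguard that the answer above does not include.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the question's frame (an assumption, not the real data):
# a single object column "vector" whose cells are 300-element lists.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {"vector": [rng.normal(size=300).tolist() for _ in range(1000)]}
)

# Optional safeguard: every row must have the same length to form a 2-D array.
lengths = data["vector"].map(len)
assert lengths.nunique() == 1, "rows have differing vector lengths"

# Expand the column of lists into a (n_rows, 300) float array.
X = np.array(data["vector"].tolist(), dtype=float)

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(X.shape, reduced.shape)
```

Passing the 2-D array X (or the equivalent expanded DataFrame) gives PCA one numeric feature per column, which is what fit_transform expects; the original one-column object frame does not.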