I get a data error when trying to use PCA to reduce high-dimensional vectors to 2 dimensions.
This is my input data; each row has 300 dimensions:
vector
0 [0.01053525, -0.007869658, 0.0024931028, -0.04...
1 [-0.024436072, -0.016484523, 0.03859031, 0.000...
2 [0.015011676, -0.020465894, 0.004854744, -0.00...
3 [-0.010836455, -0.006562917, 0.00265073, 0.022...
4 [-0.018123362, -0.026007563, 0.04781856, -0.03...
... ...
45124 [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125 [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126 [-0.021875003, -0.005635035, 0.0076896898, -0....
45127 [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128 [0.007794927, 0.0019561667, 0.15995999, -0.054...
[45129 rows x 1 columns]
My code:
data = pd.read_parquet('1.parquet', engine='fastparquet')
reduced = pca.fit_transform(data)
Error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.
Edit:
>>data.shape
(45129, 1)
>>data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vector 45129 non-null object
dtypes: object(1)
memory usage: 352.7+ KB
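Answer 0 (score: 1)
scikit-learn cannot handle a column whose cells are arrays (lists): it tries to convert each cell to a single float, which is what raises the error above. You need to expand that column so that each vector element gets its own column. Because every row holds an array of the same length, this is fairly easy, and with only about 45,000 rows it is cheap to do. Once the data is expanded, PCA works as expected.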
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
0 1 2
0 0.01 0.02 0.03
1 0.04 0.40 0.10
pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01, 1.43048962e-17],
[-1.93778224e-01, 1.43048962e-17]])
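The same idea applied to a frame shaped like the one in the question might look like the sketch below. It assumes the column is named `vector` (as shown in the `data.info()` output) and uses randomly generated data as a stand-in for `1.parquet`. The helper `expand_vector_column` is a hypothetical name introduced here for illustration, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def expand_vector_column(df, column="vector"):
    """Turn a column of equal-length lists into a 2-D float array.

    Hypothetical helper for illustration only.
    """
    lengths = df[column].map(len)
    if lengths.nunique() != 1:
        # PCA needs a rectangular matrix, so ragged rows are rejected early.
        raise ValueError("rows have different vector lengths")
    # tolist() yields a list of per-row sequences; np.array stacks them row-wise.
    return np.array(df[column].tolist(), dtype=np.float64)


# Synthetic stand-in for the parquet file: 100 rows, one 300-element list per row.
rng = np.random.default_rng(0)
data = pd.DataFrame({"vector": [list(row) for row in rng.normal(size=(100, 300))]})

X = expand_vector_column(data)
reduced = PCA(n_components=2).fit_transform(X)
print(X.shape, reduced.shape)  # (100, 300) (100, 2)
```

With the real file, `data` would come from `pd.read_parquet(...)` as in the question. The length check is there only to fail with a clearer message if some rows turn out to have a different number of elements.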