I am trying to use PCA on data represented as a NumPy array.
This is my code so far:
def principal_components():
    pca = PCA(n_components=2)
    print type(windows(training_files))
    training_result = pca.fit_transform(windows(training_files))
    print type(training_result)
    testing_result = pca.transform(windows(test_files))
    return training_result, testing_result
The type of windows(training_files) is <type 'numpy.ndarray'>, and so is the type of windows(testing_files).
I get this traceback:
Traceback (most recent call last):
  File "/Users/saqibali/PycharmProjects/sensorLogProject/FeatureSelection/FeatureSelection.py", line 63, in <module>
    principal_components()
  File "/Users/saqibali/PycharmProjects/sensorLogProject/FeatureSelection/FeatureSelection.py", line 57, in principal_components
    training_result = pca.fit_transform(list(windows(training_files)))
  File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/decomposition/pca.py", line 324, in fit_transform
    U, S, V = self._fit(X)
  File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/decomposition/pca.py", line 346, in _fit
    copy=self.copy)
  File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

Process finished with exit code 1
How do I correctly pass a NumPy array into PCA.fit_transform()? If I can't, how should I reformat my data so that I can use PCA?
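For reference, as far as I understand, fit_transform() expects a 2-D numeric array of shape (n_samples, n_features). Here is a minimal sketch of that, assuming scikit-learn and NumPy are installed. The random arrays are made-up stand-ins for the sensor windows, not my real data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up stand-ins for the x, y, z readings (assumption: 3 features per row).
rng = np.random.RandomState(0)
train = rng.rand(10, 3)   # 10 samples, 3 features
test = rng.rand(4, 3)     # must have the same number of features as train

pca = PCA(n_components=2)
train_2d = pca.fit_transform(train)  # fit on the training data, then project it
test_2d = pca.transform(test)        # reuse the components fitted above

print(train_2d.shape)
print(test_2d.shape)
```

Both results keep one row per input sample and have n_components columns, so whatever windows() returns would presumably need to end up in this layout.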
Here is windows():
def windows(files):
    x = []
    for my_files in files:
        df = pd.DataFrame(columns=['timestamp', 'time skipped', 'x', 'y', 'z', 'label']).set_index('timestamp')
        with open(os.path.join("/Users", "saqibali", "PycharmProjects", "sensorLogProject", "Data", my_files), 'rU')\
                as my_file:
            for d in sliding_window(sample_difference(my_file), 500, 250):
                df = df.append(d)
                x = np.append(x, df[['x', 'y', 'z']].values.tolist)
    return x
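One thing that may be worth checking in this function is the np.append call, which passes values.tolist without parentheses. The sketch below is only an illustration under assumptions: the small values array is a hypothetical stand-in for df[['x', 'y', 'z']].values, and np.vstack is one possible way to keep the rows 2-D, not necessarily the right fix for this data:

```python
import numpy as np

# Hypothetical stand-in for df[['x', 'y', 'z']].values
values = np.array([[-0.055908, -0.729034, -0.645294],
                   [-0.046158, -0.709091, -0.650177]])

# Without parentheses, values.tolist is the method object itself,
# so np.append stores that object and the result has dtype=object.
x_bad = np.append([], values.tolist)
print(x_bad.dtype, x_bad.shape)

# Converting such an array to floats fails with a TypeError.
try:
    np.array(x_bad, dtype=np.float64)
    conversion_failed = False
except TypeError:
    conversion_failed = True
print(conversion_failed)

# Calling the method gives real numbers, but np.append without an axis
# flattens everything into a 1-D array.
x_flat = np.append([], values.tolist())
print(x_flat.shape)

# Stacking the per-window arrays keeps the 2-D (n_samples, 3) layout.
x_2d = np.vstack([values, values])
print(x_2d.shape)
```

If this is what happens in windows(), it would be consistent with the result still being reported as a numpy.ndarray while check_array cannot convert its contents to floats.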
A sample file looks like this:
1501514704.745, 0, -0.055908, -0.729034, -0.645294, 3
1501514704.755, 0, -0.046158, -0.709091, -0.650177, 3
1501514704.765, 0, -0.036469, -0.699554, -0.672668, 3
1501514704.775, 0, -0.027908, -0.695740, -0.678070, 3
1501514704.785, 0, -0.027725, -0.678802, -0.697052, 3
1501514704.795, 0, -0.037491, -0.660660, -0.719605, 3
sliding_window(sample_difference(my_file), 500, 250) returns:
[                                     x         y         z
timestamp
1970-01-01 00:00:01.501514704 -0.055908 -0.729034 -0.645294
1970-01-01 00:00:01.501514704 -0.046158 -0.709091 -0.650177
1970-01-01 00:00:01.501514704 -0.036469 -0.699554 -0.672668
1970-01-01 00:00:01.501514704 -0.027908 -0.695740 -0.678070
.
.
.
timestamp                             x         y         z
1970-01-01 00:00:01.501514705  0.186447 -0.733322 -1.018127
1970-01-01 00:00:01.501514705  0.151810 -0.722305 -0.996490
1970-01-01 00:00:01.501514705  0.112946 -0.712280 -1.001602
1970-01-01 00:00:01.501514705  0.091904 -0.725403 -0.982437
The output of x is [].
The output of df[['x', 'y', 'z']].values.tolist() is:
[[-0.055908000000000006, -0.729034, -0.6452939999999999], [-0.046158, -0.709091, -0.650177], [-0.044601, -0.657684, -0.8261569999999999], [-0.022994999999999998, -0.6634979999999999, -0.8344879999999999], [-0.021896000000000002, -0.647369, -0.848801], [-0.0054020000000000006, -0.673096, -0.787338], [-0.99649], [0.112946, -0.71228, -1.001602]...