Passing a NumPy array to pca.fit_transform

Asked: 2017-08-17 14:42:19

Tags: arrays numpy scikit-learn pca

I am trying to run PCA on data that is represented as a NumPy array.

This is my code so far:

def principal_components():
    pca = PCA(n_components=2)
    print type(windows(training_files))
    training_result = pca.fit_transform(windows(training_files))
    print type(training_result)
    testing_result = pca.transform(windows(test_files))
    return training_result, testing_result

windows(training_files) is of type <type 'numpy.ndarray'>, as is windows(test_files).

I get this traceback:

Traceback (most recent call last):
File "/Users/saqibali/PycharmProjects/sensorLogProject/FeatureSelection/FeatureSelection.py", line 63, in <module>
principal_components()
File "/Users/saqibali/PycharmProjects/sensorLogProject/FeatureSelection/FeatureSelection.py", line 57, in principal_components
training_result = pca.fit_transform(list(windows(training_files)))
File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/decomposition/pca.py", line 324, in fit_transform
U, S, V = self._fit(X)
File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/decomposition/pca.py", line 346, in _fit
copy=self.copy)
File "/Users/saqibali/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number

Process finished with exit code 1

How do I correctly pass a NumPy array to PCA.fit_transform()? If I can't, how should I reformat my data so that I can use PCA on it?

Here is windows():

def windows(files):
    x = []
    for my_files in files:
        df = pd.DataFrame(columns=['timestamp', 'time skipped', 'x', 'y', 'z', 'label']).set_index('timestamp')
        with open(os.path.join("/Users", "saqibali", "PycharmProjects", "sensorLogProject", "Data", my_files), 'rU')\
            as my_file:
            for d in sliding_window(sample_difference(my_file), 500, 250):
                df = df.append(d)
        x = np.append(x, df[['x', 'y', 'z']].values.tolist)
    return x
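
One detail in the function above is worth checking: `df[['x', 'y', 'z']].values.tolist` is referenced without parentheses, so `np.append` receives the method object rather than the list it would return. The sketch below reproduces that with a small stand-in array (`values` is an assumption standing in for `df[['x', 'y', 'z']].values`) and also shows that `np.append` without an `axis` argument flattens its inputs to 1-D. An object-dtype array would be consistent with the `float()` error in the traceback, though that link is an inference and not something confirmed by the post.

```python
import numpy as np

# Stand-in for df[['x', 'y', 'z']].values (two rows taken from the sample file).
values = np.array([[-0.055908, -0.729034, -0.645294],
                   [-0.046158, -0.709091, -0.650177]])

# Missing parentheses: the bound method itself is appended, not its result,
# so NumPy falls back to an object-dtype array.
bad = np.append([], values.tolist)
print(bad.dtype)   # object

# Calling the method appends the numbers, but with no axis argument
# np.append flattens everything into a 1-D array.
flat = np.append([], values.tolist())
print(flat.dtype, flat.shape)   # float64 (6,)
```

Even with the call fixed, the flattened result is 1-D, while PCA expects a 2-D array of shape (n_samples, n_features).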

A sample file looks like this:

1501514704.745, 0, -0.055908, -0.729034, -0.645294, 3
1501514704.755, 0, -0.046158, -0.709091, -0.650177, 3
1501514704.765, 0, -0.036469, -0.699554, -0.672668, 3
1501514704.775, 0, -0.027908, -0.695740, -0.678070, 3
1501514704.785, 0, -0.027725, -0.678802, -0.697052, 3
1501514704.795, 0, -0.037491, -0.660660, -0.719605, 3

sliding_window(sample_difference(my_file), 500, 250) returns:

[                                      x         y         z
timestamp                                                  
1970-01-01 00:00:01.501514704 -0.055908 -0.729034 -0.645294
1970-01-01 00:00:01.501514704 -0.046158 -0.709091 -0.650177
1970-01-01 00:00:01.501514704 -0.036469 -0.699554 -0.672668
1970-01-01 00:00:01.501514704 -0.027908 -0.695740 -0.678070
.
.
.
                                      x         y         z
timestamp
1970-01-01 00:00:01.501514705  0.186447 -0.733322 -1.018127
1970-01-01 00:00:01.501514705  0.151810 -0.722305 -0.996490
1970-01-01 00:00:01.501514705  0.112946 -0.712280 -1.001602
1970-01-01 00:00:01.501514705  0.091904 -0.725403 -0.982437

The output of x is []

The output of df[['x', 'y', 'z']].values.tolist() is:

[[-0.055908000000000006, -0.729034, -0.6452939999999999], [-0.046158, -0.709091, -0.650177], [-0.044601, -0.657684, -0.8261569999999999], [-0.022994999999999998, -0.6634979999999999, -0.8344879999999999], [-0.021896000000000002, -0.647369, -0.848801], [-0.0054020000000000006, -0.673096, -0.787338], [-0.99649], [0.112946, -0.71228, -1.001602]...
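
The list above contains a one-element row (`[-0.99649]`) among the three-element rows. If that row is real and not just a truncation in the paste, NumPy cannot build a rectangular float array from it. A minimal sketch of one way to prepare the data, assuming incomplete rows can simply be dropped: keep only the three-value rows, stack the per-file blocks with `np.vstack`, and check the shape and dtype before calling PCA. The helper `to_feature_block` and the variables `file_a` and `file_b` are hypothetical; the numbers are rounded values from the output above.

```python
import numpy as np
from sklearn.decomposition import PCA


def to_feature_block(rows, n_features=3):
    """Hypothetical helper: keep only complete rows, return a 2-D float array."""
    complete = [row for row in rows if len(row) == n_features]
    return np.asarray(complete, dtype=float).reshape(-1, n_features)


# Rows copied (rounded) from the output shown above, including the short one.
file_a = [[-0.055908, -0.729034, -0.645294],
          [-0.046158, -0.709091, -0.650177],
          [-0.99649],
          [0.112946, -0.71228, -1.001602]]
file_b = [[-0.044601, -0.657684, -0.826157],
          [-0.022995, -0.663498, -0.834488],
          [-0.021896, -0.647369, -0.848801]]

# Stack the per-file blocks row-wise so the result stays (n_samples, 3).
X = np.vstack([to_feature_block(f) for f in (file_a, file_b)])
print(X.shape, X.dtype)   # (6, 3) float64

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)   # (6, 2)
```

Dropping short rows silently can hide a parsing problem upstream, so it may be better to find out where the incomplete row comes from before relying on this.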

0 answers