Data preprocessing

Time: 2018-04-15 07:00:03

Tags: python scikit-learn

I need help. I'm a beginner and I'm really confused. Here is the code where my preprocessing starts.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:6].values

from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)

It uses this dataset (incomplete: I only pasted about 10 rows, since the real file has 10000):

  

Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
1/6/2012,328.34,328.77,323.68,648.24,"5,405,900"
1/9/2012,322.04,322.29,309.46,620.76,"11,688,800"
1/10/2012,313.7,315.72,307.3,621.43,"8,824,000"
1/11/2012,310.59,313.52,309.4,624.25,"4,817,800"
1/12/2012,314.43,315.26,312.08,627.92,"3,764,400"
1/13/2012,311.96,312.3,309.37,623.28,"4,631,800"

I get this error:

Traceback (most recent call last):

  File "<ipython-input-10-94c47491afd8>", line 3, in <module>
    training_set_scaled = sc.fit_transform(training_set)

  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\base.py", line 517, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
    return self.partial_fit(X, y)

  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
    estimator=self, dtype=FLOAT_DTYPES)

  File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)

ValueError: could not convert string to float: '1,770,000'

Sample code showing how to fix this would be very helpful.

2 answers:

Answer 0: (score: 0)

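You need to remove the commas from the numbers: float("7,380,500") fails. I don't know how, or whether, you can change your data, but if you can, str.replace(',', '') removes all commas from a numeric string. Since your file is a CSV, make sure you apply this only to the numeric columns and not to every comma in the file.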
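As a rough sketch of that idea, the snippet below strips the thousands separators only from the numeric fields. It uses the standard-library csv module, which respects the quotes around "7,380,500", so the field separators are never touched. The in-memory sample and the helper name to_float are illustrative assumptions, not part of the original question.

```python
import csv
import io

# Hypothetical in-memory sample laid out like the file in the question.
sample = '''Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
'''


def to_float(text):
    """Drop thousands separators, then convert the string to a float."""
    return float(text.replace(',', ''))


# DictReader honours the quotes, so "7,380,500" arrives as a single field.
rows = list(csv.DictReader(io.StringIO(sample)))

# Only the numeric columns are cleaned; the Date column is left alone.
numeric_columns = ['Open', 'High', 'Low', 'Close', 'Volume']
cleaned = [[to_float(row[name]) for name in numeric_columns] for row in rows]

print(cleaned[0])  # [325.25, 332.83, 324.97, 663.59, 7380500.0]
```

The resulting list of lists can be passed to np.array(...) and then to the scaler, since every entry is now a float.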

Answer 1: (score: 0)

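You can use the 'thousands' parameter of 'read_csv'. It parses the data, removes the commas between the digits in the 'Volume' column, and converts the column to int (the default), which can then easily be converted to float.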

dataset_train = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')

dataset_train['Volume'].dtype
# Output: int64
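To show how this fits back into the code from the question, here is a minimal sketch that goes from reading the data to scaling it. It assumes the file has the column layout shown in the question; the three-row in-memory sample stands in for Google_Stock_Price_Train.csv and is only illustrative.

```python
import io

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for Google_Stock_Price_Train.csv.
csv_text = '''Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
'''

# thousands=',' makes pandas read "7,380,500" as the integer 7380500.
dataset_train = pd.read_csv(io.StringIO(csv_text), thousands=',')

# Same column selection as in the question (Open through Volume),
# converted explicitly to float before scaling.
training_set = dataset_train.iloc[:, 1:6].values.astype(float)

sc = MinMaxScaler(feature_range=(0, 1))
training_set_scaled = sc.fit_transform(training_set)

print(training_set_scaled.shape)  # (3, 5)
```

With the real file, replacing io.StringIO(csv_text) with the file name should be the only change needed, provided every numeric column uses ',' solely as a thousands separator.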