I need help. I'm a beginner and I'm really confused. This is the code at the start of my preprocessing.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import training set
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:6].values
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
I'm using this dataset (incomplete; I only included 10 rows, since there are actually 10,000):
Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
1/6/2012,328.34,328.77,323.68,648.24,"5,405,900"
1/9/2012,322.04,322.29,309.46,620.76,"11,688,800"
1/10/2012,313.7,315.72,307.3,621.43,"8,824,000"
1/11/2012,310.59,313.52,309.4,624.25,"4,817,800"
1/12/2012,314.43,315.26,312.08,627.92,"3,764,400"
1/13/2012,311.96,312.3,309.37,623.28,"4,631,800"
I get this error:
Traceback (most recent call last):
File "<ipython-input-10-94c47491afd8>", line 3, in <module>
training_set_scaled = sc.fit_transform(training_set)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\base.py", line 517, in fit_transform
return self.fit(X, **fit_params).transform(X)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 308, in fit
return self.partial_fit(X, y)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "C:\Users\MAx\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: '1,770,000'
Sample code showing how to fix this would be very helpful.
Answer 0: (score: 0)
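You need to remove the commas from the numbers: float("7,380,500") fails, because float() does not accept thousands separators.
I don't know how, or whether, you can change the data, but if you can, str.replace(',', '') will remove all commas from a number string. Since your file is a CSV, you need to make sure this is applied only to the numeric columns and not to every comma in the file, because the other commas are the field delimiters.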
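The approach above could be sketched roughly as follows. This is only an illustration, not code from the question: it uses the standard-library csv module on a small inline sample instead of the real Google_Stock_Price_Train.csv, and the helper name to_float is hypothetical.

```python
import csv
import io

# Hypothetical inline sample standing in for Google_Stock_Price_Train.csv.
sample = '''Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
'''

NUMERIC_COLUMNS = ('Open', 'High', 'Low', 'Close', 'Volume')


def to_float(text):
    """Strip thousands separators from one numeric field, then convert it."""
    return float(text.replace(',', ''))


rows = []
for record in csv.DictReader(io.StringIO(sample)):
    # The csv reader has already split each line on the delimiter commas,
    # so the only commas left are those inside quoted fields like "7,380,500".
    rows.append([to_float(record[name]) for name in NUMERIC_COLUMNS])

print(rows[0])
```

Because the replacement runs on individual fields after parsing, the delimiter commas are never touched and the Date column is left alone.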
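Answer 1: (score: 0)
You can use the 'thousands' parameter of 'read_csv'. It parses the data, removes the commas between the digits in the 'Volume' column, and converts the column to int (the default), which can then easily be converted to float.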
dataset_train = pd.read_csv('Google_Stock_Price_Train.csv', thousands=',')
dataset_train['Volume'].dtype
# Output: int64
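To tie this back to the question, here is a minimal sketch of how the parsed frame might feed the original preprocessing step. It assumes a small inline sample in place of the real file, and it stops just before scaling; the resulting array is what would be passed to sc.fit_transform.

```python
import io

import pandas as pd

# Hypothetical inline sample standing in for Google_Stock_Price_Train.csv.
sample = io.StringIO('''Date,Open,High,Low,Close,Volume
1/3/2012,325.25,332.83,324.97,663.59,"7,380,500"
1/4/2012,331.27,333.87,329.08,666.45,"5,749,400"
1/5/2012,329.83,330.75,326.89,657.21,"6,590,300"
''')

# thousands=',' makes the parser drop the separators inside quoted numbers.
dataset_train = pd.read_csv(sample, thousands=',')

# Same column selection as in the question, cast explicitly to float
# so every column has one numeric dtype before scaling.
training_set = dataset_train.iloc[:, 1:6].values.astype(float)

print(training_set.dtype, training_set.shape)
```

With the array fully numeric, the string-to-float conversion that raised the ValueError no longer has any strings to trip over.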