Using scikit-learn (sklearn), how do I handle missing data for linear regression?

Time: 2015-10-13 22:53:33

Tags: python pandas machine-learning scikit-learn linear-regression

I tried this, but couldn't get it to work for my data: Use Scikit Learn to do linear regression on a time series pandas data frame

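My data consists of 2 DataFrames: DataFrame_1.shape = (40, 5000) and DataFrame_2.shape = (40, 74). I'm trying to do some type of linear regression, but DataFrame_2 contains NaN missing data values. When I call DataFrame_2.dropna(how="any"), the shape drops to (2, 74).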
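To see why `dropna(how="any")` discards so many rows, it can help to count the missing values per column. The following is only a minimal sketch: the frame `df2` and its values are made up as a small stand-in for DataFrame_2.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for DataFrame_2: 5 rows, 3 columns, NaNs scattered around.
df2 = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, 2.0, 3.0, 4.0, 5.0],
    "c": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# how="any" drops a row as soon as one column is NaN, so scattered gaps add up.
rows_left = df2.dropna(how="any").shape[0]

# Missing values per column: useful for spotting columns that are mostly complete.
nan_per_column = df2.isna().sum()

print(rows_left)
print(nan_per_column.to_dict())
```

Each column here is missing only one value, yet three of the five rows are lost, which is the same effect as the drop from 40 rows to 2.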

Is there a linear regression algorithm in sklearn that can handle NaN values?

I'm modeling this after load_boston from sklearn.datasets, where X, y = boston.data, boston.target have shapes (506, 13) and (506,).

Here is my simplified code:

from sklearn.linear_model import LinearRegression

X = DataFrame_1
for col in DataFrame_2.columns:
    y = DataFrame_2[col]
    model = LinearRegression()
    model.fit(X, y)  # ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I used the format above so that the shapes match up as matrices.

If posting DataFrame_2 would help, please comment below and I'll add it.

2 answers:

Answer 0 (score: 3)

You can use imputation to fill in the null values in y. In scikit-learn, this is done with the following snippet:

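# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer()  # the default strategy is "mean"
# The input must be 2-D, e.g. a one-column frame such as DataFrame_2[[col]].
y_imputed = imputer.fit_transform(y)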
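As a fuller sketch of the imputation idea, the example below fills the NaNs in a single target column with that column's mean and then fits the regression. It uses `SimpleImputer` from `sklearn.impute`, which replaces the older `Imputer` class in recent scikit-learn releases. The arrays `X` and `y` are made-up data, not the asker's.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # hypothetical predictors
y = X @ np.array([1.0, 2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)
y[[3, 7, 11]] = np.nan        # knock out a few target values

# SimpleImputer expects 2-D input, so reshape the 1-D target into one column.
imputer = SimpleImputer(strategy="mean")
y_imputed = imputer.fit_transform(y.reshape(-1, 1)).ravel()

# With the gaps filled, the fit no longer raises the NaN ValueError.
model = LinearRegression().fit(X, y_imputed)
print(np.isnan(y_imputed).sum(), model.coef_.shape)
```

Keep in mind that mean-imputing a target invents observations, so this is probably best treated as a stopgap rather than a substitute for real data.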

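Otherwise, you might want to build the model using a subset of the 74 columns as predictors; perhaps some of your columns contain fewer null values?

Answer 1 (score: 0)

If the variable is a DataFrame, you can use fillna. Here I replace the missing data with the mean of that column.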
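One way to act on the "fewer nulls" suggestion is sketched below. It keeps only the columns whose share of missing values is under a threshold, and for each one fits on just the rows where that column is observed. The frames `X` and `Y`, the column names, and the 25% threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical stand-ins for DataFrame_1 (predictors) and DataFrame_2 (targets).
X = pd.DataFrame(rng.normal(size=(40, 5)), columns=[f"x{i}" for i in range(5)])
Y = pd.DataFrame(rng.normal(size=(40, 4)), columns=list("abcd"))
Y.iloc[:30, 0] = np.nan     # column "a" is mostly missing
Y.iloc[[1, 5], 1] = np.nan  # column "b" has two gaps

# Keep only columns with at most 25% missing values (the threshold is arbitrary).
usable = Y.columns[Y.isna().mean() <= 0.25]

models = {}
for col in usable:
    mask = Y[col].notna()  # rows where this particular column is observed
    models[col] = LinearRegression().fit(X[mask], Y.loc[mask, col])

print(list(usable))
```

Masking per column avoids the problem with `dropna(how="any")`, where a gap in any one column removes the row for every column.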

df.fillna(df.mean(), inplace=True)
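A slightly longer sketch of the same approach, followed by a fit, might look like this. The frames `X` and `df` hold made-up values, and the non-inplace form of `fillna` is used only so the original frame stays available for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical predictors and targets, with a few NaNs in the targets.
X = pd.DataFrame(rng.normal(size=(40, 5)), columns=[f"x{i}" for i in range(5)])
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["a", "b", "c"])
df.iloc[[0, 4], 0] = np.nan
df.iloc[[2], 2] = np.nan

# Replace each NaN with the mean of its own column (df.mean() skips NaN by default).
df_filled = df.fillna(df.mean())

# LinearRegression accepts a 2-D target and fits one coefficient row per column.
model = LinearRegression().fit(X, df_filled)
print(df_filled.isna().sum().sum(), model.coef_.shape)
```

Because the target is 2-D, `model.coef_` has one row per target column, so a loop over the columns is not needed once the NaNs are filled.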