Importing a CSV and reshaping a variable's array for logistic regression

Time: 2020-04-11 15:36:40

Tags: python numpy statistics regression reshape

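I hope everyone is staying safe during the COVID-19 pandemic. I am new to Python and have a quick question about importing data from a CSV file into Python for a simple logistic regression analysis, where the dependent variable is binary and the independent variable is continuous.

I imported a CSV file and want to use one variable (Active) as the independent variable and another (Smoke) as the response variable. I can load the CSV file into Python, but every time I try to fit a logistic regression model that predicts Smoke from Exercise (the Active column), I get an error saying that Exercise must be reshaped into a single column (two-dimensional), because it is currently one-dimensional.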

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Exercise into the independent variable
x = np.array.reshape(-1,1)
y = data['Smoke'] # The dependent variable is set as Smoke

I keep getting the following error message:

ValueError: Expected 2D array, got 1D array instead: array=[97. 82. 88. 106. 78. 109. 66. 68. 100. 70. 98. 140. 105. 84. 134. 117. 100. 108. 76. 86. 110. 65. 85. 80. 87. 133. 125. 61. 117. 90. 110. 68. 102. 67. 112. 86. 85. 66. 73. 85. 110. 97. 93. 86. 80. 96. 74. 124. 78. 93. 80. 80. 92. 69. 82. 88. 74. 74. 75. 120. 105. 104. 99. 113. 67. 125. 133. 98. 80. 91. 76. 78. 94. 150. 92. 96. 68. 82. 102. 69. 65. 84. 86. 84. 116. 88. 65. 101. 89. 128. 68. 90. 80. 80. 98. 90. 82. 97. 90. 98. 88. 94. 92. 96. 80. 66. 110. 87. 88. 94. 96. 89. 74. 111. 81. 98. 99. 65. 95. 127. 76. 102. 88. 125. 72. 76. 112. 69. 101. 72. 112. 81. 90. 96. 66. 114. 71. 75. 102. 138. 85. 80. 107. 119. 98. 95. 95. 76. 96. 102. 82. 99. 80. 83. 102. 102. 106. 79. 80. 79. 110. 144. 80. 97. 60. 80. 108. 107. 51. 68. 80. 80. 60. 64. 87. 110. 110. 82. 154. 139. 86. 95. 112. 120. 79. 64. 84. 65. 60. 79. 79. 70. 75. 107. 78. 74. 80. 121. 120. 96. 75. 106. 88. 91. 98. 63. 95. 85. 83. 92. 81. 89. 103. 110. 78. 122. 122. 71. 65. 92. 93. 88. 90. 56. 95. 83. 97. 105. 82. 102. 87. 81.]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Below is the complete, updated code that still produces errors (04/12/2020). I could not paste the error log into this post, so I copied it into this public Google Doc: https://docs.google.com/document/d/1vtrj6Znv54FJ4Zvv211TQvvCN6Ac5LDaOfvHicQn0nU/edit?usp=sharing

Also, here is the CSV file: https://drive.google.com/file/d/1g_-vPNklxRn_3nlNPsR-IOflLfXSzFb1/view?usp=sharing

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = data['Active']
y = data['Smoke']
lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
p_pred = lr.predict_proba(x.values)
y_pred = lr.predict(x.values)
score_ = lr.score(x.values,y.values)
conf_m = confusion_matrix(y.values,y_pred.values)
report = classification_report(y.values,y_pred.values)
confusion_matrix(y, lr.predict(x))    
cm = confusion_matrix(y, lr.predict(x))
fig, ax = plt.subplots(figsize = (8,8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0,1), ticklabels = ('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0,1), ticklabels = ('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j,i,cm[i,j],ha='center',va='center',color='red', size='45')
plt.show()
print(classification_report(y,model.predict(x)))

2 Answers:

Answer 0 (score: 0)

Try this:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

data = pd.read_csv('Pulse.csv') # Read the data from the CSV file
x = data['Active'] # Load the values from Exercise into the independent variable
y = data['Smoke'] # The dependent variable is set as Smoke

lr = LogisticRegression().fit(x.values.reshape(-1,1), y)
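Some background on why this works: in the question, `x = np.array.reshape(-1,1)` fails because `np.array` is a function, not an array, so it has no `reshape` attribute; `reshape` has to be called on the data itself. The same 2-D shape is also needed for every later call (`predict`, `predict_proba`, `score`), not just for `fit`. Below is a minimal sketch of that pattern. It assumes a synthetic DataFrame as a stand-in for Pulse.csv (the random values and the names `X` and `rng` are made up for illustration), so the numbers it produces say nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for Pulse.csv (assumption: the real file is not read here).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Active": rng.normal(90, 15, size=200),
    "Smoke": rng.integers(0, 2, size=200),
})

# scikit-learn expects features as (n_samples, n_features).
X = data["Active"].to_numpy().reshape(-1, 1)  # shape (200, 1): one feature column
y = data["Smoke"].to_numpy()                  # shape (200,): labels stay 1-D

lr = LogisticRegression().fit(X, y)

# Reuse the same 2-D X for every later call.
p_pred = lr.predict_proba(X)          # NumPy array, shape (200, 2)
y_pred = lr.predict(X)                # NumPy array, shape (200,)
score_ = lr.score(X, y)
conf_m = confusion_matrix(y, y_pred)  # y_pred is already an array: no .values

print(X.shape, p_pred.shape, y_pred.shape)
```

Because `lr.predict` returns a plain NumPy array, calling `y_pred.values` as in the updated question code would raise an AttributeError; pass `y_pred` directly instead.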

Answer 1 (score: 0)

The code below should work:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
data = pd.read_csv('Pulse.csv')
x = pd.DataFrame(data['Active'])  # 2-D predictor; the question uses Active, not Smoke
y = data['Smoke']
lr = LogisticRegression()
lr.fit(x,y)
p_pred = lr.predict_proba(x)
y_pred = lr.predict(x)
score_ = lr.score(x,y)
conf_m = confusion_matrix(y,y_pred)
report = classification_report(y,y_pred)

print(score_)
0.8836206896551724

print(conf_m)
[[204   2]
 [ 25   1]]
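
To help read these two outputs together, here is a small sketch that recomputes the score from the confusion matrix printed above. It assumes scikit-learn's usual layout (rows are actual classes, columns are predicted classes); the variable names are only illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above:
# rows = actual class (0, 1), columns = predicted class (0, 1).
conf_m = np.array([[204, 2],
                   [25, 1]])

# Accuracy is the share of samples on the diagonal: (204 + 1) / 232.
accuracy = np.trace(conf_m) / conf_m.sum()
print(accuracy)

# Unpack the four cells to look at each class separately.
tn, fp, fn, tp = conf_m.ravel()
recall_smokers = tp / (tp + fn)      # 1 of 26 actual smokers predicted as 1
majority_baseline = (tn + fp) / conf_m.sum()  # always predicting 0: 206 / 232
print(recall_smokers, majority_baseline)
```

Read this way, the score of about 0.88 mostly reflects the class imbalance: only 1 of the 26 smokers is predicted as a smoker, and always predicting 0 would score slightly higher.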