Question

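I'm trying to iterate over a Series that was randomly sampled from an existing dataset to serve as a training set. Here is the output of the Series after the split: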

Index     data
0         1150
1         2000
2         1800
.         .
.         .
.         .
1960      1800
1962      1200
.         .
.         .
.         .
20010     1500

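There is no index 1961 because the random selection that created the training set removed it. When I try to loop over the data to compute the residual sum of squares, it doesn't work. Here is my loop code: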

def ResidSumSquares(x, y, intercept, slope):    
    out = 0
    temprss = 0
    for i in x:
        out = (slope * x.loc[i]) + intercept
        temprss = temprss + (y.loc[i] - out)
    RSS = temprss**2
    return print("RSS: {}".format(RSS))

KeyError: 'the label [1961] is not in the [index]'

I'm still new to Python, and I'm not sure of the best way to solve this.

Thanks in advance.

Answer 1

I found the answer right after posting the question, apologies. @mkln posted it in:

How to reset index in a pandas data frame?

df = df.reset_index(drop=True)

This resets the index of the whole Series; it is not exclusive to the DataFrame type.

My updated function code works like a charm:
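
A minimal sketch of what that call does to a Series whose labels have gaps. The sample values and the name `train` are made up for illustration:

```python
import pandas as pd

# Hypothetical training Series: label 1961 is missing after the random split.
train = pd.Series([1800, 1200, 1500], index=[1960, 1962, 1963])

# drop=True discards the old labels instead of keeping them as a new column.
reset = train.reset_index(drop=True)

print(train.index.tolist())  # [1960, 1962, 1963]
print(reset.index.tolist())  # [0, 1, 2]
```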

def ResidSumSquares(x, y, intercept, slope):    
    out = 0
    myerror = 0
    x = x.reset_index(drop=True)    
    y = y.reset_index(drop=True)    
    for i in x:      
        out = slope * x.loc[i] + float(intercept)
        myerror = myerror + (y.loc[i] - out)
    RSS = myerror**2
    return print("RSS: {}".format(RSS))

Answer 2

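You left out the actual call to ResidSumSquares. How about not resetting the index inside the function and just passing the training set as x? Iterating over an unusual (not 1, 2, 3, ...) index should not be a problem.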
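
A sketch of that idea, with invented sample data and a hypothetical function name. One thing worth knowing here: `for i in x` walks over the values of a Series, not its labels, so the loop below goes over `x.index` instead, and gaps in the labels are then harmless:

```python
import pandas as pd

def resid_by_label(x, y, intercept, slope):
    """Return the residual y - (slope*x + intercept) for each label in x."""
    residuals = {}
    for label in x.index:  # iterate over labels, not values
        predicted = slope * x.loc[label] + intercept
        residuals[label] = y.loc[label] - predicted
    return pd.Series(residuals)

# Hypothetical data; label 1 is missing from the index.
x = pd.Series([1.0, 2.0, 4.0], index=[0, 2, 3])
y = pd.Series([3.0, 5.0, 9.5], index=[0, 2, 3])

res = resid_by_label(x, y, intercept=1.0, slope=2.0)
print(res.tolist())  # [0.0, 0.0, 0.5]
```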

Answer 3

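Some observations:

1. As currently written, your function computes the square of the summed errors, not the sum of the squared errors... is that intentional? The latter is what is normally used in regression-type applications. Since your variable is named RSS, which I assume means residual sum of squares, you'll want to revisit this.
2. If x and y are consistent subsets of the same larger dataset, then both should have the same index, right? Otherwise, by dropping the index you may be pairing up unrelated x and y values and masking an error earlier in your code.
3. Since you're using Pandas, this can easily be vectorized for readability and speed (Python loops have high overhead).

An example of (3), assuming (2), and showing the difference between the two approaches in (1):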

# assuming your indices should be aligned,
# pandas will match xs and ys by index
# (parentheses matter: the intercept is part of the prediction being subtracted)
vectorized_error = y - (slope * x + float(intercept))
# your residual sum of squares--you have to square first!
rss = (vectorized_error**2).sum()
# if you really want the square of the summed errors...
sse = (vectorized_error.sum())**2
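
To see the two quantities differ on concrete numbers, here is a small self-contained sketch. The data, slope, and intercept are invented for illustration, and the function name is hypothetical:

```python
import pandas as pd

def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared residuals; x and y are aligned by index label."""
    residuals = y - (slope * x + intercept)
    return float((residuals ** 2).sum())

# Hypothetical data with a gap in the labels (11 is missing).
x = pd.Series([1.0, 2.0, 3.0], index=[10, 12, 13])
y = pd.Series([3.5, 4.5, 7.0], index=[10, 12, 13])

rss = residual_sum_of_squares(x, y, intercept=1.0, slope=2.0)
# Square of the summed residuals, for comparison.
squared_sum = float((y - (2.0 * x + 1.0)).sum() ** 2)

print(rss)          # 0.5
print(squared_sum)  # 0.0
```

The residuals here are 0.5, -0.5 and 0.0, so they cancel in the second quantity, which is why it can report zero for a fit that is not perfect.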

Edit: didn't notice this had been dead for a year.

Python, Pandas: 80/20 random split of data; how to loop when index values are "missing"?
