Question

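I'm new to Python (using version 3.7). I have a dataframe created by loading a list from a csv file. I want to update one column in the dataframe ("Score") so that it holds the sum of calculations performed on specific column values in the same row. Here is the code snippet: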

#load library
import pandas as pd
#get the data
file_name = r"c:\myfile.csv"  # raw string so the backslash is not read as an escape
df = pd.read_csv(file_name)
#get the variable parameters
sVariableList = ["depth","rpm","pressure","flow_rate","lag" ]
sWeights = [.20, .20, .30, .15, .15] 
sMeans = [57.33283924063220, 7159.6003409761900, 20.270635083327700, 55.102824912342000, 90.67]
sSTD  = [101.803564244615000 , 3124.14373264349000, 32.461940805541400, 93.338695138920900, 61.273]

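The dataframe contains more columns than the items listed in sVariableList. sVariableList only represents the fields I want to perform the calculation on. What I want to do is calculate a score for each row and store the value in the "Score" column. This is what I'm doing now, and it gives the correct result: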

#loop through the record and perform the calculation
for row in range(len(df)):
    ind = 0
    nScore = 0.0
    for fieldname in sVariableList: 

        #calculate the score
        nScore = nScore + ( sWeights[ind]*(((df.loc[row, fieldname] - sMeans[ind])/sSTD[ind])**2) )
        ind = ind + 1 #move to the next variable/field index

    #set the result to the field value
    df.loc[row, "Score"] = nScore

But it is very slow. My dataset has 900,000 records.

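I've found some articles discussing list comprehensions as a possible alternative to iteration, but I'm not yet familiar enough with the language to implement one. Any ideas are appreciated.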

Thanks

Answer 1

Do the calculation on the underlying numpy data and assign only the final result to the dataframe:

import numpy as np

x = np.array([sWeights, sMeans, sSTD])
y = df[sVariableList].to_numpy()
df['Score'] = (x[0] * ((y - x[1]) / x[2])**2).sum(axis=1)

For 900,000 records, this takes about 0.15 seconds on my machine.
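To see why this works, note that the per-row score is the sum over the listed columns of weight * ((value - mean) / std)**2, and numpy broadcasting applies each length-5 parameter vector across every row of the (n_rows, 5) value array at once. Below is a minimal self-contained sketch that checks the vectorized result against a row-by-row reference. The random data and the extra "other" column are hypothetical stand-ins for the CSV, and the names weights, means, stds and values are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Column names and parameters copied from the question
sVariableList = ["depth", "rpm", "pressure", "flow_rate", "lag"]
sWeights = [.20, .20, .30, .15, .15]
sMeans = [57.33283924063220, 7159.6003409761900, 20.270635083327700,
          55.102824912342000, 90.67]
sSTD = [101.803564244615000, 3124.14373264349000, 32.461940805541400,
        93.338695138920900, 61.273]

# Hypothetical stand-in for the CSV: 100 random rows plus one unused column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 5)) * sSTD + sMeans,
                  columns=sVariableList)
df["other"] = "ignored"

# Vectorized score: each length-5 vector broadcasts across all rows
weights = np.asarray(sWeights)
means = np.asarray(sMeans)
stds = np.asarray(sSTD)
values = df[sVariableList].to_numpy()  # shape (n_rows, 5)
df["Score"] = (weights * ((values - means) / stds) ** 2).sum(axis=1)

# Slow row-by-row reference, used only to check the vectorized result
loop_scores = [
    sum(w * ((df.loc[row, name] - m) / s) ** 2
        for name, w, m, s in zip(sVariableList, sWeights, sMeans, sSTD))
    for row in range(len(df))
]
scores_match = np.allclose(df["Score"], loop_scores)
```

The reference loop touches the dataframe once per cell, which is likely what makes the original approach slow; the vectorized version reads the selected columns once and writes the result column once.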

Performing calculations on multiple dataframe columns using a list of values, without iterating
