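Need to calculate columns from a CSV using pandas

Time: 2018-03-26 10:26:02

Tags: python pandas

incidentcountlevel1 and examcount are two column names in a CSV file. I want to calculate two new columns based on these columns. I wrote the script below, but it fails: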

import pandas as pd
import numpy as np
import time, os, fnmatch, shutil
df = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df1 = pd.read_csv(r"/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',skiprows=[1])
df3 = pd.read_csv("/home/corp_sourcing/Metric_Fact_20180324_1227.csv",header='infer',converters={"incidentcountlevel1":int})
inc_count_lvl_1 = df3.loc[:, ['incidentcountlevel1']]
exam_count=df3.loc[:, ['examcount']]

for exam_count in exam_count: #need to iterate this col to calculate for each row
    if exam_count < 1:
        print("IPTE Cannot be calculated")
    else:
        if inc_count_lvl_1 > 5:
            ipte1 = (inc_count_lvl_1/exam_count)*1000
        else:
            dof = 2*(inc_count_lvl_1 + 1)
            chi_square = chi2.ppf(0.5, dof)
            ipte1 = (chi_square/(2*exam_count))*1000

1 answer:

Answer 0 (score: 1)

You can apply a lambda function to pandas columns. I just created an example using numpy; you can adapt it to your own case.

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 50]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
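The example above uses np.multiply rather than an actual lambda. For comparison, here is a minimal sketch of the lambda form using DataFrame.apply with axis=1, on the same sample data. It runs row by row, so it is usually slower than the numpy call on large frames.

```python
import pandas as pd

# Same sample data as in the example above.
df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 50]})

# axis=1 passes each row to the lambda as a Series indexed by column name.
df["new_column"] = df.apply(lambda row: row["A"] * row["B"], axis=1)
```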

Or you can create your own function:

>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  50        1500
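One thing to watch with np.vectorize: unless otypes is given, the output dtype is taken from the result of the first call. If a function returns an int for the first row and floats later, the floats can end up stored as ints. The sketch below illustrates this with a hypothetical function, half_or_flag, which is not part of the original question.

```python
import numpy as np

def half_or_flag(x):
    # Hypothetical function: returns an int flag for x < 1, a float otherwise.
    return -1 if x < 1 else x / 2

# Without otypes, the output dtype is inferred from the first result (-1, an int).
inferred = np.vectorize(half_or_flag)([0, 3])

# With otypes, every result is stored as a float.
explicit = np.vectorize(half_or_flag, otypes=[float])([0, 3])
```

Passing otypes=[float] is a cheap safeguard whenever the function mixes a sentinel value with computed floats.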

In your case, the solution might look like this.

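from scipy.stats import chi2  # assumed source of chi2; the question does not show its import

def fx(exam_count, inc_count_lvl_1):
    if exam_count < 1:
        return -1  # IPTE cannot be calculated; return whatever you want
    if inc_count_lvl_1 > 5:
        ipte1 = (inc_count_lvl_1 / exam_count) * 1000
    else:
        dof = 2 * (inc_count_lvl_1 + 1)
        chi_square = chi2.ppf(0.5, dof)
        ipte1 = (chi_square / (2 * exam_count)) * 1000
    return ipte1

# Define fx before using it; otypes keeps float results even if the first row returns -1.
df['new_column'] = np.vectorize(fx, otypes=[float])(df['examcount'], df['incidentcountlevel1'])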
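The same rule can also be written without a per-row Python function, since numpy and chi2.ppf operate on whole arrays. This is only a sketch under a few assumptions: chi2 is taken to be scipy.stats.chi2, the sample values are made up, and NaN (rather than -1) marks rows where examcount is below 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2  # assumption: chi2 in the question is scipy.stats.chi2

# Hypothetical sample data; the real values come from the CSV.
df = pd.DataFrame({"examcount": [0, 2000, 500],
                   "incidentcountlevel1": [3, 10, 2]})

exams = df["examcount"].to_numpy(dtype=float)
incidents = df["incidentcountlevel1"].to_numpy(dtype=float)

# Rows with fewer than one exam cannot be calculated; NaN marks them.
safe_exams = np.where(exams >= 1, exams, np.nan)

# Both branches are computed for every row, then np.where picks one per row.
direct = incidents / safe_exams * 1000
chi_based = chi2.ppf(0.5, 2 * (incidents + 1)) / (2 * safe_exams) * 1000

df["ipte1"] = np.where(incidents > 5, direct, chi_based)
```

Computing both branches and selecting afterwards costs a little extra arithmetic, but it avoids the Python-level loop that np.vectorize performs internally.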

If you don't want to use lambda functions, you can use iterrows. iterrows is a generator that yields both the index and the row.

for index, row in df.iterrows():
    print(row['examcount'], row['incidentcountlevel1'])
    # do your stuff.
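To make the difference from the loop in the question concrete: iterating over a DataFrame directly yields its column labels, whereas iterrows yields one (index, row) pair per row. The sketch below uses made-up sample values and, to stay short, applies only the simple incidents-per-1000-exams formula to every valid row; the chi-square branch is deliberately left out.

```python
import pandas as pd

# Hypothetical sample data; the real values come from the CSV.
df = pd.DataFrame({"examcount": [0, 2000, 500],
                   "incidentcountlevel1": [3, 10, 2]})

# Iterating a DataFrame directly yields column labels, not row values.
labels = [item for item in df[["examcount"]]]

# iterrows yields (index, row) pairs; collect one result per row.
rates = []
for index, row in df.iterrows():
    if row["examcount"] < 1:
        rates.append(float("nan"))  # cannot be calculated
    else:
        rates.append(row["incidentcountlevel1"] / row["examcount"] * 1000)

# Assign the collected results back as a new column.
df["rate_per_1000"] = rates
```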

I hope it helps.