Create distinct tuple values for columns in a csv and compute the average of a third column

Time: 2019-07-10 10:43:43

Tags: python csv tuples average

I have a dataset:

string1 string2 rate distance 
A.      C.      1    20
A.      B       2.   30
A.      C.      2.   20

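string1 and string2 together form tuples, and the same tuple can occur in several rows. I want to find the distinct (string1, string2) tuples and then, for each tuple, compute the average of rate/distance over its rows. This is only dummy data; in the real data a given tuple can occur many times (around 10000).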

So far I have created the tuples. I am not sure how to merge the tuples and compute the average.

import csv

def read_csv(filepath, has_header=False):
    # Read every row of the csv; the with-block closes the file on exit.
    with open(filepath, 'r') as file:
        reader = csv.reader(file)
        data = list(reader)

    # Optionally split the first row off as the header.
    header = None
    if has_header:
        header = data[0]
        data = data[1:]

    return data, header

if __name__ == '__main__':

    outfilepath = "data/outfile12.csv"

    outdata = []

    codes, header = read_csv("data/sample.csv", has_header=True)

    # create dictionary keyed by the (string1, string2) tuple
    codes_dict = {}
    for code in codes:
        codes_dict[(code[0], code[1])] = []

    for row in codes:
        # Write logic here
        pass

The output should look like this:

string1 string2 column 
    A      C      0.003    
    A      B     0.00030
    B      A    0.000020

Can someone help?

2 Answers:

Answer 0: (score: 2)

Here you go:

= ^ .. ^ =

import pandas as pd
from io import StringIO

# create raw data
raw_data = StringIO("""
string1 string2 rate distance
A. C. 1 20
A. B 2. 30
A. C. 2. 20""")

# load data into data frame
df = pd.read_csv(raw_data, sep=' ')
# compute rate / distance for each row
df['divide'] = df['rate'] / df['distance']
# drop the columns that are no longer needed
df = df.drop(columns=['rate','distance'])
# group by the two string columns and average the values
result = df.groupby(['string1', 'string2']).mean()
print(result)

Output:

                   divide
string1 string2          
A.      B        0.066667
        C.       0.075000
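
If a flat table with the three columns from the question (string1, string2, column) is wanted instead of a MultiIndex, something like the following should work. This is only a sketch: the column name `column` is taken from the desired output in the question, and `as_index=False` is used here in place of the `drop` step above.

```python
from io import StringIO

import pandas as pd

# Same rows as above, single-space separated.
raw_data = StringIO(
    "string1 string2 rate distance\n"
    "A. C. 1 20\n"
    "A. B 2. 30\n"
    "A. C. 2. 20\n"
)
df = pd.read_csv(raw_data, sep=' ')

# Add the per-row ratio, then average it per (string1, string2) pair.
# as_index=False keeps the group keys as ordinary columns.
result = (
    df.assign(column=df['rate'] / df['distance'])
      .groupby(['string1', 'string2'], as_index=False)['column']
      .mean()
)
print(result.to_string(index=False))
```

From there, `result.to_csv("data/outfile12.csv", index=False)` would write the table to the output path used in the question.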

Answer 1: (score: 1)

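You should consider using pandas for tasks like this. You can look up how to handle special cases (such as a csv file without a header) yourself; I will give a basic example: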

import pandas as pd

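First load the csv. How to do that depends on the file's format, so you may need to change the separator; I took the format from your sample data (fields separated by multiple spaces):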

dataframe = pd.read_csv(filepath, sep=r'\s+')
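
To check the separator without touching the real file, the sample rows from the question can be parsed from a string. This is just a sketch; `sample` is a stand-in for the actual csv and the real file may be laid out differently.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, with irregular runs of spaces between fields.
sample = StringIO(
    "string1 string2 rate distance\n"
    "A.      C.      1    20\n"
    "A.      B       2.   30\n"
    "A.      C.      2.   20\n"
)

# The regex r'\s+' treats any run of whitespace as a single delimiter.
dataframe = pd.read_csv(sample, sep=r'\s+')
print(dataframe.columns.tolist())
print(dataframe.dtypes)
```

With `sep=' '` the same text would be split on every single space, which is why the regex separator is the safer choice for data aligned with padding.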

Then group the data by a set of columns:

groupby = dataframe.groupby(['string1','string2'])
print(groupby.groups) 

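This returns a "DataFrameGroupBy" object, which is essentially a wrapper around a list of pairs (tuple of column values, dataframe of the rows matching those values).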
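
Iterating over the grouped object makes those pairs visible. A small sketch on the sample rows from the question (the `groups` dictionary is only there for illustration):

```python
from io import StringIO

import pandas as pd

sample = StringIO(
    "string1 string2 rate distance\n"
    "A. C. 1 20\n"
    "A. B 2. 30\n"
    "A. C. 2. 20\n"
)
dataframe = pd.read_csv(sample, sep=r'\s+')

groups = {}
for key, rows in dataframe.groupby(['string1', 'string2']):
    # key is the (string1, string2) tuple, rows is the matching sub-dataframe
    groups[key] = rows
    print(key, len(rows))
```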

Then apply a custom function to the rows of each group to add the new column:

def add_average_velocity(input_rows):
    input_rows['avg_velocity'] = (input_rows['rate']/input_rows['distance']).mean()
    return input_rows

new_dataframe = dataframe.groupby(['string1','string2']).apply(add_average_velocity).reset_index()
print(new_dataframe)
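
As far as I know, how `groupby(...).apply` handles the group keys when the function returns the rows of the group has changed between pandas releases, so the `reset_index()` call above may behave differently depending on the installed version. A variant that does not depend on that, sketched here on the sample rows, computes the per-row ratio first and uses `transform`:

```python
from io import StringIO

import pandas as pd

sample = StringIO(
    "string1 string2 rate distance\n"
    "A. C. 1 20\n"
    "A. B 2. 30\n"
    "A. C. 2. 20\n"
)
dataframe = pd.read_csv(sample, sep=r'\s+')

# Per-row ratio, then the group mean broadcast back onto every row.
ratio = dataframe['rate'] / dataframe['distance']
dataframe['avg_velocity'] = ratio.groupby(
    [dataframe['string1'], dataframe['string2']]
).transform('mean')
print(dataframe)
```

Each row keeps its original columns and gains the average of its own (string1, string2) group.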

Or, if you want to get rid of all the old data and keep only the new values:

def add_average_velocity(input_rows):
    output_data = pd.Series({'velocity':(input_rows['rate']/input_rows['distance']).mean()})
    # you can skip making a pd.Series objects if you are okay with having the data unnamed in resulting dataframe. You can always rename columns later anyway.
    return output_data

new_dataframe = dataframe.groupby(['string1','string2']).apply(add_average_velocity).reset_index()
print(new_dataframe)
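
If pandas is not an option, the same aggregation can be done with the standard library, which is closer to the dictionary-of-tuples approach in the question. This is a minimal sketch that assumes a comma-separated file with a header row; `sample` stands in for the real file.

```python
import csv
from collections import defaultdict
from io import StringIO

# Stand-in for the real file, assumed comma-separated with a header row.
sample = StringIO(
    "string1,string2,rate,distance\n"
    "A,C,1,20\n"
    "A,B,2,30\n"
    "A,C,2,20\n"
)

# Collect the rate/distance ratio of every row under its (string1, string2) key.
ratios = defaultdict(list)
for row in csv.DictReader(sample):
    key = (row['string1'], row['string2'])
    ratios[key].append(float(row['rate']) / float(row['distance']))

# Average the collected ratios for each distinct tuple.
averages = {key: sum(vals) / len(vals) for key, vals in ratios.items()}
for (s1, s2), avg in averages.items():
    print(s1, s2, avg)
```

For very large inputs, keeping a running sum and count per key instead of a list would use less memory.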