I have a dataset:
string1 string2 rate distance
A. C. 1 20
A. B 2. 30
A. C. 2. 20
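string1 and string2 take many values, and the same (string1, string2) pair can appear on several rows. I want to find the distinct (string1, string2) tuples and then, for each one, compute the average of rate/distance over its rows. This is just dummy data; the original data has many (about 10000) such tuples.
So far I have created the tuples. I am not sure how to merge identical tuples and compute the average.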
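import csv

def read_csv(filepath, has_header=False):
    """Read a CSV file and return (rows, header); header is None if absent."""
    # The with block closes the file on exit, so no explicit close() is needed
    with open(filepath, 'r', newline='') as file:
        reader = csv.reader(file)
        data = list(reader)
    header = None
    if has_header:
        header = data[0]
        data = data[1:]
    return data, header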
if __name__ == '__main__':
    outfilepath = "data/outfile12.csv"
    outdata = []
    codes, header = read_csv("data/sample.csv", has_header=True)
    # create dictionary keyed by (string1, string2) tuples
    codes_dict = {}
    for code in codes:
        # start each tuple with an empty list (placeholder value)
        codes_dict[(code[0], code[1])] = []
    for row in codes:
        # Write logic here
        pass
The output should look like this:
string1 string2 column
A C 0.003
A B 0.00030
B A 0.000020
Can anyone help?
Answer 0 (score: 2)
Here you go:
import pandas as pd
from io import StringIO

# create raw data
raw_data = StringIO("""
string1 string2 rate distance
A. C. 1 20
A. B 2. 30
A. C. 2. 20""")

# load data into a data frame
df = pd.read_csv(raw_data, sep=' ')

# compute rate / distance for each row
df['divide'] = df['rate'] / df['distance']

# drop the columns that are no longer needed
df = df.drop(columns=['rate', 'distance'])

# group by both string columns and take the mean
result = df.groupby(['string1', 'string2']).mean()
print(result)
Output:
                   divide
string1 string2
A.      B        0.066667
        C.       0.075000
Answer 1 (score: 1)
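You should consider using pandas for tasks like this. I will leave it to you to look up how to handle special cases (such as a CSV file with no header) and just give a basic example: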
import pandas as pd
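First load the CSV. How to do that depends on the file's format, so you may need to change the separator; I inferred the format from the sample data (columns separated by multiple spaces):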
# raw string so that \s is passed to the regex engine rather than treated as a string escape
dataframe = pd.read_csv(filepath, sep=r'\s+')
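For the special case mentioned above (a file with no header row), a minimal sketch could pass `header=None` and supply column names through `names`. The in-memory sample and the column names below are assumptions taken from the question's data, not a real file:

```python
import pandas as pd
from io import StringIO

# Hypothetical header-less sample, standing in for a real file on disk
raw = StringIO("A. C. 1 20\nA. B 2. 30\nA. C. 2. 20\n")

# header=None tells pandas the first line is data; names supplies the column labels
dataframe = pd.read_csv(
    raw,
    sep=r'\s+',
    header=None,
    names=['string1', 'string2', 'rate', 'distance'],
)
print(dataframe)
```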
Then group the data by a set of columns:
groupby = dataframe.groupby(['string1','string2'])
print(groupby.groups)
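This returns a "DataFrameGroupBy" object, which is essentially a wrapper around a list of pairs (tuple of column values, DataFrame of the rows that match those values).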
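To make that description concrete, iterating over the grouped object yields one (key, sub-frame) pair per distinct tuple. This is a small sketch using the question's sample rows built in memory; the `group_sizes` dictionary is only there for illustration:

```python
import pandas as pd

# Sample rows from the question, built in memory for illustration
dataframe = pd.DataFrame({
    'string1': ['A.', 'A.', 'A.'],
    'string2': ['C.', 'B', 'C.'],
    'rate': [1.0, 2.0, 2.0],
    'distance': [20, 30, 20],
})

grouped = dataframe.groupby(['string1', 'string2'])

group_sizes = {}
for key, rows in grouped:
    # key is the (string1, string2) tuple; rows is the DataFrame of matching rows
    group_sizes[key] = len(rows)
    print(key, len(rows))
```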
Then apply a custom function to the rows of each group to add a new column:
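def add_average_velocity(input_rows):
    # work on a copy so the frame passed in is not modified in place
    output_rows = input_rows.copy()
    output_rows['avg_velocity'] = (output_rows['rate'] / output_rows['distance']).mean()
    return output_rows

# group_keys=False keeps the original row index, so reset_index() is not needed
# (how apply treats the grouping columns differs between pandas versions)
new_dataframe = dataframe.groupby(['string1', 'string2'], group_keys=False).apply(add_average_velocity)
print(new_dataframe)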
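If the goal is only to attach the per-group mean to every row, one alternative that avoids a custom function is `transform('mean')` on the ratio. This is a sketch of that swapped-in technique rather than the approach above; the in-memory frame repeats the question's sample data:

```python
import pandas as pd

# Sample rows from the question, built in memory for illustration
dataframe = pd.DataFrame({
    'string1': ['A.', 'A.', 'A.'],
    'string2': ['C.', 'B', 'C.'],
    'rate': [1.0, 2.0, 2.0],
    'distance': [20, 30, 20],
})

# per-row ratio, then the mean of that ratio within each (string1, string2) group,
# broadcast back so every row receives its group's mean
ratio = dataframe['rate'] / dataframe['distance']
dataframe['avg_velocity'] = ratio.groupby(
    [dataframe['string1'], dataframe['string2']]
).transform('mean')
print(dataframe)
```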
Or, if you want to discard all the old data and keep only the new values:
def add_average_velocity(input_rows):
    # Wrapping the value in a pd.Series gives the resulting column a name.
    # You can skip this if you are okay with an unnamed column; you can always rename columns later anyway.
    output_data = pd.Series({'velocity': (input_rows['rate'] / input_rows['distance']).mean()})
    return output_data

new_dataframe = dataframe.groupby(['string1', 'string2']).apply(add_average_velocity).reset_index()
print(new_dataframe)
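For comparison, the dictionary-of-tuples idea from the question can also be completed without pandas. The sketch below assumes the rows look like what `csv.reader` would return (lists of strings); the hard-coded `rows` list is illustrative sample data, not the real file:

```python
from collections import defaultdict

# Rows as csv.reader would yield them (strings); hard-coded here for illustration
rows = [
    ['A.', 'C.', '1', '20'],
    ['A.', 'B', '2.', '30'],
    ['A.', 'C.', '2.', '20'],
]

# collect every rate/distance ratio under its (string1, string2) tuple
ratios = defaultdict(list)
for string1, string2, rate, distance in rows:
    ratios[(string1, string2)].append(float(rate) / float(distance))

# average the ratios collected for each tuple
averages = {key: sum(values) / len(values) for key, values in ratios.items()}

for (string1, string2), average in averages.items():
    print(string1, string2, average)
```

Grouping with a `defaultdict(list)` keeps the logic to a single pass over the rows, which should be fine for data on the order of 10000 tuples.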