I'm new to Python. In a directory I have two csv files:
file1.csv
Id place,Date and hour, Value
1,2018.09.17.12.54,200000
2,2018.09.18.14.16,150000
1,2018.09.19.15.06,78000
3,2018.09.17.16.26,110000
2,2018.09.20.13.54,200000
3,2018.09.17.14.16,150000
1,2018.09.21.12.54,200000
file2.csv
Id place,Date and hour, Value
1,2018.09.24.12.54,200000
3,2018.09.24.14.16,150000
1,2018.09.24.15.06,78000
2,2018.09.26.16.26,110000
1,2018.09.27.12.54,200000
3,2018.09.25.14.16,150000
1,2018.09.28.12.54,200000
3,2018.09.28.14.16,150000
I want to read all the csv files in the directory and save the relevant information in new csv files.
Output 1:
Id place, Value
1, 1 156 000
2, 460 000
3, 710 000
Output 2:
Week, average Value
1 , 155428,57 (1088000 / 7)
2 , 154750 (1238000 / 8)
Output 3:
Id place,Week, average Value
1, 1 , 159 333 (478000 / 3)
2, 1 , 175 000 (350000 / 2)
3, 1 , 130 000 (260 000/ 2)
1, 2 , 169 500 (678000 / 4)
2, 2 , 110 000 (110000 / 1)
3, 2 , 150 000 (450000 / 3)
I don't know how to do this.
Answer 0 (score: 3)
I suggest using pandas:
import glob
import pandas as pd
#get all files
files = glob.glob('files/*.csv')
#create list of DataFrames; strip trailing whitespace from csv headers if necessary
dfs = [pd.read_csv(fp).rename(columns=lambda x: x.strip()) for fp in files]
#join together all files
df = pd.concat(dfs, ignore_index=True)
#convert column to datetimes
df['Date and hour'] = pd.to_datetime(df['Date and hour'], format='%Y.%m.%d.%H.%M')
#convert to ISO week numbers and renumber them starting from 1 with factorize
#(Series.dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week is the replacement)
df['week'] = pd.factorize(df['Date and hour'].dt.isocalendar().week)[0] + 1
print (df)
    Id place       Date and hour   Value  week
0          1 2018-09-17 12:54:00  200000     1
1          2 2018-09-18 14:16:00  150000     1
2          1 2018-09-19 15:06:00   78000     1
3          3 2018-09-17 16:26:00  110000     1
4          2 2018-09-20 13:54:00  200000     1
5          3 2018-09-17 14:16:00  150000     1
6          1 2018-09-21 12:54:00  200000     1
7          1 2018-09-24 12:54:00  200000     2
8          3 2018-09-24 14:16:00  150000     2
9          1 2018-09-24 15:06:00   78000     2
10         2 2018-09-26 16:26:00  110000     2
11         1 2018-09-27 12:54:00  200000     2
12         3 2018-09-25 14:16:00  150000     2
13         1 2018-09-28 12:54:00  200000     2
14         3 2018-09-28 14:16:00  150000     2
#aggregate sum
df1 = df.groupby('Id place', as_index=False)['Value'].sum()
print (df1)
   Id place    Value
0         1  1156000
1         2   460000
2         3   710000
#aggregate mean
df2 = df.groupby('week', as_index=False)['Value'].mean()
print (df2)
   week          Value
0     1  155428.571429
1     2  154750.000000
#aggregate mean per 2 columns
df3 = df.groupby(['Id place','week'], as_index=False)['Value'].mean()
print (df3)
   Id place  week          Value
0         1     1  159333.333333
1         1     2  169500.000000
2         2     1  175000.000000
3         2     2  110000.000000
4         3     1  130000.000000
5         3     2  150000.000000
#write output DataFrames to files
df1.to_csv('out1.csv', index=False)
df2.to_csv('out2.csv', index=False)
df3.to_csv('out3.csv', index=False)
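If the files should match the column headers and rounding shown in the desired output, a small post-processing step before writing helps. A minimal sketch; the renamed headers "Week" and "average Value" are assumptions taken from the question's expected output:

```python
import pandas as pd

# Small frame standing in for df2 above (values from the question's expected output)
df2 = pd.DataFrame({'week': [1, 2], 'Value': [155428.571429, 154750.0]})

# Round the averages and rename the columns to match the asker's headers
out = df2.round(2).rename(columns={'week': 'Week', 'Value': 'average Value'})
print(out.to_csv(index=False))
```

The same `round` + `rename` chain can be applied to df3 before its `to_csv` call.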
Answer 1 (score: 1)
Definitely not the recommended way, as pandas is by far the better approach, but the manual way to do this is to group the items with a defaultdict and do the calculations at the end.
Demo:
from csv import reader
from os import listdir
from collections import defaultdict
from datetime import datetime
from operator import itemgetter
from pprint import pprint
# Collect sums first in a defaultdict
sums = defaultdict(list)
# Collect dates separately since they are more complicated
dates = []
# Get all csv files and open them
for file in listdir("."):
    if file.endswith(".csv"):
        with open(file) as f:
            csv_reader = reader(f)
            # Skip headers
            next(csv_reader)
            # Separately get sums and dates stuff
            for place, date, value in csv_reader:
                sums[int(place)].append(int(value))
                dates.append(
                    (place, datetime.strptime(date, "%Y.%m.%d.%H.%M"), int(value))
                )
# Print out sum of columns
sum_column_values = {k: sum(v) for k, v in sums.items()}
pprint(sum_column_values)
# Get the ISO week number of the earliest date (used to renumber weeks from 1)
min_date = min(map(itemgetter(1), dates)).date().isocalendar()[1]
# Collect weeks stuff in separate dicts
weeks = defaultdict(list)
place_weeks = defaultdict(list)
for place, date, value in dates:
    # Week-number calculation relative to the first week
    week_number = date.date().isocalendar()[1] - min_date + 1
    # Collect week stuff
    weeks[week_number].append(value)
    place_weeks[int(place), week_number].append(value)
# Print out week averages
week_averages = {k: sum(v) / len(v) for k, v in weeks.items()}
pprint(week_averages)
# Print out place/week averages
place_week_averages = {k: sum(v) / len(v) for k, v in place_weeks.items()}
pprint(place_week_averages)
This gives the following results, stored in separate dictionaries:
# place sums
{1: 1156000, 2: 460000, 3: 710000}
# week averages
{1: 155428.57142857142, 2: 154750.0}
# place/week averages
{(1, 1): 159333.33333333334,
 (1, 2): 169500.0,
 (2, 1): 175000.0,
 (2, 2): 110000.0,
 (3, 1): 130000.0,
 (3, 2): 150000.0}
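The pprint calls above only display the results. To also save them to csv files, as the pandas answer does, the plain csv module is enough. A minimal sketch; the out1.csv/out3.csv file names and header spellings are assumptions mirroring the question:

```python
import csv

# Dictionaries as produced by the code above (values from the question)
sum_column_values = {1: 1156000, 2: 460000, 3: 710000}
place_week_averages = {(1, 1): 159333.33333333334, (1, 2): 169500.0}

# One row per dictionary entry, sorted for a stable output order
with open('out1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id place', 'Value'])
    for place, total in sorted(sum_column_values.items()):
        writer.writerow([place, total])

with open('out3.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Id place', 'Week', 'average Value'])
    for (place, week), avg in sorted(place_week_averages.items()):
        writer.writerow([place, week, round(avg, 2)])
```

`newline=''` is the documented way to open files passed to csv.writer, so that row endings are handled by the csv module rather than the platform.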