I'm new to Python, so I need some help.
I have a CSV file containing id, created_at (a date), first_name and last_name columns.
id created_at first_name last_name
1 1309380645 Cecelia Holt
2 1237178109 Emma Allison
3 1303585711 Desiree King
4 1231175716 Sam Davidson
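I want to filter the rows that fall between two dates, say 03-22-2016 and 04-15-2016 (the exact dates don't matter), and then sort those rows in ascending order by created_at.
I know that this code just prints all, or most, of the data: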
import csv
from datetime import datetime

with open("sample_data.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        # print every field of the row, separated by spaces
        print(" ".join(row))
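But I don't know how to do the rest, or how to filter using a timestamp such as 1309380645.
Would pandas be a better fit for this than csv?
Any help, or pointers to guides or books for further reading, would be much appreciated.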
Answer 0 (score: 2)
I recommend pandas, since it lets you filter quickly and carry on with further analysis.
# import pandas
import pandas as pd

# read csv file
df = pd.read_csv("sample_data.csv")

# convert created_at from unix time (seconds) to datetime
df['created_at'] = pd.to_datetime(df['created_at'], unit='s')

# contents of df at this point
#    id          created_at first_name last_name
# 0   1 2011-06-29 20:50:45    Cecelia      Holt
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 2   3 2011-04-23 19:08:31    Desiree      King
# 3   4 2009-01-05 17:15:16        Sam  Davidson

# filtering example
# (compare against pd.Timestamp; recent pandas versions refuse to
# compare a datetime64 column with a plain datetime.date)
df_filtered = df[df['created_at'] <= pd.Timestamp(2011, 3, 22)]

# output of df_filtered
#    id          created_at first_name last_name
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 3   4 2009-01-05 17:15:16        Sam  Davidson

# filter based on dates mentioned in the question
# (pd.Timestamp(2016, 4, 15) is midnight, so later times on April 15 are excluded)
df_filtered = df[(df['created_at'] >= pd.Timestamp(2016, 3, 22)) &
                 (df['created_at'] <= pd.Timestamp(2016, 4, 15))]

# output of df_filtered would be empty at this point since the
# dates are out of this range

# sort in ascending order of created_at
df_sorted = df_filtered.sort_values(['created_at'])
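To try the whole filter-then-sort step without a file on disk, here is a minimal sketch. It assumes the data is comma-delimited, and the `SAMPLE` string, the `filter_and_sort` helper and the 2011 date range are made up for illustration (the range is chosen so the sample rows actually match). It uses an exclusive upper bound one day after the end date so that the entire end day is included.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the question's sample, assumed comma-delimited.
SAMPLE = """id,created_at,first_name,last_name
1,1309380645,Cecelia,Holt
2,1237178109,Emma,Allison
3,1303585711,Desiree,King
4,1231175716,Sam,Davidson
"""


def filter_and_sort(csv_text, start, end):
    """Keep rows whose created_at falls on start..end (inclusive days), oldest first."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Interpret the integers as seconds since the Unix epoch.
    df["created_at"] = pd.to_datetime(df["created_at"], unit="s")
    start_ts = pd.Timestamp(start)
    # Exclusive bound at midnight of the following day covers the whole end day.
    end_ts = pd.Timestamp(end) + pd.Timedelta(days=1)
    mask = (df["created_at"] >= start_ts) & (df["created_at"] < end_ts)
    return df[mask].sort_values("created_at")


result = filter_and_sort(SAMPLE, "2011-01-01", "2011-12-31")
print(result)
```

With the question's own range (2016-03-22 to 2016-04-15) the same call returns an empty frame, because none of the sample timestamps fall in 2016.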
The first thing to know is that applying a comparison operator to a DataFrame column returns a Series of boolean values.
df['id'] > 2
returns
False
False
True
True
pandas also supports boolean indexing: if you index a DataFrame with a boolean Series, it returns only the rows that correspond to True.
df[df['id'] > 2]
returns
3 1303585711 Desiree King
4 1231175716 Sam Davidson
This is how you can filter easily in pandas.
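The boolean-mask idea can be checked in isolation with a small sketch. The DataFrame below is built by hand from the question's sample, and the 1300000000 cutoff is an arbitrary value chosen only for illustration. Note the parentheses around each comparison: `&` binds more tightly than `>` and `<`, so they are needed when combining conditions.

```python
import pandas as pd

# Hand-built frame with the ids and raw Unix timestamps from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": [1309380645, 1237178109, 1303585711, 1231175716],
})

# A comparison on a column yields a boolean Series, one value per row.
mask = df["id"] > 2
print(mask.tolist())

# Combine two conditions element-wise with & (each side parenthesised).
both = (df["id"] > 1) & (df["created_at"] < 1300000000)

# Indexing with the mask keeps only the rows where it is True.
selected = df[both]
print(selected)
```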
Answer 1 (score: 1)
Downloading and installing (and learning) pandas just to do this seems like overkill.
Here is how to do it using only Python's built-in modules:
import csv
from datetime import date

start_date = date(2011, 1, 1)
end_date = date(2011, 12, 31)

# Read csv data into memory filtering rows by the date in column 2 (row[1]).
csv_data = []
with open("sample_data.csv", newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    csv_data.append(header)
    for row in reader:
        # date.fromtimestamp() interprets the timestamp in the local time zone
        creation_date = date.fromtimestamp(int(row[1]))
        if start_date <= creation_date <= end_date:
            csv_data.append(row)

if len(csv_data) > 1:  # Anything found besides the header?
    # Print the results in ascending date order.
    print(" ".join(csv_data[0]))
    # Converting the timestamp to int may not be necessary (but doesn't hurt)
    for row in sorted(csv_data[1:], key=lambda r: int(r[1])):
        print(" ".join(row))
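Because `date.fromtimestamp()` uses the machine's local time zone, a row close to midnight can land on a different day depending on where the script runs. A variant that pins the conversion to UTC might look like the sketch below. The `utc_date` and `filter_rows` helpers and the in-memory `SAMPLE` are hypothetical names introduced here, and the data is assumed to be tab-delimited as in the code above.

```python
import csv
import io
from datetime import date, datetime, timezone

# Hypothetical in-memory copy of the sample, assumed tab-delimited.
SAMPLE = "\n".join([
    "id\tcreated_at\tfirst_name\tlast_name",
    "1\t1309380645\tCecelia\tHolt",
    "2\t1237178109\tEmma\tAllison",
    "3\t1303585711\tDesiree\tKing",
    "4\t1231175716\tSam\tDavidson",
]) + "\n"


def utc_date(timestamp):
    """Convert a Unix timestamp (seconds) to a calendar date in UTC."""
    return datetime.fromtimestamp(int(timestamp), tz=timezone.utc).date()


def filter_rows(lines, start, end):
    """Return rows dated within start..end (inclusive), sorted by created_at."""
    reader = csv.DictReader(lines, delimiter="\t")
    matching = [r for r in reader if start <= utc_date(r["created_at"]) <= end]
    # Sort numerically; the csv module yields every field as a string.
    return sorted(matching, key=lambda r: int(r["created_at"]))


rows = filter_rows(io.StringIO(SAMPLE), date(2011, 1, 1), date(2011, 12, 31))
for r in rows:
    print(r["id"], utc_date(r["created_at"]), r["first_name"], r["last_name"])
```

Using `csv.DictReader` lets the code refer to `created_at` by name rather than by column position, which is a design choice of this sketch and not something the original answer does.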