Filtering a CSV and sorting in ascending order

Time: 2017-02-07 23:59:34

Tags: python python-3.x csv

I'm new to Python, so I need some help.

I have a CSV file with id, created_at (date), and first/last name columns.

id  created_at  first_name last_name
1   1309380645  Cecelia    Holt
2   1237178109  Emma       Allison
3   1303585711  Desiree    King
4   1231175716  Sam        Davidson

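I want to filter the rows that fall between two dates, say 03-22-2016 and 04-15-2016 (the exact dates don't matter), and then sort those rows in ascending order by created_at.

I know this code just prints all or most of the data: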

import csv
from datetime import datetime

with open("sample_data.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        print(" ".join(row))

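But I don't know how to do the rest, or how to filter on a timestamp such as 1309380645. Would using pandas be more useful to me than the csv module?

Any help, or pointers to guides or books that would deepen my understanding, would be much appreciated.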

2 answers:

Answer 0 (score: 2)

I'd suggest using pandas, since it makes filtering and any further analysis quicker.

# import pandas and datetime
import pandas as pd
import datetime

# read csv file
df = pd.read_csv("sample_data.csv")

# convert created_at from unix time to datetime
df['created_at'] = pd.to_datetime(df['created_at'], unit='s')

# contents of df at this point
#   id          created_at first_name last_name
# 0   1 2011-06-29 20:50:45    Cecelia      Holt
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 2   3 2011-04-23 19:08:31    Desiree      King
# 3   4 2009-01-05 17:15:16        Sam  Davidson

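# filtering example (compares against pd.Timestamp, since newer pandas
# versions may reject comparing a datetime64 column with datetime.date)
df_filtered = df[df['created_at'] <= pd.Timestamp('2011-03-22')]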

# output of df_filtered
#    id          created_at first_name last_name
# 1   2 2009-03-16 04:35:09       Emma   Allison
# 3   4 2009-01-05 17:15:16        Sam  Davidson

# filter based on dates mentioned in the question
df_filtered = df[(df['created_at'] >= pd.Timestamp('2016-03-22')) & (df['created_at'] <= pd.Timestamp('2016-04-15'))]

# output of df_filtered would be empty at this point since the 
# dates are out of this range

# sort
df_sorted = df_filtered.sort_values(['created_at'])

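How filtering works in pandas:

The first thing to know is that applying a comparison operator to a DataFrame column returns a Series of boolean values.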

df['id'] > 2

returns

False
False
 True
 True

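pandas also supports boolean (logical) indexing. So if you pass that boolean Series to the DataFrame as an index, it returns only the rows that correspond to True.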

df[df['id'] > 2]

returns

3   1303585711  Desiree    King
4   1231175716  Sam        Davidson

That is how you can filter easily in pandas.
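To make the boolean-mask idea concrete, here is a minimal sketch that builds the question's four sample rows in memory (instead of reading sample_data.csv), combines two comparisons with `&`, and sorts the result. The 2011 date range is an arbitrary choice so that some sample rows match; it is not taken from the question.

```python
import pandas as pd

# The four sample rows from the question, built in memory for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": [1309380645, 1237178109, 1303585711, 1231175716],
    "first_name": ["Cecelia", "Emma", "Desiree", "Sam"],
    "last_name": ["Holt", "Allison", "King", "Davidson"],
})
df["created_at"] = pd.to_datetime(df["created_at"], unit="s")

# Assumed range for illustration only (not the dates from the question).
start = pd.Timestamp("2011-01-01")
end = pd.Timestamp("2011-12-31")

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than >= and <=.
mask = (df["created_at"] >= start) & (df["created_at"] <= end)

# Indexing with the mask keeps only the True rows; then sort ascending.
result = df[mask].sort_values("created_at")
print(result)
```

With the sample data, only ids 1 and 3 fall in 2011, and sorting puts id 3 (April) ahead of id 1 (June).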

Answer 1 (score: 1)

Downloading and installing (and learning) pandas just to do this seems like overkill.

Here's how to do it using only Python's built-in modules:

import csv
from datetime import date

start_date = date(2011, 1, 1)
end_date = date(2011, 12, 31)

# Read csv data into memory filtering rows by the date in column 2 (row[1]).
csv_data = []
with open("sample_data.csv", newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    header = next(reader)
    csv_data.append(header)
    for row in reader:
        creation_date = date.fromtimestamp(int(row[1]))
        if start_date <= creation_date <= end_date:
            csv_data.append(row)

if len(csv_data) > 1:  # Anything found besides the header row?
    # Print the results in ascending date order.
    print(" ".join(csv_data[0]))
    # Converting the timestamp to int may not be necessary (but doesn't hurt)
    for row in sorted(csv_data[1:], key=lambda r: int(r[1])): 
        print(" ".join(row))
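The same approach can be wrapped in a function, which makes it easier to reuse and test. This is only a sketch under a few assumptions: the name `filter_and_sort` is hypothetical, the data is tab-delimited as in the code above, and timestamps are interpreted as UTC (whereas `date.fromtimestamp` uses the machine's local time zone, so rows near midnight can land on a different day).

```python
import csv
import io
from datetime import date, datetime, timezone


def filter_and_sort(lines, start_date, end_date, delimiter="\t"):
    """Return (header, rows), keeping rows whose created_at is in range.

    Hypothetical helper: column 2 (row[1]) holds a Unix timestamp,
    interpreted here as UTC. Rows come back sorted by that timestamp.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    kept = []
    for row in reader:
        created = datetime.fromtimestamp(int(row[1]), tz=timezone.utc).date()
        if start_date <= created <= end_date:
            kept.append(row)
    kept.sort(key=lambda r: int(r[1]))
    return header, kept


# In-memory stand-in for sample_data.csv, using the rows from the question.
sample = io.StringIO(
    "id\tcreated_at\tfirst_name\tlast_name\n"
    "1\t1309380645\tCecelia\tHolt\n"
    "2\t1237178109\tEmma\tAllison\n"
    "3\t1303585711\tDesiree\tKing\n"
    "4\t1231175716\tSam\tDavidson\n"
)

header, rows = filter_and_sort(sample, date(2011, 1, 1), date(2011, 12, 31))
print(" ".join(header))
for row in rows:
    print(" ".join(row))
```

To run it against a real file, pass an open file object instead of the `io.StringIO` stand-in, for example one opened with `open("sample_data.csv", newline="")`.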