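Extract records from a CSV and filter by date

Time: 2017-06-29 14:35:11

Tags: python csv

I have a CSV file in which each record is a LinkedIn contact. I need to create another CSV file that contains only the contacts who connected with me after a specific date (for example, all contacts who connected after 1 January 2017). Here is my implementation: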

import csv
from collections import OrderedDict

from dateutil import parser  # assumed: the third-party python-dateutil package


def import_from_csv(file):
    key_order = ("FirstName","LastName","EmailAddress","Company","ConnectedOn")
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader=csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            single_person = {"FirstName": row["FirstName"], "LastName": row["LastName"],
                             "EmailAddress": row["EmailAddress"], "Company": row["Company"],
                             "ConnectedOn": parser.parse(row["ConnectedOn"])}
            # rebuild the row with its keys in the order given by key_order
            od = OrderedDict((k, single_person[k]) for k in key_order)
            linkedin_contacts.append(od)
    return linkedin_contacts

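This first function gives me a list of ordered dicts. I am not sure whether the way I get the right order is a good one. I have also seen examples (such as here) that use the od.update method, which I do not use, but I don't think I need it. Is that right?

Next, I wrote a second function to filter the list: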

def filter_by_date(connections):
    filtered_list = []
    target_date = parser.parse("01/04/2017")
    for row in connections:
        if row["ConnectedOn"] > target_date:
            filtered_list.append(row)
    return filtered_list

Am I doing this correctly?

Is there a way to optimize the code? Thanks.

3 answers:

Answer 0 (score: 1)

For filtering, you can use the filter() function:

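from datetime import datetime

def filter_by_date(connections):
    # the format string must match the date text; day/month/year is assumed here
    target_date = datetime.strptime("01/04/2017", '%d/%m/%Y').date()
    return list(filter(lambda x: x["ConnectedOn"] > target_date, connections))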

Instead of creating a plain dict and then copying its values into an OrderedDict, you can write the values directly into the OrderedDict:

for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    # adjust the format string to match the dates in your CSV
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], '%Y/%m/%d').date()
    linkedin_contacts.append(od)

If you know the date format, you do not need python_dateutil; you can use the built-in datetime.datetime.strptime() with the format you need.
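
To illustrate why knowing the format matters, here is a minimal sketch using only the standard library. The date text is the one from the question; which of the two readings is correct depends on how the CSV export writes its dates, which the thread does not say.

```python
from datetime import datetime

date_text = "01/04/2017"

# The same text yields different dates depending on the format string.
day_first = datetime.strptime(date_text, "%d/%m/%Y")    # read as 1 April 2017
month_first = datetime.strptime(date_text, "%m/%d/%Y")  # read as 4 January 2017

# A format string that does not match the text raises ValueError.
try:
    datetime.strptime(date_text, "%Y/%m/%d")
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(day_first.date(), month_first.date(), mismatch_raised)
```

With an explicit format, the day/month ambiguity is resolved by you rather than guessed, and malformed input fails loudly.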

Answer 1 (score: 1)

Because your format string does not match the date text.

Use:

from datetime import datetime

format = '%d/%m/%Y'
date_text = '01/04/2017'

# the inverse operation is datetime.strftime(format)
datetime.strptime(date_text, format)  


#....
# with format defined as a module-level global
for row in reader:
    od = OrderedDict()
    od["FirstName"] = row["FirstName"]
    od["LastName"] = row["LastName"]
    od["EmailAddress"] = row["EmailAddress"]
    od["Company"] = row["Company"]
    # strptime is a method of the datetime class, so it must be qualified
    od["ConnectedOn"] = datetime.strptime(row["ConnectedOn"], format)
    linkedin_contacts.append(od)

Then:

def filter_by_date(connections, date_text):
    target_date = datetime.strptime(date_text, format)
    return [x for x in connections if x["ConnectedOn"] > target_date]
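
One detail worth checking when combining these snippets: the stored values and the target must be of the same type. A minimal sketch with hypothetical sample rows (the names and dates are made up, and this variant of filter_by_date takes an already parsed target) shows that comparing a date with a datetime raises TypeError:

```python
from datetime import date, datetime

def filter_by_date(connections, target_date):
    """Return the rows whose "ConnectedOn" value is later than target_date."""
    return [row for row in connections if row["ConnectedOn"] > target_date]

# Hypothetical sample rows, already parsed to date objects.
contacts = [
    {"FirstName": "Ann", "ConnectedOn": date(2016, 12, 31)},
    {"FirstName": "Bob", "ConnectedOn": date(2017, 3, 15)},
]

# date compared with date works as expected.
recent = filter_by_date(contacts, date(2017, 1, 1))

# Mixing types fails: a date cannot be ordered against a datetime.
try:
    filter_by_date(contacts, datetime(2017, 1, 1))
    mixed_ok = True
except TypeError:
    mixed_ok = False

print([row["FirstName"] for row in recent], mixed_ok)
```

So either call .date() on both sides, or on neither.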

Answer 2 (score: 1)

First point: you don't need an OrderedDict at all; you only need to use a csv.DictWriter to write the filtered CSV.

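    fieldnames = ("FirstName","LastName","EmailAddress","Company","ConnectedOn")
    # Python 3: open in text mode with newline="" so the csv module controls line endings
    with open("/path/to/final.csv", "w", newline="", encoding="utf8") as f:
        # extrasaction="ignore" skips any columns not listed in fieldnames
        writer = csv.DictWriter(f, fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(filtered_contacts)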
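
As a quick check of that claim, the following sketch writes one hypothetical row (made-up values, keys deliberately out of order, plus an extra "Position" column) to an in-memory buffer. The column order comes from fieldnames, not from the dict:

```python
import csv
import io

fieldnames = ("FirstName", "LastName", "EmailAddress", "Company", "ConnectedOn")

# Hypothetical row: keys in a different order, plus one column not in fieldnames.
row = {
    "ConnectedOn": "2017-03-15",
    "Company": "Acme",
    "Position": "Engineer",
    "EmailAddress": "ann@example.com",
    "LastName": "Lee",
    "FirstName": "Ann",
}

buffer = io.StringIO()
# extrasaction="ignore" drops keys missing from fieldnames instead of raising ValueError.
writer = csv.DictWriter(buffer, fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerow(row)

output = buffer.getvalue()
print(output)
```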

Second point: you don't need to build a new dict from the dict the csv reader yields; just update the ConnectedOn key:

def import_from_csv(file):
    linkedin_contacts = []
    with open(file, encoding="utf8") as csvfile:
        reader=csv.DictReader(csvfile, delimiter=',')
        for row in reader:
            row["ConnectedOn"] = parser.parse(row["ConnectedOn"])
            linkedin_contacts.append(row)
    return linkedin_contacts

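Finally, if all you need to do is take the source CSV, filter the records on ConnectedOn, and write out the result, you don't need to load the whole source into memory, create a filtered list (again in memory), and then write that list. You can stream the whole operation: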

def filter_csv(source_path, dest_path, date):
    fieldnames = ("FirstName","LastName","EmailAddress","Company","ConnectedOn")
    target = parser.parse(date)

    # Python 3: text mode with newline="" for both files, as the csv module expects
    with open(source_path, newline="", encoding="utf8") as source, \
         open(dest_path, "w", newline="", encoding="utf8") as dest:
        reader = csv.DictReader(source)
        # extrasaction="ignore" skips source columns not listed in fieldnames
        writer = csv.DictWriter(dest, fieldnames, extrasaction="ignore")
        # if you want a header line with the fieldnames - else comment it out
        writer.writeheader()

        for row in reader:
            row_date = parser.parse(row["ConnectedOn"])
            if row_date > target:
                writer.writerow(row)

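And there you have it, plain and simple.

Note: I don't know what "parser.parse()" is, but as the other answers mention, you would probably be better off using the datetime module.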
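
For anyone who wants to try the streaming idea without touching real files, here is a small sketch that works on any open text files, including in-memory buffers. The function name filter_rows, the "%Y-%m-%d" date format, and the sample data are assumptions made for illustration; it uses datetime.strptime rather than parser.parse so that it runs with the standard library alone.

```python
import csv
import io
from datetime import datetime

def filter_rows(source, dest, target, date_format="%Y-%m-%d"):
    """Copy rows connected after target from one open text file to another."""
    reader = csv.DictReader(source)
    # Reuse the source header so every column is carried over unchanged.
    writer = csv.DictWriter(dest, reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # Only one row is held in memory at a time.
        if datetime.strptime(row["ConnectedOn"], date_format) > target:
            writer.writerow(row)
            kept += 1
    return kept

# Hypothetical sample data standing in for the source CSV file.
source = io.StringIO(
    "FirstName,ConnectedOn\n"
    "Ann,2016-12-31\n"
    "Bob,2017-03-15\n"
)
dest = io.StringIO()
kept = filter_rows(source, dest, datetime(2017, 1, 1))
print(kept)
print(dest.getvalue())
```

Passing file objects instead of paths keeps the filtering logic separate from how the files are opened, which also makes it easy to test.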