I have two very large files.
File1 is formatted as such:
thisismy@email.com:20110708
thisisnotmy@email.com:20110908
thisisyour@email.com:20090807
...
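File2 is a CSV file that has the same email addresses in the row[0] field, and I need to put the date into the row[5] field.
I understand how to read and parse the CSV properly, and I understand how to read File1 and split it correctly.
What I need help with is how to search the CSV file for every instance of an email address and update the CSV with the corresponding date.
Thanks for your help.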
Answer 0 (score: 1)
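You may want to use the re module:
import re
emails = re.findall(r'^(.*@.*?):', open('filename.csv').read(), re.MULTILINE)
This gives you all the email addresses. The re.MULTILINE flag makes ^ match at the start of every line rather than only at the start of the file.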
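As a rough illustration of the regex idea, the sketch below runs on a short string in File1's `email:date` layout taken from the question. The pattern is stricter than the one in the answer and is only an assumption about what the data looks like; for the comma-separated File2 the trailing `:` would have to become a `,`.

```python
import re

# Sample text in File1's "email:date" layout (taken from the question).
sample = (
    "thisismy@email.com:20110708\n"
    "thisisnotmy@email.com:20110908\n"
    "thisisyour@email.com:20090807\n"
)

# re.MULTILINE lets ^ and $ anchor at the start and end of each line.
# Assumed format: no colons or whitespace in the address, 8-digit date.
pairs = re.findall(r"^([^:\s]+@[^:\s]+):(\d{8})$", sample, re.MULTILINE)

# Turn the (email, date) pairs into a lookup table.
dates_by_email = dict(pairs)
print(dates_by_email["thisisyour@email.com"])
```

Capturing the date together with the address gives a dictionary that can be used directly when updating the CSV.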
Answer 1 (score: 0)
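If the data you are replacing has a fixed size, which seems to be the case in your example, you can use seek(). While reading the file, look for your value, get the cursor position, and write the replacement data starting at the desired position.
Cf.: Writing in file's actual position in Python
However, if you are dealing with extra-large files, command-line tools such as sed can save a lot of processing time.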
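The seek() approach could look roughly like the sketch below. It is only an illustration under strong assumptions: rows are plain comma-separated text without quoted fields, the content is ASCII, and the existing row[5] value is already exactly as wide as the new date. The function name `overwrite_dates_in_place` and the demo file are hypothetical.

```python
import os
import tempfile


def overwrite_dates_in_place(path, email, new_date, field_index=5, sep=b","):
    """Overwrite field `field_index` in every row whose first field is `email`.

    Assumes unquoted delimited rows, ASCII content, and an existing field
    exactly as wide as `new_date`. Returns the number of rows updated.
    """
    email_b = email.encode("ascii")
    date_b = new_date.encode("ascii")
    updated = 0
    with open(path, "r+b") as f:
        while True:
            line_start = f.tell()          # byte position where this row begins
            line = f.readline()
            if not line:
                break
            fields = line.rstrip(b"\r\n").split(sep)
            if fields[0] != email_b or len(fields) <= field_index:
                continue
            if len(fields[field_index]) != len(date_b):
                raise ValueError("field width differs; overwriting would corrupt the row")
            # Byte offset of the target field, counted from the start of the row.
            offset = sum(len(x) + len(sep) for x in fields[:field_index])
            f.seek(line_start + offset)
            f.write(date_b)
            # Jump to the start of the next row before reading on.
            f.seek(line_start + len(line))
            updated += 1
    return updated


# Small demonstration on a throwaway file.
demo_path = os.path.join(tempfile.mkdtemp(), "file2.csv")
with open(demo_path, "wb") as f:
    f.write(b"thisismy@email.com,a,b,c,d,00000000,e\n"
            b"other@email.com,a,b,c,d,00000000,e\n")

updated_rows = overwrite_dates_in_place(demo_path, "thisismy@email.com", "20110708")
with open(demo_path, "rb") as f:
    demo_content = f.read()
```

Because the file is overwritten byte for byte, a width mismatch would damage the neighbouring fields, which is why the sketch refuses to write in that case.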
Answer 2 (score: 0)
The example below was tested on Python 2.7:
import csv

# 'b' flag for binary is necessary on Windows, otherwise CRLF hilarity ensues
with open('/path/to/file1.txt', 'rb') as fin:
    csv_reader = csv.reader(fin, delimiter=":")
    # Header in line 1? Skip over it. Otherwise delete the next line,
    # or the first record will be lost.
    csv_reader.next()
    # populate dict with email address as key and date as value
    # dictionary comprehensions are supported in 2.7+
    # on a lower version? use: d = dict((line[0], line[1]) for line in csv_reader)
    email_address_dict = {line[0]: line[1] for line in csv_reader}

# there are ways to modify a file in-place
# but it's easier to write to a new file
with open('/path/to/file2.txt', 'rb') as fin, \
     open('/path/to/file3.txt', 'wb') as fou:
    # File2 is a CSV; set the delimiter to whatever it actually uses
    # (a comma is assumed here)
    csv_reader = csv.reader(fin, delimiter=",")
    csv_writer = csv.writer(fou, delimiter=",")
    # Header in line 1? Copy it over. Otherwise delete the next line.
    csv_writer.writerow(csv_reader.next())
    for line in csv_reader:
        # construct new line,
        # looking up the date value in the just-created dict
        # (raises KeyError if the address does not appear in File1);
        # the new date value is inserted at position 5 (zero-based)
        newline = line[0:5]
        newline.append(email_address_dict[line[0]])
        newline.extend(line[6:])
        csv_writer.writerow(newline)
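For readers on Python 3, the same dictionary-lookup idea might be sketched as follows. This is not the answerer's code: the helper names `load_dates` and `update_csv`, the comma delimiter for File2, the absence of header rows, and the choice to leave rows with unknown addresses unchanged are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def load_dates(file1_path):
    """Map each email address in File1 ("email:date" per line) to its date."""
    with open(file1_path, newline="") as fin:
        return {row[0]: row[1]
                for row in csv.reader(fin, delimiter=":")
                if len(row) >= 2}


def update_csv(csv_in_path, csv_out_path, dates, date_index=5):
    """Copy the CSV, replacing column `date_index` where the address is known."""
    with open(csv_in_path, newline="") as fin, \
         open(csv_out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            # Rows with an unknown address or too few columns pass through as-is.
            if row and row[0] in dates and len(row) > date_index:
                row[date_index] = dates[row[0]]
            writer.writerow(row)


# Small demonstration on throwaway files.
workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1.txt")
file2 = os.path.join(workdir, "file2.csv")
file3 = os.path.join(workdir, "file3.csv")

with open(file1, "w", newline="") as f:
    f.write("thisismy@email.com:20110708\nthisisyour@email.com:20090807\n")
with open(file2, "w", newline="") as f:
    f.write("thisismy@email.com,a,b,c,d,old,e\nunknown@email.com,a,b,c,d,old,e\n")

update_csv(file2, file3, load_dates(file1))

with open(file3, newline="") as f:
    result_rows = list(csv.reader(f))
```

Only File1's dictionary is held in memory; File2 is streamed row by row into the new file, so the CSV itself can be arbitrarily large.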