Question

I have two very large files:

File1 is formatted as such:
thisismy@email.com:20110708
thisisnotmy@email.com:20110908
thisisyour@email.com:20090807
...

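File2 is a csv file that has the same email addresses in the row[0] field, and I need to put the date into the row[5] field.

I understand how to properly read & parse the csv, and I understand how to read File1 and split it correctly.

What I need help with is how to properly search the csv file for any instance of the email address, and update the csv with the corresponding date.

Thanks for your assistance.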

Answer 1

You may want to use the re module for the check:

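import re
# re.MULTILINE lets ^ match at the start of every line, not only the first
with open('filename.csv') as f:
    emails = re.findall(r'^(.*@.*?):', f.read(), re.MULTILINE)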

This will give you all the emails.
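
As a rough illustration of the same idea, here is a Python 3 sketch that runs against the sample lines from the question instead of a real file. The `sample` string, the stricter pattern, and the `email_to_date` name are assumptions made for the example, not part of the answer above.

```python
import re

# Sample data copied from the File1 format shown in the question (illustrative only).
sample = (
    "thisismy@email.com:20110708\n"
    "thisisnotmy@email.com:20110908\n"
    "thisisyour@email.com:20090807\n"
)

# re.MULTILINE makes ^ and $ match at every line boundary.
# Group 1 is the email, group 2 is the 8-digit date.
pairs = re.findall(r'^([^:\n]+@[^:\n]+):(\d{8})$', sample, flags=re.MULTILINE)

# findall returns (email, date) tuples when the pattern has two groups,
# so they can be turned straight into a lookup table.
email_to_date = dict(pairs)
print(email_to_date["thisisyour@email.com"])
```

Capturing the date alongside the email means the result can be used directly to look up the value to write into the csv.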

Answer 2

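If the data you want to replace has a fixed size, which seems to be the case in your example, you can use seek(). While reading the file, look for your value, get the cursor position, and write the replacement data starting at the desired position.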

Cf.: Writing in file's actual position in Python
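
A minimal Python 3 sketch of the seek() approach, under several assumptions that are not in the question: File2 is comma-separated with no quoted fields, the date sits in field 5, and the old and new dates have exactly the same byte length. The function name `overwrite_date_in_place` and the demo data are hypothetical.

```python
import os
import tempfile

def overwrite_date_in_place(path, email, new_date):
    """Overwrite field 5 on rows whose first field equals `email`.

    Only safe when the replacement has the same byte length as the old value.
    """
    email_b = email.encode("ascii")
    new_b = new_date.encode("ascii")
    updated = 0
    # Binary mode keeps tell()/seek() offsets exact.
    with open(path, "r+b") as f:
        while True:
            line_start = f.tell()
            line = f.readline()
            if not line:
                break
            fields = line.rstrip(b"\r\n").split(b",")
            if (len(fields) > 5 and fields[0] == email_b
                    and len(fields[5]) == len(new_b)):
                # Offset of field 5 = lengths of fields 0-4 plus 5 commas.
                offset = sum(len(x) for x in fields[:5]) + 5
                next_line = f.tell()
                f.seek(line_start + offset)
                f.write(new_b)
                f.seek(next_line)  # resume reading at the following line
                updated += 1
    return updated

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"thisismy@email.com,a,b,c,d,00000000,e\n"
              b"other@email.com,a,b,c,d,00000000,e\n")
demo_path = tmp.name
count = overwrite_date_in_place(demo_path, "thisismy@email.com", "20110708")
with open(demo_path, "rb") as f:
    result = f.read()
os.remove(demo_path)
print(result.decode("ascii"))
```

Because this rewrites bytes in place, a value of a different length would corrupt the row, which is why the length check is part of the match condition.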

However, if you are dealing with extra large files, using command-line tools such as sed could save a lot of processing time.

Answer 3

Example below, tested on Python 2.7:

import csv

# 'b' flag for binary is necessary if on Windows otherwise crlf hilarity ensues
with open('/path/to/file1.txt','rb') as fin:
  csv_reader = csv.reader(fin, delimiter=":")
  # Header in line 1? Skip over. Otherwise no need for next line.
  csv_reader.next() 
  # populate dict with email address as key and date as value
  # dictionary comprehensions supported in 2.7+
  # on a lower version? use: d = dict((line[0],line[1]) for line in csv_reader)
  email_address_dict = {line[0]: line[1] for line in csv_reader}

# there are ways to modify a file in-place
# but it's easier to write to a new file 
with open('/path/to/file2.txt','rb') as fin, \
     open('/path/to/file3.txt','wb') as fou:
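  # NOTE: the question describes File2 as a csv; if it is comma-separated,
  # use delimiter="," for this reader and writer instead of ":"
  csv_reader = csv.reader(fin, delimiter=":")
  csv_writer = csv.writer(fou, delimiter=":")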
  # Header in line 1? Skip over. Otherwise no need for next line.
  csv_writer.writerow( csv_reader.next() ) 
  for line in csv_reader:
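    # construct new line, looking up the date value in the just-created dict
    newline = list(line)
    if line[0] in email_address_dict:
      # the new date value replaces position 5 (zero-based);
      # rows whose email is absent from file1 are written unchanged,
      # which avoids a KeyError on unknown addresses
      newline[5:6] = [email_address_dict[line[0]]]
    csv_writer.writerow(newline)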
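
For readers on Python 3, the same dictionary-lookup approach might look roughly like the sketch below. It assumes File2 is comma-separated and that neither file has a header row; the function name `update_dates`, its parameters, and the demo data are hypothetical.

```python
import csv
import os
import tempfile

def update_dates(lookup_path, csv_path, out_path, date_col=5):
    # Build an email -> date table from the colon-separated lookup file.
    with open(lookup_path, newline="") as fin:
        dates = {row[0]: row[1]
                 for row in csv.reader(fin, delimiter=":")
                 if len(row) >= 2}
    # Stream the csv row by row so only the lookup table is held in memory.
    with open(csv_path, newline="") as fin, \
         open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            # Rows with an unknown email or too few fields pass through unchanged.
            if row and row[0] in dates and len(row) > date_col:
                row[date_col] = dates[row[0]]
            writer.writerow(row)

# Demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    file1 = os.path.join(tmp, "file1.txt")
    file2 = os.path.join(tmp, "file2.csv")
    file3 = os.path.join(tmp, "file3.csv")
    with open(file1, "w", newline="") as f:
        f.write("thisismy@email.com:20110708\nthisisyour@email.com:20090807\n")
    with open(file2, "w", newline="") as f:
        f.write("thisismy@email.com,a,b,c,d,OLD,e\n"
                "unknown@email.com,a,b,c,d,OLD,e\n")
    update_dates(file1, file2, file3)
    with open(file3, newline="") as f:
        rows = list(csv.reader(f))
print(rows)
```

Only the lookup table from File1 has to fit in memory; File2 is processed one row at a time, which matters when both files are very large.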

How to search for a specific string from file1 and update a csv file
