Question

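I'm working with two large files of roughly 100K+ rows each. I want to search CSV file #2 for strings contained in CSV file #1, and then, for each match, append another string from CSV file #1 to the matching row of CSV file #2. Here is a sample of the data I'm working with and my expected output: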

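File #1: the string to match against file #2 is the second element; the first element is the value to append to every matching row in file #2. (In each row, the integer is the value to append and the hostname is the string to match.)

Line 1:

3604430123,mta0000cadd503c.mta.net

Line 2:

3604434567,mta0000CADD5638.MTA.NET

Line 3:

3606304758,mta00069234e9a51.DT.COM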

File #2:

Line 1:

4246,211-015617,mta0000cadd503c.mta.net,Legacy,NW MG2,BBand2 ESA,Active

Line 2:

7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN,NW MG2,BBand2 ESA,Active

Line 3:

536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active

Desired output: based on the string match between file #1 and file #2, the integer string from file #1 is appended to the end of the whole matching row from file #2:

Line 1:

4246,211-015617,mta0000cadd503c.mta.net,Legacy,NW MG2,BBand2 ESA,Active,3604430123

Line 2:

7251,ACCOUNT,mta0000CADD5638.MTA.NET,FQDN,NW MG2,BBand2 ESA,Active,3604434567

Line 3:

536887946,874-22558501,mta00069234e9a51.DT.COM,"P",NW MG2,BBand2 ESA,Active,3606304758

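In many cases the letter case of the match string in file #1 differs from the case in file #2, although the characters themselves match, so the match criterion can ignore case. After the integer string from file #1 is appended, the original character case in file #2 must be preserved.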

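I'm a Python newbie. I've been at this for a while and have searched SE posts, but I can't come up with working code that gets me to the point where I can print the rows from file #2 that match a string in file #1. I've tried other approaches, such as writing to a dictionary and using DictReader, but couldn't clear what looked like simple errors in those approaches. So I stripped it down to plain lists, aiming to combine the data with a list comprehension, store the result in a list named output, and eventually write that back to a CSV file. Any help or suggestions would be greatly appreciated.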

import csv

sg = []
fqdn = []
output = []
with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for row in read:
        sg.append(row)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for row in read1:
        fqdn.append(row)


output = output.append([s[0] for s in sg if fqdn[1] in sg])

print output

The result after running it is:

None

Process finished with exit code 0

Answer 1

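You should use a dictionary for file #1 instead of a list, because matching is much easier that way. Just turn fqdn into a dict and, in the loop that reads file #1, set a key-value pair on the dict for each row. I would call .lower() on the match key. That makes every key lowercase, so later you only need to check whether the lowercased version of the field from file #2 is a key in the dictionary: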

import csv

sg = []
fqdn = {}
output = []
with open(r'file2.csv', 'rb') as src:
    read = csv.reader(src, delimiter=',')
    for dataset in read:
        sg.append(dataset)

with open(r'file1.csv', 'rb') as src1:
    read1 = csv.reader(src1, delimiter=',')
    for to_append, to_match in read1:
        fqdn[to_match.lower()] = to_append

for dataset in sg:
    to_append = fqdn.get(dataset[2].lower()) # If the key matched, to_append now contains the string to append, else it becomes None
    if to_append:
        dataset.append(to_append) # Append the field
        output.append(dataset) # Append the row to the result list

print(output)

You can then use csv.writer to create a CSV file from the result.
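As a minimal sketch of that last step, assuming Python 3 and that output is a list of rows (each a list of strings) like the one built above. The helper name write_rows and the file name output.csv are made up for illustration:

```python
import csv


def write_rows(rows, path):
    """Write a list of rows (each a list of strings) to a CSV file.

    Hypothetical helper for illustration; not part of the csv module.
    """
    # In Python 3, newline='' lets the csv module control line endings,
    # which avoids blank lines between rows on Windows.
    with open(path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerows(rows)


# Example call (assumed file name):
# write_rows(output, 'output.csv')
```

On Python 2, which the original code appears to target, the file would be opened with mode 'wb' and without the newline argument.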

Answer 2

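This is a brute-force solution to the problem. For each row of the first file, you search through the rows of the second file for a match. Matching rows are written to the output.csv file in whatever format you configure on the csv writer.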

import csv

with open('file1.csv', 'r') as file1:
    with open('file2.csv', 'r') as file2:
        with open('output.csv', 'w') as outfile:
            writer = csv.writer(outfile)
            reader1 = csv.reader(file1)
            reader2 = csv.reader(file2)

            for row in reader1:
                if not row:
                    continue

                for other_row in reader2:
                    if not other_row:
                        continue

                    # if we found a match, let's write it to the csv file with the id appended
                    if row[1].lower() == other_row[2].lower():
                        new_row = other_row
                        new_row.append(row[0])
                        writer.writerow(new_row)
                        continue

                # reset file pointer to beginning of file
                file2.seek(0)

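You may be tempted to store the information in a data structure before writing it to a file. In my experience, you will eventually end up with larger files and may run into memory problems. I prefer to write each row out as soon as a match is found, to avoid that problem.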
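With 100K rows in each file, the nested loop can make on the order of 100K x 100K row comparisons. One possible middle ground, sketched here rather than taken from either answer, keeps only file #1 in memory as a lowercase lookup and still writes each matching row of file #2 as soon as it is found. It assumes Python 3 and the column positions shown in the question (hostname in the second column of file #1 and the third column of file #2); the function name append_ids is hypothetical.

```python
import csv


def append_ids(file1_path, file2_path, out_path):
    """Append the ID from file #1 to each matching row of file #2.

    Hypothetical helper; returns the number of rows written.
    """
    # Build a lowercase hostname -> ID lookup from the two-column file #1.
    lookup = {}
    with open(file1_path, newline='') as f1:
        for row in csv.reader(f1):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            lookup[row[1].strip().lower()] = row[0].strip()

    matched = 0
    with open(file2_path, newline='') as f2, \
            open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        # Stream file #2 one row at a time, so it is never fully in memory.
        for row in csv.reader(f2):
            if len(row) < 3:
                continue
            # Lowercase only the lookup key; the row keeps its original case.
            to_append = lookup.get(row[2].strip().lower())
            if to_append is not None:
                writer.writerow(row + [to_append])
                matched += 1
    return matched
```

A call such as append_ids('file1.csv', 'file2.csv', 'output.csv') would then do a single pass over each file. The trade-off is that the lookup for file #1 must fit in memory, which should be comfortable for a two-column file of this size.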
