Comparing two CSV files in Python when rows have multiple values

Time: 2015-09-27 19:18:28

Tags: python csv dictionary compare

I have two CSV files that I want to compare. The first looks like this:

"a" 1   6   3   1   8
"b" 15  6   12  5   6
"c" 7   4   1   4   8
"d" 14  8   12  11  4
"e" 1   8   7   13  12
"f" 2   5   4   13  9
"g" 8   6   9   3   3
"h" 5   12  8   2   3
"i" 5   9   2   11  11
"j" 1   9   2   4   9

So "a" holds the numbers 1, 6, 3, 1, 8, and so on. The actual CSV file is 1,000 lines long, so the code needs to be efficient.

The second CSV file looks like this:

4

15

7

9

2

I have already written some code to import these CSV files into lists in Python.

import csv

with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    winningnumbers = list(reader)

wn1 = winningnumbers[0]
wn2 = winningnumbers[1]
wn3 = winningnumbers[2]
wn4 = winningnumbers[3]
wn5 = winningnumbers[4]

print(winningnumbers)

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    enl = list(readere)

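How do I now search the first CSV file using the second one, cross-referencing for example the number 4 (wn1), so that it returns, say, that "b" has wn1 in it? I imported the files as lists to see whether I could work out how to do this, but I ended up going in circles. I also tried using dict(), without success.

2 Answers:

Answer 0: (score: 3)

If I understand correctly, you want to find the first index (or all indexes) of the entries that match the winning numbers. If that is what you want, you can do it like this: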

import csv

# Python 2 code: there the csv module expects files opened in binary mode ('rb')
with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    # Each row holds one number: flatten to a list of strings, skipping blank rows
    winningnumbers = [row[0] for row in reader if row]

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    winning_number_index = -1 # Default value which we will print if nothing is found
    current_index = 0 # Initial index
    for line in readere: # Iterate over entries file
        line = line[1:] # Assumption: the first column is the row label, so drop it before comparing
        all_numbers_match = True # Default value that will be set to False if any of the elements doesn't match with winningnumbers
        for i in range(len(line)):
            if line[i] != winningnumbers[i]: # If values of current line and winningnumbers with matching indexes are not equal
                all_numbers_match = False # Our default value is set to False
                break # Exit "for" without finishing

        if all_numbers_match == True: # If our default value is still True (which indicates that all numbers match)
            winning_number_index = current_index # Current index is written to winning_number_index
            break # Exit "for" without finishing
        else: # Not all numbers match
            current_index += 1 

print(winning_number_index)

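This prints the index of the first winning entry (if you want all indexes, say so in the comments).

Note: this is not the best code for solving the problem. It is simply easier to follow and debug if you are not familiar with Python's more advanced features.

You should also consider not abbreviating your variable names. entries_reader takes a second longer to write than readere and five seconds less to understand.

This variant is faster, shorter and more memory-efficient, but may be harder to understand: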
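If every matching index is wanted rather than only the first, one option is to collect the indexes in a list instead of stopping at the first hit. The following is only a sketch under stated assumptions: the function name `find_matching_indexes` and the sample rows are hypothetical, the rows are assumed to be already read into lists of strings (as csv.reader produces them), and the first column is assumed to be the row label.

```python
def find_matching_indexes(entries, winning_numbers):
    """Return the index of every entry whose numbers equal winning_numbers position by position."""
    matches = []
    for index, row in enumerate(entries):
        numbers = row[1:]  # assumption: the first column is the row label
        if numbers == winning_numbers:
            matches.append(index)
    return matches


# Hypothetical sample data, shaped like the rows csv.reader would return
sample_entries = [
    ["a", "1", "6", "3"],
    ["b", "4", "15", "7"],
    ["c", "7", "4", "1"],
    ["d", "4", "15", "7"],
]
sample_winning = ["4", "15", "7"]

print(find_matching_indexes(sample_entries, sample_winning))  # [1, 3]
```

An empty list then plays the role that -1 plays in the single-index version.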

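import csv

with open('winningnumbers.csv', 'rb') as wn:
    reader = csv.reader(wn)
    # Each row holds one number: flatten to a list of strings, skipping blank rows
    winningnumbers = [row[0] for row in reader if row]

with open('Entries#x.csv', 'rb') as en:
    readere = csv.reader(en)
    for line_index, line in enumerate(readere):
        line = line[1:] # Assumption: the first column is the row label, so drop it before comparing
        if all(line[i] == winningnumbers[i] for i in xrange(len(line))):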
            winning_number_index = line_index
            break
    else:
        winning_number_index = -1

print(winning_number_index)

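The features that may be unclear here are probably enumerate(), all(), and the use of else with for rather than with if. Let's go through them one by one.

To understand this use of enumerate, you first need to understand this syntax: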

a, b = [1, 2]

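This assigns the variables a and b from the values in the list. In this case a will be 1 and b will be 2. Using this syntax, we can do the following: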

for a, b in [[1, 2], [2, 3], ['spam', 'eggs']]:
    # do something with a and b

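On each iteration, a and b will be 1 and 2, then 2 and 3, then 'spam' and 'eggs'.

Suppose we have a list a = ['spam', 'eggs', 'potatoes']. enumerate() simply returns a "list" like this: [(0, 'spam'), (1, 'eggs'), (2, 'potatoes')]. So, when we use it like this,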

for line_index, line in enumerate(readere):
    # Do something with line_index and line

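line_index will be 0, 1, 2, etc.

The all() function accepts an iterable (list, tuple, etc.) and returns True if all of its elements are true.

The list comprehension mylist = [line[i] == winningnumbers[i] for i in range(len(line))] builds a list, equivalent to the following: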
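To make the behaviour of enumerate() concrete, here is a small sketch on an in-memory list (the variable names are illustrative only). enumerate() counts from 0 unless a different start value is passed.

```python
words = ['spam', 'eggs', 'potatoes']

# enumerate() yields (index, item) pairs, counting from 0 by default
pairs = list(enumerate(words))
print(pairs)  # [(0, 'spam'), (1, 'eggs'), (2, 'potatoes')]

# Each pair is unpacked into two loop variables
for index, word in enumerate(words):
    print("{} -> {}".format(index, word))

# The optional start argument changes the first index
pairs_from_one = list(enumerate(words, start=1))
print(pairs_from_one)  # [(1, 'spam'), (2, 'eggs'), (3, 'potatoes')]
```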

mylist = []
for i in range(len(line)):
    mylist.append(line[i] == winningnumbers[i]) # a == b will return True if a is equal to b

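So all() returns True only if all the numbers in the entry match the winning numbers.

The code in the else part of a for loop runs only when the loop was not interrupted by break, so in our case it is a good place to set the default index to return.

Answer 1: (score: 1)

Repeated numbers do not seem logical, but if you want a count of the matching numbers in each row regardless of index, make nums a set and sum how many of the numbers in each row are in that set: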
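The difference between all() and any() is easy to check on a few short lists. This is an illustrative sketch with made-up values, not part of the original answer:

```python
entry = ['4', '15', '7']
winning = ['4', '15', '7']
other = ['4', '15', '9']

# all() is True only when every comparison is True
full_match = all(entry[i] == winning[i] for i in range(len(entry)))
partial_match = all(other[i] == winning[i] for i in range(len(other)))

# any() is True as soon as one comparison is True
some_match = any(other[i] == winning[i] for i in range(len(other)))

# all() over an empty iterable is True, so an empty row would "match" anything
empty_case = all(x for x in [])

print("full={} partial={} some={} empty={}".format(
    full_match, partial_match, some_match, empty_case))
```

The last case is worth keeping in mind if the entries file can contain blank lines.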
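The for/else construct can be tried in isolation. The helper below is a hypothetical example (the name `first_index_of` is not from the answer) showing that the else block runs only when the loop ends without break:

```python
def first_index_of(target, items):
    """Return the index of the first item equal to target, or -1 if there is none."""
    for index, item in enumerate(items):
        if item == target:
            found = index
            break  # leaving the loop with break skips the else block
    else:
        # runs only when the loop finished without hitting break
        found = -1
    return found


print(first_index_of('7', ['4', '15', '7']))  # 2
print(first_index_of('9', ['4', '15', '7']))  # -1
```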

from itertools import islice, imap
import csv
with open("in.txt") as f,open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    # iterate over each row and sum how many matches we get
    for row in r:
        print("{} matched {}".format(row[0], sum(n in nums
                                                 for n in islice(row, 1, None))))

With your input, this outputs:

a matched 0
b matched 1
c matched 2
d matched 1
e matched 0
f matched 2
g matched 0
h matched 1
i matched 1
j matched 2

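This assumes that your file is comma-separated and that your numbers file has one number per line.

If you actually want to know which numbers are present, you need to iterate over the numbers in each row and print every one that is in our set: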
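The sample in the question appears to be separated by whitespace rather than commas. If the real file is tab-separated (an assumption, since the question does not say), csv.reader can be given a different delimiter. The sketch below uses hypothetical in-memory lines instead of a file:

```python
import csv

# Hypothetical tab-separated lines standing in for the entries file
lines = ['"a"\t1\t6\t3', '"b"\t15\t6\t12']

# csv.reader accepts any iterable of strings, not only file objects
rows = list(csv.reader(lines, delimiter='\t'))
print(rows)  # [['a', '1', '6', '3'], ['b', '15', '6', '12']]
```

The quotes around the label are removed by the reader, so row[0] is the bare letter.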

from itertools import islice, imap
import csv

with open("in.txt") as f, open("numbers.txt") as nums:
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    for row in r:
        for n in islice(row, 1, None):
            if n in nums:
                print("{} is in row {}".format(n, row[0]))
        print("")

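But again, I am not sure that repeated numbers make sense.

To group the rows by their number of matches, you can use the sum as the key and append the first-column value: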
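If a number that is repeated within a row should count only once, one option is to turn the row's numbers into a set and intersect it with the winning set. This is a sketch of that alternative, not the method used in the code above; the variable names are illustrative, and the two rows and the winning numbers are copied from the question's sample.

```python
winning = {'4', '15', '7', '9', '2'}

# Two rows from the question's sample, label first; '4' appears twice in row "c"
rows = [
    ['c', '7', '4', '1', '4', '8'],
    ['a', '1', '6', '3', '1', '8'],
]

distinct_matches = {}
for row in rows:
    label, numbers = row[0], row[1:]
    # set intersection keeps each matching number once
    distinct_matches[label] = sorted(winning & set(numbers), key=int)

print(distinct_matches)
```

Taking len() of each list then gives the count of distinct matching numbers per row.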

from itertools import islice, imap
import csv
from collections import defaultdict
with open("in.txt") as f,open("numbers.txt") as nums:
    # make a set of all winning nums
    nums = set(imap(str.rstrip, nums))
    r = csv.reader(f)
    results = defaultdict(list)
    # iterate over each row and sum how many matches we get
    for row in r:
        results[sum(n in nums for n in islice(row, 1, None))].append(row[0])

Result:

defaultdict(<type 'list'>,
 {0: ['a', 'e', 'g'], 1: ['b', 'd', 'h', 'i'], 
 2: ['c', 'f', 'j']})

The keys are the number of matches, and the values are the IDs of the rows that matched that many numbers.