Python - Looping over two CSV files to compare the number of duplicate entries in each file

Time: 2018-04-06 01:42:52

Tags: python csv

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)

pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)

out = open("output.txt", "r+")

for row in creader:
    tn = #current phone number
    crednum = #number of rows with that phone number
        for row in preader:
            purnum = #number of rows with that phone number
            if crednum != 2*(purnum):
                out.write(str(tn) + "\n")

cred.close()
pur.close()
out.close()

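For both files I only look at the first column (column 0), which holds the phone number. The files are sorted by phone number, so any duplicates are adjacent to each other. I need to know how many rows in the cred file have the same phone number, and then how many rows in the pur file have that same phone number. I need to repeat this so that all duplicated phone numbers are compared between the files.

For example: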

    Credits File
 TN,STUFF,THINGS
 2476,hseqer,trjar
 2476,sthrtj,esreet
 3654,rstrhh,trwtr

    Purchases File
 TN,STUFF,THINGS
 2476,hseher,trjdr
 3566,sthztj,esrhet
 3654,rstjhh,trjtr

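What I need to know in this example is that there are 2 instances of 2476 in the credits file and 1 instance in the purchases file, and then 1 instance of 3654 in the credits file and 1 instance in the purchases file. I need to take every phone number in the cred file and get the number of times it occurs in both files, but if a phone number exists in the pur file and not in the cred file, I don't need to count anything. (However, if a number occurs 2 times in cred and not at all in pur, I do need to get 0 for the purchase count.) Note that the two real files are 5,000 KB and 13,000 KB in size and have tens of thousands of rows.

I'm new to Python, so I'm not sure of the best way to go about this. Looping in Python is definitely different from what I'm used to (I mostly use C++).

I'll edit to add anything that's needed, so let me know if anything needs clarifying. This isn't like any project I've done before, so the explanation may not be ideal.

EDIT: I think I may have skipped over explaining an important factor because it was included in my sample code. I only need the counts in order to compare them, not necessarily to print them. If crednum != 2*purnum, then I want to print that phone number and only the phone number; otherwise I don't need to see it in the output file. I never need to actually print the counts, just use them for the comparison to figure out which phone numbers need printing.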

4 Answers:

Answer 0 (score: 2)

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)

pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)

out = open("output.txt", "w")  # "w" creates the file if it is missing and clears old content

def x(reader):  # takes a csv reader and counts the phone numbers in it
    dictionary = {}  # a Python data type holding key/value pairs
    for row in reader:  # for each row in the reader
        number = row[0]  # take the first element in the row (the number)
        if number == 'TN':  # skip the header
            continue
        number = int(number)  # convert to int now ('TN' cannot be converted, so we do it after the check)
        if number in dictionary:  # if the number has already appeared
            dictionary[number] = dictionary[number] + 1  # increment its count
        else:
            dictionary[number] = 1  # otherwise store it with a count of 1
    return dictionary  # maps phone number -> number of rows

def assertDoubles(credits, purchases):
    outstr = ''
    for key in credits:
        crednum = credits[key]
        if crednum != 2 * purchases.get(key, 0):  # a number missing from purchases counts as 0
            outstr += str(key) + '\n'
            print(key)
    out.write(outstr)

credits = x(creader)
purchases = x(preader)

assertDoubles(credits,purchases)


#print(credits)
#print('-------')
#print(purchases)

cred.close()
pur.close()
out.close()
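I wrote some code. Essentially it stores the numbers you want to check for duplicates as keys in a dictionary. The stored value is the number of occurrences of that number in the file. It skips the first line (the header).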

The output of the commented-out print calls looks like this:

{2476: 2, 3654: 1}
-------
{2476: 1, 3654: 1, 3566: 1}

The new code above outputs only: 3654

EDIT: I updated the code to fix what you were referring to.
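
To check the counting logic without the real files, the same idea can be tried on the question's sample data held in memory. This is only a sketch under a few assumptions: `count_phones` and `mismatched` are illustrative names, the header is skipped with `next()` instead of comparing against 'TN', and the phone numbers are kept as strings rather than converted with int().

```python
import csv
import io

# The sample data from the question, held in memory instead of on disk.
CREDITS = """TN,STUFF,THINGS
2476,hseqer,trjar
2476,sthrtj,esreet
3654,rstrhh,trwtr
"""

PURCHASES = """TN,STUFF,THINGS
2476,hseher,trjdr
3566,sthztj,esrhet
3654,rstjhh,trjtr
"""


def count_phones(text):
    """Count how often each phone number (column 0) occurs, skipping the header."""
    counts = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        if not row:  # ignore blank lines
            continue
        counts[row[0]] = counts.get(row[0], 0) + 1
    return counts


def mismatched(credits, purchases):
    """Phone numbers whose credit count is not exactly twice the purchase count."""
    return [tn for tn, n in credits.items() if n != 2 * purchases.get(tn, 0)]


credits = count_phones(CREDITS)
purchases = count_phones(PURCHASES)
print(mismatched(credits, purchases))
```

Here 2476 is not reported because its 2 credits equal twice its 1 purchase, while 3654 is reported because 1 credit is not twice 1 purchase.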

Answer 1 (score: 1)

Since you're not interested in new entries, all you need to do is run through the first file and collect all the entries in its first column (counting them in the process), then run through the second file, check whether each of its first-column entries was collected in the first step and, if so, count those as well. You can't avoid running the loops needed to read every line of both files, but you can use a hash map (dict) for fast lookups, so:

import csv
import collections

c_phones = collections.defaultdict(int)  # initiate a 'counter' dict to save us some typing

with open("AllCredits.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        c_phones[row[0]] += 1  # increase the count of the current phone

Now that you have all the phone numbers from the first file stored in the c_phones dictionary, you should clone it but reset the counters, so that you can count the occurrences of those numbers in the second CSV file:

p_phones = {key: 0 for key in c_phones}  # reset the phone counter for purchases

with open("AllPurchases.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        if row[0] in p_phones:  # we're only interested in phones from both files
            p_phones[row[0]] += 1  # increase the counter

Now that you have both dictionaries and both counts, you can easily iterate over them to print out the counts.
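
A minimal sketch of a loop that prints both counts side by side. The format string is an assumption inferred from the layout of the sample output, and `format_counts` plus the two small dictionaries are illustrative stand-ins for the real `c_phones` and `p_phones`:

```python
def format_counts(c_phones, p_phones):
    """Return one formatted line per phone number found in the credits file."""
    template = "{:<15} Credits: {:<4} Purchases: {:<4}"  # assumed column layout
    return [
        template.format(key, c_phones[key], p_phones.get(key, 0))
        for key in c_phones
    ]


# Illustrative counts matching the question's sample data.
c_phones = {"2476": 2, "3654": 1}
p_phones = {"2476": 1, "3654": 1}

for line in format_counts(c_phones, p_phones):
    print(line)
```

To get what the question actually asks for, the same loop could test `c_phones[key] != 2 * p_phones[key]` and write only the matching keys to the output file.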

With your example data, this yields:

3654            Credits: 1    Purchases: 1   
2476            Credits: 2    Purchases: 1 

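Answer 2 (score: 0)

To help me understand, I broke the problem down into smaller, more manageable tasks:

  • Read the phone numbers from the first column of two sorted csv files.
  • Find the duplicate numbers that appear in both lists of phone numbers.

Reading the phone numbers is a reusable piece of functionality, so let's split it out: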

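import csv

def read_phone_numbers(file_path):
    """Return the first-column values (phone numbers) of a CSV file."""
    phone_numbers = []
    with open(file_path, 'r') as file_obj:  # the file is closed automatically
        reader = csv.reader(file_obj)
        next(reader, None)  # skip the header row so 'TN' is not counted as a number
        for row in reader:
            if row:  # ignore blank lines
                phone_numbers.append(row[0])
    return phone_numbers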

For the task of finding duplicates, set() is a useful tool. From the Python docs:

  "A set is an unordered collection with no duplicate elements."

def find_duplicates(credit_nums, purchase_nums):
    phone_numbers = set(credit_nums)  # the unique credit numbers
    duplicates = []

    for phone_number in phone_numbers:
        credit_count = credit_nums.count(phone_number)
        purchase_count = purchase_nums.count(phone_number)

        if credit_count > 0 and purchase_count > 0:
            duplicates.append({
                'phone_number': phone_number,
                'credit_count': credit_count,
                'purchase_count': purchase_count,
            })

    return duplicates
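
Because the question says both files are sorted by phone number, equal numbers sit next to each other, so the counts can also be collected in a single pass with itertools.groupby instead of calling list.count() once per number (each such call rescans the whole list). A sketch, where `count_sorted` is an illustrative name and the two lists stand in for the values read from the files:

```python
from itertools import groupby


def count_sorted(phone_numbers):
    """Count runs of equal adjacent values; assumes the input is sorted."""
    return {number: sum(1 for _ in run) for number, run in groupby(phone_numbers)}


# Stand-ins for the first columns of the two sample files.
credit_nums = ["2476", "2476", "3654"]
purchase_nums = ["2476", "3566", "3654"]

credit_counts = count_sorted(credit_nums)
purchase_counts = count_sorted(purchase_nums)

for number, credit_count in credit_counts.items():
    # .get(number, 0) yields 0 for a number that never appears in purchases
    print(number, credit_count, purchase_counts.get(number, 0))
```

If the input were not sorted, a number split across several runs would keep only the count of its last run, so this relies on the sort order stated in the question.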

And putting it all together:

def main(credit_csv_path, purchase_csv_path, out_csv_path):
    credit_nums = read_phone_numbers(credit_csv_path)
    purchase_nums = read_phone_numbers(purchase_csv_path)
    duplicates = find_duplicates(credit_nums, purchase_nums)

    with open(out_csv_path, 'w') as file_obj:
        writer = csv.DictWriter(
            file_obj,
            fieldnames=['phone_number', 'credit_count', 'purchase_count'],
        )
        writer.writerows(duplicates)

If you need to process files hundreds of times larger, take a look at collections.Counter.
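
As a rough illustration of that suggestion, collections.Counter tallies a whole list in one pass and returns 0 for keys it has never seen, which suits the "0 if the number is not in pur" requirement from the question. The lists below are stand-ins for the values read from the two files, and `flagged` is an illustrative name:

```python
from collections import Counter

# Stand-ins for the first columns of the two sample files.
credit_nums = ["2476", "2476", "3654"]
purchase_nums = ["2476", "3566", "3654"]

credit_counts = Counter(credit_nums)      # one pass over the list
purchase_counts = Counter(purchase_nums)  # missing keys read as 0

# Keep only the numbers whose credit count is not twice the purchase count.
flagged = [
    number
    for number in credit_counts
    if credit_counts[number] != 2 * purchase_counts[number]
]
print(flagged)
```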

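Answer 3 (score: 0)

The way I understand your situation, you have two files, namely cred and pur.

Now, for each tn in cred, find out whether the same tn exists in pur. If it exists, return the count; if not, return 0.

You can use pandas, and the algorithm can be as follows:

  1. Aggregate pur by TN and count.
  2. For each row in cred, get the count, else 0.
  3. Below is an example: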

    import pandas as pd
    
    # read the csv
    # i create my own as suggested in your desc
    cred = pd.DataFrame(
            dict(
                TN = [2476, 2476, 3654],
                STUFF = ['hseqer', 'sthrtj', 'rstrhh'],
                THINGS = ['trjar', 'esreet', 'trwtr']
            ),
            columns = ['TN','STUFF','THINGS']
            )
    
    pur = pd.DataFrame(
            dict(
                TN = [2476, 3566, 3654, 2476],
                STUFF = ['hseher', 'sthztj', 'rstjhh', 'hseher'],
                THINGS = ['trjdr', 'esrhet', 'trjtr', 'trjdr']
            ),
            columns = ['TN','STUFF','THINGS']
            )
    
    dfpur = pur.groupby('TN').TN.count() # agg and count (step 1)
    
    # step 2
    count = []
    for row, tnval in enumerate(cred.TN):
        if cred.at[row, 'TN'] in dfpur.index:
            count.append(dfpur[tnval])
        else:
            count.append(0)
    
There you go! You have your counts in the list.
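
The two steps can also be written without an explicit Python loop. This is a sketch rather than a drop-in replacement: it keeps only the TN column, the column name `pur_count` is made up for illustration, and 9999 is a hypothetical number added to cred to show the 0 case.

```python
import pandas as pd

# Only the TN column matters for counting; 9999 is a made-up number absent from pur.
cred = pd.DataFrame({"TN": [2476, 2476, 3654, 9999]})
pur = pd.DataFrame({"TN": [2476, 3566, 3654, 2476]})

# Step 1: count how many rows each TN has in pur.
pur_counts = pur["TN"].value_counts()

# Step 2: look up each cred TN in those counts; TNs missing from pur become 0.
cred["pur_count"] = cred["TN"].map(pur_counts).fillna(0).astype(int)
print(cred)
```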