import csv
cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")
for row in creader:
    tn = #current phone number
    crednum = #number of rows with that phone number
    for row in preader:
        purnum = #number of rows with that phone number
    if crednum != 2*(purnum):
        out.write(str(tn) + "\n")
cred.close()
pur.close()
out.close()
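For both files, I am only looking at the first column (column 0), which is the phone number. The files are sorted by phone number, so any duplicates are right next to each other. I need to know how many rows in the cred file have the same phone number, and then how many rows in the pur file have that same phone number. I need to do this repeatedly so that every duplicated phone number is compared between the two files.
For example: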
Credits File
TN,STUFF,THINGS
2476,hseqer,trjar
2476,sthrtj,esreet
3654,rstrhh,trwtr
Purchases File
TN,STUFF,THINGS
2476,hseher,trjdr
3566,sthztj,esrhet
3654,rstjhh,trjtr
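What I need to know is that, in this example, there are 2 instances of 2476 in the credits file and 1 in the purchases file, and then 1 instance of 3654 in the credits file and 1 in the purchases file. I need to go through every phone number in the cred file and get the number of times it occurs in both files, but if a phone number exists in the pur file and not in the cred file, I don't need to count anything. (However, if a number appears 2 times in cred and never in pur, I do need a count of 0 for pur.) Note that the two real files are about 5,000 KB and 13,000 KB and have tens of thousands of rows.
I'm new to Python, so I'm not sure of the best way to go about this. Looping in Python is definitely different from what I'm used to (I mostly use C++).
I'll edit to add anything that's needed, so let me know if anything needs clarification. This isn't like any project I've done before, so the explanation may not be ideal.
EDIT: I think I may have skipped explaining an important factor because it was already in my sample code. I need these counts only to compare them, not necessarily to print them. If crednum != 2 * purnum, then I want to print that phone number and only that phone number; otherwise I don't need to see it in the output file. I never need to actually print the counts, just use them in the comparison to find which phone numbers need to be printed.

Answer 0 (score: 2)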
import csv
cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "w")  # "w" creates the file if needed; "r+" requires it to exist

def x(reader):  # function takes in a reader
    dictionary = {}  # this is a Python data type of key-value pairs
    for row in reader:  # for each row in the reader
        number = row[0]  # take the first element in the row (the number)
        if number == 'TN':  # skip the headers
            continue
        number = int(number)  # convert it to a number now ('TN' cannot be converted, which is why we do it after)
        if number in dictionary:  # if the number has appeared already
            dictionary[number] = dictionary[number] + 1  # increment it
        else:
            dictionary[number] = 1  # else store it in the dictionary as 1
    return dictionary  # return the dictionary

def assertDoubles(credits, purchases):
    outstr = ''
    for key in credits:
        crednum = credits[key]
        # .get(key, 0) treats a number missing from purchases as 0 instead of raising KeyError
        if crednum != 2*purchases.get(key, 0):
            outstr += str(key) + '\n'
            print(key)
    out.write(outstr)

credits = x(creader)
purchases = x(preader)
assertDoubles(credits, purchases)
#print(credits)
#print('-------')
#print(purchases)
cred.close()
pur.close()
out.close()
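I wrote some code. It essentially stores each number you want to check for duplicates as a key in a dictionary. The stored value is the number of times that number occurs in the file. It skips the first row (the header).
With the commented-out print calls enabled, the output looks like this: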
{2476: 2, 3654: 1}
-------
{2476: 1, 3654: 1, 3566: 1}
The new code above outputs only: 3654
EDIT: I updated the code to fix what you were referring to.

Answer 1 (score: 1)
Since you're not interested in new entries, you only need to run through the first file and collect all the entries in the first column (counting them in the process), then run through the second file, check whether each first-column entry was collected in the first step and, if so, count those as well. You can't avoid running the loops needed to read every line of both files, but you can use a hash map (dict) for fast lookups, so:

import csv
import collections

c_phones = collections.defaultdict(int)  # initiate a 'counter' dict to save us some typing
with open("AllCredits.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        c_phones[row[0]] += 1  # increase the count of the current phone

Now that all the phone numbers from the first file are counted and stored in the c_phones dictionary, you should clone it but reset the counters, so that you can count the occurrences of those numbers in the second CSV file:

p_phones = {key: 0 for key in c_phones}  # reset the phone counter for purchases
with open("AllPurchases.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        if row[0] in p_phones:  # we're only interested in phones from both files
            p_phones[row[0]] += 1  # increase the counter

Now that you have both dictionaries with both counts, you can easily iterate over them to print the counts. With your sample data, that yields:

3654 Credits: 1 Purchases: 1
2476 Credits: 2 Purchases: 1
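
The loop that prints the counts is not shown in this answer. A minimal sketch of what it could look like follows, together with the crednum != 2 * purnum check the question asks for. It assumes c_phones and p_phones were filled as described; here they are hard-coded from the sample data so the snippet runs on its own, and the helper names format_counts and mismatched are hypothetical, not part of the original answer.

```python
# Counts hard-coded from the question's sample data (an assumption for
# illustration); in practice they would come from reading the two CSV files.
c_phones = {"2476": 2, "3654": 1}
p_phones = {"2476": 1, "3654": 1}


def format_counts(credits, purchases):
    """Return one 'phone Credits: n Purchases: m' line per credit phone."""
    return [
        "{} Credits: {} Purchases: {}".format(phone, count, purchases.get(phone, 0))
        for phone, count in credits.items()
    ]


def mismatched(credits, purchases):
    """Return phones whose credit count is not twice the purchase count."""
    return [
        phone
        for phone, count in credits.items()
        if count != 2 * purchases.get(phone, 0)
    ]


for line in format_counts(c_phones, p_phones):
    print(line)

# Only these phones would be written to the output file.
print(mismatched(c_phones, p_phones))
```

Iterating over the credits dictionary keeps the output limited to phones that occur in the credits file, which matches the requirement to ignore numbers that appear only in purchases.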
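
Answer 2 (score: 0)
To help my own understanding, I broke the problem down into smaller, more manageable tasks.
Reading the phone numbers is a reusable piece of functionality, so let's split it out: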
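import csv

def read_phone_numbers(file_path):
    phone_numbers = []
    # the with block closes the file even if reading fails
    with open(file_path, 'r', newline='') as file_obj:
        reader = csv.reader(file_obj)
        next(reader, None)  # skip the header row so 'TN' is not counted as a number
        for row in reader:
            phone_numbers.append(row[0])
    return phone_numbers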
For the task of finding duplicates, set() is a useful tool. From the Python docs:
A set is an unordered collection with no duplicate elements.
def find_duplicates(credit_nums, purchase_nums):
    phone_numbers = set(credit_nums)  # the unique credit numbers
    duplicates = []
    for phone_number in phone_numbers:
        credit_count = credit_nums.count(phone_number)
        purchase_count = purchase_nums.count(phone_number)
        if credit_count > 0 and purchase_count > 0:
            duplicates.append({
                'phone_number': phone_number,
                'credit_count': credit_count,
                'purchase_count': purchase_count,
            })
    return duplicates
And putting it all together:
def main(credit_csv_path, purchase_csv_path, out_csv_path):
    credit_nums = read_phone_numbers(credit_csv_path)
    purchase_nums = read_phone_numbers(purchase_csv_path)
    duplicates = find_duplicates(credit_nums, purchase_nums)
    # newline='' is what the csv module recommends for files it writes
    with open(out_csv_path, 'w', newline='') as file_obj:
        writer = csv.DictWriter(
            file_obj,
            fieldnames=['phone_number', 'credit_count', 'purchase_count'],
        )
        writer.writerows(duplicates)
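If you need to process files hundreds of times larger, take a look at the collections.Counter class. Calling list.count() once per unique number rescans the whole list each time, which gets slow as the files grow.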
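
As a rough illustration of the collections.Counter suggestion, the sketch below counts each file in a single pass and then applies the crednum != 2 * purnum comparison from the question. The function names count_phones and find_mismatches are hypothetical, and the sample data is held in io.StringIO objects (an assumption made so the snippet runs without files on disk); real code would pass open file objects instead.

```python
import csv
import io
from collections import Counter


def count_phones(file_obj):
    """Count occurrences of each first-column value, skipping the header."""
    reader = csv.reader(file_obj)
    next(reader, None)  # skip the header row if there is one
    return Counter(row[0] for row in reader if row)


def find_mismatches(credit_counts, purchase_counts):
    """Return credit phones whose count is not twice the purchase count."""
    # A Counter returns 0 for missing keys, so phones absent from the
    # purchases file need no special handling.
    return [
        phone
        for phone, count in credit_counts.items()
        if count != 2 * purchase_counts[phone]
    ]


# Sample data from the question, held in memory instead of on disk.
credits_csv = io.StringIO(
    "TN,STUFF,THINGS\n2476,hseqer,trjar\n2476,sthrtj,esreet\n3654,rstrhh,trwtr\n"
)
purchases_csv = io.StringIO(
    "TN,STUFF,THINGS\n2476,hseher,trjdr\n3566,sthztj,esrhet\n3654,rstjhh,trjtr\n"
)

credit_counts = count_phones(credits_csv)
purchase_counts = count_phones(purchases_csv)
print(find_mismatches(credit_counts, purchase_counts))
```

Each file is read once and every lookup is a dictionary access, so the work grows roughly linearly with the number of rows rather than with rows times unique numbers.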
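
Answer 3 (score: 0)
The way I understand your situation, you have two files, cred and pur.
For each TN in cred, you want to find out whether the same TN exists in pur. If it does, return its count; if not, return 0.
You can use pandas. The algorithm could go as follows: aggregate pur by TN and count the rows (step 1), then for each TN in cred look up that count, using 0 when the TN is absent (step 2).
Here is an example: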
import pandas as pd

# read the csv
# I create my own frames here, as suggested in your description
cred = pd.DataFrame(
    dict(
        TN = [2476, 2476, 3654],
        STUFF = ['hseqer', 'sthrtj', 'rstrhh'],
        THINGS = ['trjar', 'esreet', 'trwtr']
    ),
    columns = ['TN','STUFF','THINGS']
)
pur = pd.DataFrame(
    dict(
        TN = [2476, 3566, 3654, 2476],
        STUFF = ['hseher', 'sthztj', 'rstjhh', 'hseher'],
        THINGS = ['trjdr', 'esrhet', 'trjtr', 'trjdr']
    ),
    columns = ['TN','STUFF','THINGS']
)
dfpur = pur.groupby('TN').TN.count()  # agg and count (step 1)

# step 2
count = []
for tnval in cred.TN:
    if tnval in dfpur.index:
        count.append(dfpur.loc[tnval])
    else:
        count.append(0)
There you go! You now have your counts in the list.
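
The explicit loop in step 2 can also be written as a vectorized lookup. The sketch below is an alternative, not the answerer's code: it assumes the same TN values as the example frames above and uses Series.value_counts and Series.map to produce the same list of counts.

```python
import pandas as pd

# Same TN values as the example frames; the other columns are not needed here.
cred = pd.DataFrame({"TN": [2476, 2476, 3654]})
pur = pd.DataFrame({"TN": [2476, 3566, 3654, 2476]})

# Number of rows per TN in the purchases frame, indexed by TN.
pur_counts = pur["TN"].value_counts()

# Look up each credit TN in pur_counts; TNs absent from pur come back
# as NaN, which fillna(0) turns into a count of 0.
count = cred["TN"].map(pur_counts).fillna(0).astype(int).tolist()
print(count)
```

The result has one entry per row of cred, in the same order, just like the list built by the loop.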