I'm trying to help someone out. I'm by no means an expert programmer, but what I'm trying to do is calculate a value from one CSV based on the year and on IDs from another CSV. The program works as expected if I statically put in a smaller sample size for timing and testing purposes (amount_of_reviews works with the 180 MB CSV). However, when I want it to work on all the data, I seem to lose about 2,000 of the expected 20,245 results (maybe one of the threads fails?). I'm using multiprocessing to reduce the program's running time. I'll post all the code here in the hope that someone with more experience can spot my mistake.
Answer 0 (score: 2)
This code looks like a race condition:
with counter.get_lock():
    counter.value += 1  # I am aware I skip the 0 index here
print(counter.value)
calc(idList[counter.value])
You increment counter while holding its lock, but you then read the counter's value outside the lock in idList[counter.value]. In the meantime, another thread/process may have changed the counter, in which case you read an unexpected value from it. The safe way to write this code is:
value = None
with counter.get_lock():
    counter.value += 1  # I am aware I skip the 0 index here
    value = counter.value
print(value)
calc(idList[value])
Edit: Here is a version of your code with (I believe) all race conditions removed, and with the file I/O removed as well. It works correctly for me. Maybe you can add the file I/O back piece by piece and see where things go wrong:
import csv
import os
from multiprocessing import Process, Lock, Array, Value
import datetime

print(datetime.datetime.now())

idSet = set(range(20245))
idList = sorted(idSet)
listings = []
totalCounter = Value('i', 0)

def calc(id):
    listing = []
    listings.append(listing)

def format_csv(data, lock):
    with lock:
        totalCounter.value += len(data)

def do(counter, lock):
    for id in idList:
        value = None
        # Claim the next index while holding the lock, so no two
        # processes can ever work on the same index.
        with counter.get_lock():
            if counter.value < len(idList):
                value = counter.value
                counter.value += 1
        if value is not None:
            calc(idList[value])
        else:
            # Counter exhausted: report this process's results and stop.
            format_csv(listings, lock)
            break

if __name__ == '__main__':
    lock = Lock()
    sharedCounter = Value('i', 0)
    processes = []
    for i in range(os.cpu_count()):
        processes.append(Process(target=do, args=(sharedCounter, lock)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    print(datetime.datetime.now())
    print('len(idList): %d, total: %d' % (len(idList), totalCounter.value))
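As an aside (not part of the original answer), the same work distribution can be done without a hand-rolled shared counter by letting multiprocessing.Pool partition the IDs. A minimal sketch, with calc as a placeholder for the real per-ID work:

from multiprocessing import Pool

def calc(listing_id):
    # Placeholder for the real per-ID work; returns a dummy result.
    return (listing_id, 0)

if __name__ == '__main__':
    id_list = sorted(range(20245))
    with Pool() as pool:
        # Pool.map splits id_list across worker processes for us,
        # so no shared counter or lock is needed.
        results = pool.map(calc, id_list)
    print('processed %d ids' % len(results))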
Answer 1 (score: 1)
I suggest using pandas to read the files (thanks, Alexander). Then iterate over the listings and sum up all reviews that have the given ID and are from 2019 onwards:
import numpy as np
import pandas
import datetime
import time

listing_csv_filename = r'listings.csv'
reviews_csv_filename = r'reviews.csv'

start = time.time()
df_listing = pandas.read_csv(listing_csv_filename, delimiter=',', quotechar='"')
df_reviews = pandas.read_csv(reviews_csv_filename, delimiter=',', parse_dates=[1])

values = list()
valid_year = df_reviews['date'] > datetime.datetime(2019, 1, 1, 0, 0, 0)
for id_num in df_listing['id']:
    valid = (df_reviews['listing_id'] == id_num) & valid_year
    values.append((id_num, np.sum(valid)))

print(values)
print(time.time() - start)
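Note that this loop scans the entire reviews frame once per listing ID. A one-pass alternative (a sketch reusing the variable names from the snippet above, and essentially the approach Answer 2 takes below) uses value_counts to produce all the counts at once:

# One pass over the reviews: count valid reviews per listing_id,
# then look each listing up in the resulting Series (0 if absent).
counts = df_reviews.loc[valid_year, 'listing_id'].value_counts()
values = [(id_num, int(counts.get(id_num, 0))) for id_num in df_listing['id']]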
Answer 2 (score: 1)
Without digging in too deeply, I'd say there are two main culprits here, and they go hand in hand:
First, the repeated file parsing and iteration. You loop over every ID in your "main loop", so that loop runs 20,025 times. Then, for each ID, you read and iterate over the entire listings file (20,051 lines) and the entire reviews file (493,816 lines). That comes to 20,025 × (20,051 + 493,816) = 10,290,186,675 lines of CSV data read in total.
Second, the multiprocessing itself. I haven't looked into it deeply, but I think it's fair to say the problem can be solved perfectly well without it. As noted above, your program opens the two CSV files for every ID. A bunch of processes all needing to write to the same two files (20,000+ times in total) does performance no favors. I wouldn't be entirely surprised if the code without multiprocessing actually ran faster than the code with it. Daniel Junglas has also already mentioned the potential race conditions.
Okay, it's still a bit of a mess, but I just wanted to get something out before the turn of the decade. I'll keep looking for a better solution. Among other things, the ideal solution may differ slightly depending on how many listings appear in the reviews but not in listings.csv.
import numpy as np
import pandas as pd
listings_df = pd.read_csv('../resources/listings.csv', header=0, usecols=['id'], dtype={'id': str})
reviews_df = pd.read_csv('../resources/reviews.csv', header=0, parse_dates=['date'], dtype={'listing_id': str})
valid_reviews = reviews_df[reviews_df['date'] >= pd.Timestamp(year=2019, month=1, day=1)]
review_id_counts = valid_reviews['listing_id'].value_counts()
counts_res: pd.DataFrame = pd.merge(listings_df, review_id_counts, left_on='id', right_index=True, how='left').rename(columns={'listing_id': 'review_count'})
counts_res['review_count'] = counts_res['review_count'].fillna(0).astype(np.int64)
counts_res.to_csv(path_or_buf='../out/listing_review_counts.csv', index=False)
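A quick sanity check on the result could look like this (a sketch; counts_res as built above):

# The row count should equal the number of listings, and the sum is the
# total number of reviews from 2019 onwards.
print(len(counts_res), counts_res['review_count'].sum())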
The running time is about 1 second, which means I did hit the target of 5 seconds or less. Yay :)
This approach uses a dictionary to count the reviews, plus the standard csv module. Keep in mind that it will raise an error if a review is for a listing that is not in listings.csv.
import csv
import datetime

with open('../resources/listings.csv') as listings_file:
    reader = csv.DictReader(listings_file)
    listing_review_counts = dict.fromkeys((row['id'] for row in reader), 0)

cutoff_date = datetime.date(2019, 1, 1)

with open('../resources/reviews.csv') as reviews_file:
    reader = csv.DictReader(reviews_file)
    for row in reader:
        rev_date = datetime.datetime.fromisoformat(row['date']).date()
        if rev_date >= cutoff_date:
            listing_review_counts[row['listing_id']] += 1

with open('../out/listing_review_counts_2.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('id', 'review_count'))
    writer.writerows(listing_review_counts.items())
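If reviews for listings missing from listings.csv should be skipped rather than raise a KeyError, the counting loop above could be guarded like this (a sketch reusing the same names):

for row in reader:
    rev_date = datetime.datetime.fromisoformat(row['date']).date()
    # Only count reviews whose listing is actually present in listings.csv.
    if rev_date >= cutoff_date and row['listing_id'] in listing_review_counts:
        listing_review_counts[row['listing_id']] += 1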
This next approach uses collections.Counter and the standard csv module. (A Counter returns 0 for missing keys, so listings with no valid reviews come out as 0 automatically, and reviews for unknown listings don't raise an error.)
import collections as colls
import csv
import datetime

cutoff_date = datetime.date(2019, 1, 1)

with open('../resources/reviews.csv') as reviews_file:
    reader = csv.DictReader(reviews_file)
    review_listing_counts = colls.Counter(
        row['listing_id'] for row in reader
        if datetime.datetime.fromisoformat(row['date']).date() >= cutoff_date)

with open('../resources/listings.csv') as listings_file, \
        open('../out/listing_review_counts_3.csv', 'w', newline='') as out_file:
    reader = csv.DictReader(listings_file)
    listings_ids = (row['id'] for row in reader)
    writer = csv.writer(out_file)
    writer.writerow(('id', 'review_count'))
    writer.writerows((curr_id, review_listing_counts[curr_id]) for curr_id in listings_ids)
Let me know if you have any questions, whether I should add some explanation, etc. :)