Python multiprocessing: reading a file and updating a dictionary

Time: 2015-10-08 05:19:17

Tags: python dictionary multiprocessing

Let's assume I have a text file with only 2 lines, as follows:

File.txt:

100022441   @DavidBartonWB Guarding Constitution  
100022441   RT @frankgaffney 2nd Amendment Guy. 

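The first column is the user ID and the second column is the user's tweet. I want to read the text file above and update the following dictionary: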

d = {'100022441': {'@frankgaffney': 0, '@DavidBartonWB': 0}}

Here is my code:

def f(line):
    data = line.split('\t')
    uid = data[0]
    tweet = data[1]
    if uid in d.keys():
        for gn in d[uid].keys():
            if gn in tweet:
                return uid, gn, 1
            else:
                return uid, gn, 0
p = Pool(4)
with open('~/File.txt') as source_file:
    for uid, gn, r in p.map(f, source_file):
        d[uid][gn] += r

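So basically I need to read each line of the file and determine whether the user is in my dictionary and, if so, whether the tweet contains any of that user's keys in the dictionary (for example '@frankgaffney' and '@DavidBartonWB'). Given the two lines above, the result should be: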

d = {'100022441': {'@frankgaffney': 1, '@DavidBartonWB': 1}}

But it gives:

d = {'100022441': {'@frankgaffney': 1, '@DavidBartonWB': 0}}

For some reason, the code always misses one of the keys for every user. Any idea what is wrong with my code?

2 Answers:

Answer 0: (score: 0)

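The second column is data[1], not data[2].

The fact that data[2] works means you are splitting into words, not columns.

If you want to search for the user keys as separate words (rather than substrings), you need tweet = data[1:]

If you want to search for substrings, you need to split the line into two parts: uid, tweet = line.split(None, 1)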
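The difference between the two splits can be sketched as follows. This is a minimal illustration, assuming whitespace-separated columns as in the sample file; the variable names `words` and `tweet` are illustrative and not taken from the question.

```python
# Sample line from File.txt (assumed to be separated by whitespace).
line = "100022441   RT @frankgaffney 2nd Amendment Guy. \n"

# Split on every run of whitespace: the tweet becomes a list of words,
# so `in` tests for an exact word.
data = line.split()
uid, words = data[0], data[1:]
print(words)
print('@frankgaffney' in words)   # exact word is present
print('@frank' in words)          # partial name is not a list element

# Split once: the tweet stays a single string,
# so `in` tests for a substring.
uid, tweet = line.split(None, 1)
print('@frankgaffney' in tweet)
print('@frank' in tweet)          # partial name matches as a substring
```

Which form is appropriate depends on whether a key such as '@frank' should also match '@frankgaffney'.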

Answer 1: (score: 0)

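Your file is tab-delimited, and you are always checking the third column for mentions; it works for the first mention because you are passing the entire file to the function rather than each line. So you are effectively doing this: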

>>> s = '100022441\t@DavidBartonWB Guarding Constitution\n100022441\tRT@frankgaffney 2nd Amendment Guy.'
>>> s.split('\t')
['100022441', '@DavidBartonWB Guarding Constitution\n100022441', 'RT@frankgaffney 2nd Amendment Guy.']
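By contrast, splitting each line separately yields exactly two columns per line. A minimal sketch, reusing the string `s` from the session above; `rows` is an illustrative name.

```python
s = '100022441\t@DavidBartonWB Guarding Constitution\n100022441\tRT@frankgaffney 2nd Amendment Guy.'

# Split into lines first, then split each line on its first tab only,
# so every row is a [uid, tweet] pair.
rows = [line.split('\t', 1) for line in s.splitlines()]
for uid, tweet in rows:
    print(uid, '->', tweet)
```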

I would recommend two approaches:

  1. Map your function to each line in the file.
  2. Use regular expressions for a more robust search.

Try this version:

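    import os
    import re
    
    d = {'100022441': {'@frankgaffney': 0, '@DavidBartonWB': 0}}
    e = r'(@\w+)'
    
    def parser(line):
        # Split on the first tab only, so tabs inside the tweet are kept.
        key, _, tweet = line.partition('\t')
        data = d.get(key)
        if data:
            # Count every tracked @mention that appears in the tweet.
            for mention in re.findall(e, tweet):
                if mention in data:
                    data[mention] += 1
    
    # open() does not expand '~', so expand it explicitly.
    with open(os.path.expanduser('~/File.txt')) as f:
        for line in f:
            parser(line)
    
    print(d)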
    

Once you have confirmed that it works, you can multiprocess it:

    import itertools
    import re
    from multiprocessing import Process, Manager
    
    e = r'(@\w+)'
    
    def parse(queue, d, m, lock):
        while True:
            line = queue.get()
            if line is None:
                return  # sentinel: this worker process is done
            key, _, tweet = line.partition('\t')
            data = d.get(key)
            if data:
                for mention in re.findall(e, tweet):
                    if mention in data:
                        # A read-modify-write on a shared dict is not
                        # atomic, so guard the update with a lock.
                        with lock:
                            m[mention] = m.get(mention, 0) + 1
    
    if __name__ == '__main__':
        workers = 2
        manager = Manager()
    
        # Shared dict that collects the counts from all workers
        d2 = manager.dict()
        lock = manager.Lock()
        # Read-only lookup table; a plain dict is copied to each worker
        d = {'100022441': ['@frankgaffney', '@DavidBartonWB']}
    
        queue = manager.Queue(workers)
    
        worker_pool = []
        for i in range(workers):
            p = Process(target=parse, args=(queue, d, d2, lock))
            p.start()
            worker_pool.append(p)
    
        # Fill the queue with data for the workers,
        # followed by one None sentinel per worker
        with open(r'tweets2.txt') as f:
            iters = itertools.chain(f, (None,) * workers)
            for line in iters:
                queue.put(line)
    
        for p in worker_pool:
            p.join()
    
        # items() replaces the Python 2-only iteritems();
        # mentions that were never seen are reported as 0
        for i, data in d.items():
            print('For ID: {}'.format(i))
            for key in data:
                print(' {} - {}'.format(key, d2.get(key, 0)))
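
As an alternative to sharing dictionaries through a Manager, each worker could simply return what it found and leave all counting to the parent process, which is closer to the `Pool.map` approach in the question. The following is a sketch under that assumption, not the method used above; `USERS`, `find_mentions`, and `count_mentions` are hypothetical names, and the sample lines are taken from the question with a tab assumed between the columns.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Lookup table of users and the mentions to count (hypothetical name).
USERS = {'100022441': ['@frankgaffney', '@DavidBartonWB']}
MENTION = re.compile(r'@\w+')


def find_mentions(line):
    """Return the (uid, mention) pairs found in one tab-separated line."""
    uid, _, tweet = line.partition('\t')
    tracked = USERS.get(uid, ())
    return [(uid, m) for m in MENTION.findall(tweet) if m in tracked]


def count_mentions(lines, workers=2):
    """Parse lines in a worker pool and sum the results in the parent."""
    counts = Counter()
    with Pool(workers) as pool:
        for pairs in pool.map(find_mentions, lines):
            counts.update(pairs)
    return counts


if __name__ == '__main__':
    sample = [
        '100022441\t@DavidBartonWB Guarding Constitution\n',
        '100022441\tRT @frankgaffney 2nd Amendment Guy.\n',
    ]
    for (uid, mention), n in sorted(count_mentions(sample).items()):
        print(uid, mention, n)
```

Because the workers never write to shared state, no lock is needed; the trade-off is that every result is pickled and sent back to the parent.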