Compare two text files and find related words in Python

Time: 2015-01-14 06:55:54

Tags: python

I have two text files named search.txt and log.txt, which contain data like the following.

search.txt

19:00:15  , mouse , FALSE
19:00:15  , branded luggage bags and trolley , TRUE
19:00:15  , Leather shoes for men , FALSE
19:00:15  , printers , TRUE
19:00:16  , adidas watches for men , TRUE
19:00:16  , Mobile Charger Stand/Holder black , FALSE
19:00:16  , watches for men , TRUE

log.txt

19:00:00 ,  trakjkfsa,
19:00:00 ,  door,
19:00:00 ,  sweater,
19:00:00 ,  sweater,
19:00:00 ,  sweater,
19:00:00 ,  dis,
19:00:01 ,  not,
19:00:01 ,  nokia,
19:00:01 ,  collar,
19:00:01 ,  nokia,
19:00:01 ,  collar,
19:00:01 ,  gsm,
19:00:01 ,  sweater,
19:00:01 ,  sweater,
19:00:01 ,  gsm,
19:00:02 ,  gsm,
19:00:02 ,  show,
19:00:02 ,  wayfreyerv,
19:00:02 ,  door,
19:00:02 ,  collar,
19:00:02 ,  or,
19:00:02 ,  harman,
19:00:02 ,  women's,
19:00:02 ,  collar,
19:00:02 ,  sweater,
19:00:02 ,  head,
19:00:03 ,  womanw,
19:00:03 ,  com.shopclues.utils.k@42233ff0,
19:00:03 ,  samsu,
19:00:03 ,  adidas,
19:00:03 ,  collar,
19:00:04 ,  ambas,
19:00:04 ,  harman,
19:00:04 ,  mi,
19:00:04 ,  nor,
19:00:04 ,  airtel,
19:00:04 ,  ,
19:00:04 ,  adidas,
19:00:05 ,  harman,
19:00:05 ,  collar,
19:00:05 ,  flip,
19:00:05 ,  brass,
19:00:05 ,  laptop,
19:00:05 ,  collar,
19:00:05 ,  wayfreyer,
19:00:05 ,  head,
19:00:05 ,  adidas,
19:00:05 ,  discn,
19:00:05 ,  head,
19:00:05 ,  adidas,
19:00:05 ,  collar,
19:00:05 ,  collar,
19:00:06 ,  disco,
19:00:06 ,  head,
19:00:06 ,  harman,
19:00:06 ,  nigh,
19:00:06 ,  microsoft,
19:00:06 ,  ambassado,
19:00:07 ,  salwar,
19:00:07 ,  bb,
19:00:07 ,  harman,
19:00:07 ,  ambassador,
19:00:07 ,  ambassador,
19:00:07 ,  salwar,
19:00:08 ,  microsoft,
19:00:08 ,  ac,
19:00:08 ,  jea,
19:00:08 ,  gens, 
19:00:08 ,  ambassador,
19:00:08 ,  orpa,
19:00:09 ,  ac,
19:00:09 ,  black,
19:00:09 ,  asus,
19:00:09 ,  salwar,
19:00:09 ,  salwar,
19:00:09 ,  ac,
19:00:10 ,  whechains,
19:00:10 ,  gens,
19:00:10 ,  ambassador,
19:00:10 ,  sony,
19:00:10 ,  salwa,
19:00:10 ,  ac,
19:00:10 ,  woman,
19:00:10 ,  li,
19:00:11 ,  boxers,
19:00:11 ,  harman,
19:00:11 ,  sal,
19:00:11 ,  ambassador,
19:00:11 ,  sony, 
19:00:11 ,  ,
19:00:11 ,  boxers,
19:00:12 ,  adidas,
19:00:12 ,  samsung,
19:00:12 ,  boxer,
19:00:12 ,  boxers,
19:00:12 ,  com.shopclues.utils.k@427b9538,
19:00:12 ,  harman,
19:00:12 ,  wechains#002,
19:00:12 ,  collar,
19:00:13 ,  collar,
19:00:13 ,  collar,
19:00:13 ,  one,
19:00:13 ,  collar,
19:00:13 ,  ambassador,
19:00:13 ,  hitech,
19:00:13 ,  fanc,
19:00:13 ,  adidas,
19:00:13 ,  bp,
19:00:13 ,  asus,
19:00:13 ,  ambassador,
19:00:13 ,  harman,
19:00:14 ,  lin,
19:00:14 ,  one,
19:00:14 ,  samsung,
19:00:14 ,  cond,
19:00:14 ,  atx,
19:00:15 ,  blackles#002,
19:00:15 ,  woman,
19:00:15 ,  asus,
19:00:15 ,  airtel,
19:00:15 ,  weel,
19:00:15 ,  aenglish,
19:00:15 ,  orpat,
19:00:15 ,  one,
19:00:15 ,  condom,
19:00:15 ,  one,
19:00:15 ,  ling,
19:00:15 ,  fancy,
19:00:15 ,  orpat,
19:00:15 ,  woman,
19:00:19 , watches fo,

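What I need to do is open both files. Starting from the search file, I take the first query from search.txt, then go to log.txt and look for any related queries logged within 60 seconds before or after that query. If anything related to the search query is found, it should be stored in a list and appended to the corresponding line of search.txt.

The output should look like this: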

search.txt

19:00:15  , mouse , FALSE - []
19:00:15  , branded luggage bags and trolley , TRUE - []
19:00:15  , Leather shoes for men , FALSE - []
19:00:15  , printers , TRUE - []
19:00:16  , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas,adidas]
19:00:16  , Mobile Charger Stand/Holder black , FALSE - []
19:00:16  , watches for men , TRUE

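Let's take an example: if "mouse" is a query placed in search.txt at "19:00:15", then I need to go to log.txt and find queries related to "mouse" between "18:59:15" and "19:01:15", i.e. 60 seconds before and after the time in search.txt. If there are any related queries, they should be stored as a list on that line of search.txt.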
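
The window check described above can be sketched as follows. This is only a minimal illustration: the helper name `within_window` is hypothetical, and it assumes both timestamps fall on the same day, since the files carry no date.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"  # same format as the timestamps in both files


def within_window(query_time, log_time, seconds=60):
    """Return True if log_time lies within +/- `seconds` of query_time."""
    query_ts = datetime.strptime(query_time.strip(), TIME_FORMAT)
    log_ts = datetime.strptime(log_time.strip(), TIME_FORMAT)
    return abs(log_ts - query_ts) <= timedelta(seconds=seconds)


print(within_window("19:00:15", "18:59:15"))  # lower edge of the window
print(within_window("19:00:15", "19:01:16"))  # one second past the upper edge
```

Because only the time of day is parsed, a window that crosses midnight would need extra handling.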

Here is the code:

import datetime
from collections import defaultdict


def getting_partial_queries(querylist):
    # Build every prefix (2+ characters) of the full query string.
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist


# Map each logged keyword to the timestamps at which it appears.
queries_time = defaultdict(list)
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries = getting_partial_queries(fields[1].split())
        results = []
        window = datetime.timedelta(seconds=60)
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                # Single check for 60 seconds before or after the query,
                # so a timestamp equal to the query time is not added twice.
                if timestamp - window <= ts <= timestamp + window:
                    results.append(q)
        outputf.write(line.strip() + " , {}\n".format(results))

2 Answers:

Answer 0: (score: 1)

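  1. Read the log.txt file and collect every keyword from it using the split() method and the collections module. The keyword is the second field of each line of the log file.
  2. We now have every keyword together with its count.
  3. Read the search.txt file line by line.
  4. Get the target phrase from each line, i.e. the second field after splitting on ",".
  5. Use filter with a lambda to find the keywords that occur in the phrase selected in step 4.
  6. Get the count for each matching keyword from our dictionary and build the new line with string formatting and the join method.
  7. Write the new line to the output file.
  8. Code: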

    p1 = "/home/infogrid/Desktop/search.txt"
    p2 = "/home/infogrid/Desktop/log.txt"
    p3 = "/home/infogrid/Desktop/search_output.txt"
    
    from collections import Counter
    
    # Count how often each keyword (second field) occurs in the log.
    cnt = Counter()
    with open(p2) as fp:
        for i in fp:
            cnt[i.split(",")[1].strip()] += 1
    # Ignore the empty keyword: "" is a substring of every phrase.
    search_keys = [k for k in cnt if k]
    
    with open(p1) as fp, open(p3, "w") as fp3:
        for i in fp:
            i = i.strip()
            tmp = i.split(",")[1].strip()
            # Keep the keywords that occur as a substring of the phrase.
            tmp1 = filter(lambda x: x in tmp, search_keys)
            fp3.write("%s - [%s]\n" %
                      (i, ",".join([",".join([j] * cnt[j]) for j in tmp1])))
    

    Output:

    19:00:15  , mouse , FALSE - []
    19:00:15  , branded luggage bags and trolley , TRUE - []
    19:00:15  , Leather shoes for men , FALSE - []
    19:00:15  , printers , TRUE - []
    19:00:16  , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas]
    19:00:16  , Mobile Charger Stand/Holder black , FALSE - []
    19:00:16  , watches for men , TRUE - []
    

    Note: try it yourself first.
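
The steps above count keywords across the whole log and do not apply the 60-second window from the question. One way to combine the two is sketched below; the function names `parse_log` and `related_keywords` and the sample rows are illustrative assumptions, not part of the answer's code.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"


def parse_log(lines):
    """Turn 'HH:MM:SS , keyword,' lines into (timestamp, keyword) pairs."""
    entries = []
    for line in lines:
        fields = [part.strip() for part in line.split(",")]
        if len(fields) < 2 or not fields[1]:
            continue  # skip blank lines and rows with an empty keyword
        entries.append((datetime.strptime(fields[0], TIME_FORMAT), fields[1]))
    return entries


def related_keywords(search_line, log_entries, window_seconds=60):
    """Collect logged keywords that occur in the query within the time window."""
    fields = [part.strip() for part in search_line.split(",")]
    query_ts = datetime.strptime(fields[0], TIME_FORMAT)
    query = fields[1]
    window = timedelta(seconds=window_seconds)
    return [
        keyword
        for ts, keyword in log_entries
        if abs(ts - query_ts) <= window and keyword in query
    ]


log_lines = [
    "19:00:03 ,  adidas,",
    "19:00:04 ,  ,",
    "19:00:09 ,  black,",
    "18:00:00 ,  adidas,",  # made-up row, outside the 60-second window
]
entries = parse_log(log_lines)
print(related_keywords("19:00:16  , adidas watches for men , TRUE", entries))
```

Like the answer's code, this matches by substring, so a short logged keyword such as "or" would also match a query containing "for"; comparing against `query.split()` instead would restrict matches to whole words.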

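Answer 1: (score: 1)

Although it is still unclear what exactly is meant by "partial queries", the code below can handle it: just redefine what counts as a partial query in the function filter_out_common_queries. For example, if you want to look for exact matches of the queries in search.txt, you can replace # add your logic here with return [' '.join(querylist), ]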

import datetime as dt
from collections import defaultdict

def filter_out_common_queries(querylist):
    # add your logic here
    return querylist

queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)  

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [ x.strip() for x in line.split(',') ]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries = filter_out_common_queries(fields[1].split())  # "adidas watches for men" -> "adidas" "watches" "for" "men". "for" is a very generic keyword. You should do well to filter these out
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))

Output for your input data:

19:00:15  , mouse , FALSE - []
19:00:15  , branded luggage bags and trolley , TRUE - []
19:00:15  , Leather shoes for men , FALSE - []
19:00:15  , printers , TRUE - []
19:00:16  , adidas watches for men , TRUE - ['adidas', 'adidas', 'adidas', 'adidas', 'adidas', 'adidas']
19:00:16  , Mobile Charger Stand/Holder black , FALSE - ['black']
19:00:16  , watches for men , TRUE - []

Note that a match for 'black' in "Mobile Charger Stand/Holder black" was found. This is because the code above looks up each individual word on its own.
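
Since every word is looked up on its own, generic words such as "for" are worth dropping before the lookup. A minimal sketch of filter_out_common_queries doing that is shown here; the `STOP_WORDS` set is a hypothetical example list, not something taken from the question's data.

```python
# Hypothetical stop-word list; extend it to suit your data.
STOP_WORDS = {"for", "and", "or", "the", "of", "in", "with"}


def filter_out_common_queries(querylist):
    """Drop generic words so that only distinctive keywords are looked up."""
    return [word for word in querylist if word.lower() not in STOP_WORDS]


print(filter_out_common_queries("adidas watches for men".split()))
```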

EDIT: to implement what was asked in the comments, you can redefine filter_out_common_queries like this:

def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2,len(basequery)+1):
        querylist.append(basequery[:n])
    return querylist
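
To see what this redefinition produces, the same prefix logic can be run on one of the sample queries. The standalone name `partial_queries` and its `min_length` parameter are illustrative only; the logic mirrors the function above.

```python
def partial_queries(query, min_length=2):
    """Return every prefix of `query` that has at least `min_length` characters."""
    return [query[:n] for n in range(min_length, len(query) + 1)]


prefixes = partial_queries("watches for men")
print(prefixes[:3])
print("watches fo" in prefixes)
```

The log row "19:00:19 , watches fo," equals one of these prefixes, but it is logged three seconds after the 19:00:16 query, so the 15-second look-back in the code above will not pick it up unless the window is also extended forward in time.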