I have two text files named search.txt and log.txt that contain data like the following.
search.txt
19:00:15 , mouse , FALSE
19:00:15 , branded luggage bags and trolley , TRUE
19:00:15 , Leather shoes for men , FALSE
19:00:15 , printers , TRUE
19:00:16 , adidas watches for men , TRUE
19:00:16 , Mobile Charger Stand/Holder black , FALSE
19:00:16 , watches for men , TRUE
log.txt
19:00:00 , trakjkfsa,
19:00:00 , door,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , sweater,
19:00:00 , dis,
19:00:01 , not,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , nokia,
19:00:01 , collar,
19:00:01 , gsm,
19:00:01 , sweater,
19:00:01 , sweater,
19:00:01 , gsm,
19:00:02 , gsm,
19:00:02 , show,
19:00:02 , wayfreyerv,
19:00:02 , door,
19:00:02 , collar,
19:00:02 , or,
19:00:02 , harman,
19:00:02 , women's,
19:00:02 , collar,
19:00:02 , sweater,
19:00:02 , head,
19:00:03 , womanw,
19:00:03 , com.shopclues.utils.k@42233ff0,
19:00:03 , samsu,
19:00:03 , adidas,
19:00:03 , collar,
19:00:04 , ambas,
19:00:04 , harman,
19:00:04 , mi,
19:00:04 , nor,
19:00:04 , airtel,
19:00:04 , ,
19:00:04 , adidas,
19:00:05 , harman,
19:00:05 , collar,
19:00:05 , flip,
19:00:05 , brass,
19:00:05 , laptop,
19:00:05 , collar,
19:00:05 , wayfreyer,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , discn,
19:00:05 , head,
19:00:05 , adidas,
19:00:05 , collar,
19:00:05 , collar,
19:00:06 , disco,
19:00:06 , head,
19:00:06 , harman,
19:00:06 , nigh,
19:00:06 , microsoft,
19:00:06 , ambassado,
19:00:07 , salwar,
19:00:07 , bb,
19:00:07 , harman,
19:00:07 , ambassador,
19:00:07 , ambassador,
19:00:07 , salwar,
19:00:08 , microsoft,
19:00:08 , ac,
19:00:08 , jea,
19:00:08 , gens,
19:00:08 , ambassador,
19:00:08 , orpa,
19:00:09 , ac,
19:00:09 , black,
19:00:09 , asus,
19:00:09 , salwar,
19:00:09 , salwar,
19:00:09 , ac,
19:00:10 , whechains,
19:00:10 , gens,
19:00:10 , ambassador,
19:00:10 , sony,
19:00:10 , salwa,
19:00:10 , ac,
19:00:10 , woman,
19:00:10 , li,
19:00:11 , boxers,
19:00:11 , harman,
19:00:11 , sal,
19:00:11 , ambassador,
19:00:11 , sony,
19:00:11 , ,
19:00:11 , boxers,
19:00:12 , adidas,
19:00:12 , samsung,
19:00:12 , boxer,
19:00:12 , boxers,
19:00:12 , com.shopclues.utils.k@427b9538,
19:00:12 , harman,
19:00:12 , wechains#002,
19:00:12 , collar,
19:00:13 , collar,
19:00:13 , collar,
19:00:13 , one,
19:00:13 , collar,
19:00:13 , ambassador,
19:00:13 , hitech,
19:00:13 , fanc,
19:00:13 , adidas,
19:00:13 , bp,
19:00:13 , asus,
19:00:13 , ambassador,
19:00:13 , harman,
19:00:14 , lin,
19:00:14 , one,
19:00:14 , samsung,
19:00:14 , cond,
19:00:14 , atx,
19:00:15 , blackles#002,
19:00:15 , woman,
19:00:15 , asus,
19:00:15 , airtel,
19:00:15 , weel,
19:00:15 , aenglish,
19:00:15 , orpat,
19:00:15 , one,
19:00:15 , condom,
19:00:15 , one,
19:00:15 , ling,
19:00:15 , fancy,
19:00:15 , orpat,
19:00:15 , woman,
19:00:19 , watches fo,
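What I need to do is open both files and, starting from the search file, take each query in search.txt in turn, go to log.txt, and look for any related queries logged within 60 seconds before or after that query's time. If anything related to the search query is found, it should be stored in a list and appended to the corresponding line of search.txt.
The output should look like this: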
search.txt
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas,adidas]
19:00:16 , Mobile Charger Stand/Holder black , FALSE - []
19:00:16 , watches for men , TRUE
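Let's take an example: if "mouse" is the query placed at "19:00:15" in search.txt, then the program needs to go to log.txt and find queries related to "mouse" between "18:59:15 - 19:01:15", that is, 60 seconds before and after the time in search.txt. If there are any related queries, it stores them in a list on that line of search.txt.
Here is the code: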
import datetime
from collections import defaultdict


def getting_partial_queries(querylist):
    # Build every prefix (2 or more characters) of the full query string.
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist


# Map each logged query to the list of timestamps at which it occurred.
queries_time = defaultdict(list)
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = datetime.datetime.strptime(fields[0], "%H:%M:%S")
        queries = getting_partial_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                # Keep matches within 60 seconds before or after the search time.
                # A single check avoids appending twice when ts == timestamp.
                if abs(timestamp - ts) <= datetime.timedelta(seconds=60):
                    results.append(q)
        outputf.write(line.strip() + " , {}\n".format(results))
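The "60 seconds before and after" rule from the example can also be checked in isolation. Below is a minimal sketch, assuming both timestamps are HH:MM:SS strings from the same day (it does not handle a window that crosses midnight). The helper name within_window is hypothetical and is not part of the code above.

```python
import datetime


def within_window(search_time, log_time, seconds=60):
    """Return True if log_time is at most `seconds` before or after search_time."""
    fmt = "%H:%M:%S"
    a = datetime.datetime.strptime(search_time, fmt)
    b = datetime.datetime.strptime(log_time, fmt)
    # abs() on a timedelta gives the distance regardless of direction.
    return abs(a - b) <= datetime.timedelta(seconds=seconds)


print(within_window("19:00:15", "18:59:15"))  # True, exactly 60 seconds before
print(within_window("19:00:15", "19:01:16"))  # False, 61 seconds after
```

Using one symmetric comparison also means a log entry with exactly the same timestamp as the search is counted once.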
Answer 0 (score: 1)
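1. Read the log.txt file and use the split() method and the collections module to get all keywords from this file. Target the second field on each line of the log file.
2. Read the search.txt file.
3. Split on , to get the second field.
4. Use filter and lambda to select the log keywords that occur in the selected text.
Code: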
from collections import Counter

p1 = "/home/infogrid/Desktop/search.txt"
p2 = "/home/infogrid/Desktop/log.txt"
p3 = "/home/infogrid/Desktop/search_output.txt"

# Count how often each keyword (second field) occurs in the log file.
cnt = Counter()
with open(p2) as fp:
    for i in fp:
        key = i.split(",")[1].strip()
        if key:  # skip empty keywords, which would match every query
            cnt[key] += 1

search_keys = cnt.keys()
# Text mode ("r"/"w") so that str.split works on Python 3.
with open(p1) as fp, open(p3, "w") as fp3:
    for i in fp:
        i = i.strip()
        tmp = i.split(",")[1].strip()
        # Keep the log keywords that occur as substrings of the query.
        tmp1 = filter(lambda x: x in tmp, search_keys)
        fp3.write("%s - [%s]\n" %
                  (i, ",".join([",".join([j] * cnt[j]) for j in tmp1])))
Output:
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - [adidas,adidas,adidas,adidas,adidas]
19:00:16 , Mobile Charger Stand/Holder black , FALSE - []
19:00:16 , watches for men , TRUE - []
Note: try it yourself first.
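To make the counting and filtering steps concrete, here is a small in-memory sketch of the same idea. It assumes a handful of lines copied from log.txt instead of reading the file, and the variable names are illustrative only.

```python
from collections import Counter

# A few lines in the same "time , keyword," format as log.txt.
log_lines = [
    "19:00:02 , or,",
    "19:00:03 , adidas,",
    "19:00:04 , adidas,",
    "19:00:05 , collar,",
]

# Steps 1 and 3: count the second field of every log line.
cnt = Counter()
for line in log_lines:
    key = line.split(",")[1].strip()
    if key:
        cnt[key] += 1

# Step 4: keep the keywords that occur as substrings of the query.
query = "adidas watches for men"
matched = list(filter(lambda k: k in query, cnt.keys()))

# Repeat each matched keyword as often as it was logged.
expanded = [k for k in matched for _ in range(cnt[k])]
print(expanded)  # ['or', 'adidas', 'adidas']
```

The "or" entry shows a side effect of substring matching: a short keyword matches inside an unrelated word ("for"). This approach also ignores the timestamps, so the 60 second window is not applied.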
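Answer 1 (score: 1)
Although it is still unclear what "partial query" means, the code below does the job; you only need to redefine what counts as a partial query in the function filter_out_common_queries. For example, if you want to look for exact matches of the query in search.txt, you can replace # add your logic here with return [' '.join(querylist), ].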
import datetime as dt
from collections import defaultdict


def filter_out_common_queries(querylist):
    # add your logic here
    return querylist


queries_time = defaultdict(list)  # personally, I'd use 'set' as the default factory
with open('log.txt') as f:
    for line in f:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        queries_time[fields[1]].append(timestamp)

with open('search.txt') as inputf, open('search_output.txt', 'w') as outputf:
    for line in inputf:
        fields = [x.strip() for x in line.split(',')]
        timestamp = dt.datetime.strptime(fields[0], "%H:%M:%S")
        # "adidas watches for men" -> "adidas" "watches" "for" "men".
        # "for" is a very generic keyword. You should do well to filter these out.
        queries = filter_out_common_queries(fields[1].split())
        results = []
        for q in queries:
            poss_timestamps = queries_time[q]
            for ts in poss_timestamps:
                if timestamp - dt.timedelta(seconds=15) <= ts <= timestamp:
                    results.append(q)
        outputf.write(line.strip() + " - {}\n".format(results))
Output based on your input data:
19:00:15 , mouse , FALSE - []
19:00:15 , branded luggage bags and trolley , TRUE - []
19:00:15 , Leather shoes for men , FALSE - []
19:00:15 , printers , TRUE - []
19:00:16 , adidas watches for men , TRUE - ['adidas', 'adidas', 'adidas', 'adidas', 'adidas', 'adidas']
19:00:16 , Mobile Charger Stand/Holder black , FALSE - ['black']
19:00:16 , watches for men , TRUE - []
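Note that a match for 'black' in "Mobile Charger Stand/Holder black" was found. This is because the code above looks up each individual word on its own.
Edit: to implement the comment, you can redefine filter_out_common_queries like this: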
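The code comment above suggests filtering out very generic keywords such as "for". One possible way to do that is sketched below. The STOP_WORDS set is a hypothetical, hand-picked list and not something the answer specifies; adjust it to your data.

```python
# Hypothetical list of generic words to ignore; not part of the original answer.
STOP_WORDS = {"for", "and", "or", "the", "of"}


def filter_out_common_queries(querylist):
    """Drop generic words so that only meaningful keywords are looked up."""
    return [word for word in querylist if word.lower() not in STOP_WORDS]


print(filter_out_common_queries("adidas watches for men".split()))
# ['adidas', 'watches', 'men']
```

The comparison is case-insensitive, but the returned words keep their original case, so they still match the log keys only if the log uses the same case.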
def filter_out_common_queries(querylist):
    basequery = ' '.join(querylist)
    querylist = []
    for n in range(2, len(basequery) + 1):
        querylist.append(basequery[:n])
    return querylist
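To see what this redefinition produces, the short sketch below runs an equivalent version on one query from search.txt. It is an illustration only, assuming that "partial query" means a prefix of the full query at least two characters long.

```python
def filter_out_common_queries(querylist):
    """Return every prefix (2 or more characters) of the joined query."""
    basequery = ' '.join(querylist)
    return [basequery[:n] for n in range(2, len(basequery) + 1)]


prefixes = filter_out_common_queries("watches for men".split())
print(prefixes[:3])              # ['wa', 'wat', 'watc']
print(len(prefixes))             # 14
print("watches fo" in prefixes)  # True
```

With prefixes as lookup keys, the log entry "19:00:19 , watches fo," becomes a candidate match for "watches for men", provided the time window also covers log entries made after the search time.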