I have a folder with 300+ .txt files, with a total size of over 15GB. These files contain tweets, one tweet per line. I have a list of keywords I want to search the tweets for. I created a script that searches every line of every file for every item in the list. If the tweet contains the keyword, the line is written to another file. Here is my code:
# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)
This is the format of each line:
{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"@deerwalkinc 24000+ tweeps bigdata #Team #Genomics http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}
The script works, but it takes a huge amount of time. For ~40 keywords it needs more than 2 hours. Clearly my code is not optimized. What can I do to improve the speed?
P.S. I have read some related questions about searching and speed, but I suspect that the problem in my script lies in the fact that I am using a list of keywords. I have tried some of the suggested solutions, but to no avail.
Answer 0 (score: 1)
If you are willing to rely on an external library (and execution time matters more than the one-off cost of installing it), you could gain some speed by loading each file into a simple Pandas DataFrame and performing the keyword search as a vectorized operation. To get the matching tweets, you could do something like:
import pandas as pd

# Read each line as one raw string; sep and quoting are chosen so that the
# commas and quotes inside the JSON are not treated as delimiters
dataframe_from_text = pd.read_csv("/path/to/file.txt", sep="\x01", header=None,
                                  names=["tweet"], quoting=3)  # 3 = csv.QUOTE_NONE
matched_tweets_index = dataframe_from_text["tweet"].str.contains("keyword_a|keyword_b", case=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # mini DataFrame of matching tweets
# You could loop through these to save them out to a file using the
# .to_dict(orient="records") format.
DataFrame operations in Pandas are very fast, so this may be worth investigating.
It also looks like you are not recording which keyword you matched. If that is the case, you could group your keywords into a single regex query, like so:
keywords_combined = "|".join(keywords)  # build the combined pattern once, outside the loop
for line in f:
    if re.search(keywords_combined, line, re.IGNORECASE):
        db.write(line)
I haven't tested this, but by reducing the number of loops per line it should shave off some time.
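Going one step further, the combined pattern can be compiled once before the file loop, so the regex is not re-parsed for every line. A minimal sketch of the whole search, assuming the same `file_path`, `keywords`, and output file as in the question (wrapped in a hypothetical `search_tweets` helper for illustration):

```python
import os
import re

def search_tweets(file_path, keywords, out_path):
    """Search every file in file_path for any keyword; write matching lines to out_path."""
    # Compile the alternation once; re.escape guards against regex metacharacters in keywords
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    with open(out_path, "w") as db:
        for name in os.listdir(file_path):
            with open(os.path.join(file_path, name), "r") as f:
                for line in f:
                    if pattern.search(line):
                        db.write(line)
```

Using `with` blocks also makes sure every file handle is closed, which the original loop never does.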
Answer 1 (score: 1)
Why is it slow?
You are regex-searching a JSON dump, which is not always a good idea. For example, if your keywords include words like user, time, profile, and image, every single line will produce a match, because the JSON format of a tweet has all of these words as dictionary keys.
Besides the raw JSON being huge, each tweet is more than 1 KB in size (this one is 2.1 KB), but the only relevant part of your sample is:
"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",
This is less than 100 bytes, and despite recent changes to the API, a typical tweet is still under 140 characters.
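To illustrate the difference, parsing each line as JSON and searching only its text field avoids the false matches on keys and profile metadata described above. A rough sketch (the helper name and the sample pattern are just for illustration):

```python
import json
import re

def match_tweet_text(line, pattern):
    """Return True only if the tweet's text field matches, ignoring the rest of the JSON."""
    try:
        tweet = json.loads(line)
    except ValueError:
        return False  # skip malformed lines
    return bool(pattern.search(tweet.get("text", "")))
```

With a keyword like "profile", a whole-line search would match almost every tweet (via keys like profile_image_url), while this text-only check matches only tweets whose text actually contains it.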
Things to try:

- Precompile your regular expressions.
- Option 1: Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched quickly.
- Option 2: Load it into any old database, with the contents of the text field in its own column, so that this column can be searched easily.
- Option 3: Last but not least, extract just the text field into its own file. You can have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB will be shrunk down to about 1GB.
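Option 3 could be done as a one-off preprocessing pass. A sketch, assuming the field layout of the sample tweet above (the `extract_text_fields` helper name and paths are placeholders):

```python
import csv
import json

def extract_text_fields(in_path, out_path):
    """Write screen_name and tweet text from a JSON-per-line tweet file to a CSV."""
    with open(in_path, "r") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip malformed lines
            writer.writerow([tweet.get("user", {}).get("screen_name", ""),
                             tweet.get("text", "")])
```

You pay the JSON-parsing cost once, and every keyword search after that runs over roughly 1GB of text instead of 15GB.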
In short, what you are doing right now is searching the whole farm, when all you need to search is the haystack.