Question

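I have a folder with more than 300 .txt files, totalling over 15GB. The files contain tweets, one tweet per line. I have a list of keywords that I want to search the tweets for. I wrote a script that checks every line of every file against every item in the list. If a tweet contains a keyword, the line is written to another file. Here is my code: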

# Search each file for every item in keywords
print("Searching the files of " + filename + " for the appropriate keywords...")
for file in os.listdir(file_path):
    f = open(file_path + file, 'r')
    for line in f:
        for key in keywords:
            if re.search(key, line, re.IGNORECASE):
                db.write(line)

This is the format of each line:

{"created_at":"Wed Feb 03 06:53:42 +0000 2016","id":694775753754316801,"id_str":"694775753754316801","text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF","source":"\u003ca href=\"http:\/\/www.facebook.com\/twitter\" rel=\"nofollow\"\u003eFacebook\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":5981342,"id_str":"5981342","name":"Lava Kafle","screen_name":"lkafle","location":"Kathmandu, Nepal","url":"http:\/\/about.me\/lavakafle","description":"@deerwalkinc 24000+ tweeps bigdata  #Team #Genomics  http:\/\/deerwalk.com #Genetic #Testing #population #health #management #BigData #Analytics #java #hadoop","protected":false,"verified":false,"followers_count":24742,"friends_count":23169,"listed_count":1481,"favourites_count":147252,"statuses_count":171880,"created_at":"Sat May 12 04:49:14 +0000 2007","utc_offset":20700,"time_zone":"Kathmandu","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_link_color":"088253","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/677805092859420672\/kzoS-GZ__normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/5981342\/1416802075","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[{"url":"https:\/\/t.co\/MqmDwbCDAF","expanded_url":"http:\/\/fb.me\/Yj1JW9bJ","display_url":"fb.me\/Yj1JW9bJ","indices":[45,68]}],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"low","lang":"en","timestamp_ms":"1454482422661"}

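The script works, but it takes a very long time. For about 40 keywords it needs more than 2 hours. Clearly my code is not optimized. What can I do to improve the speed?

P.S. I have read some related questions about searching and speed, but I suspect the problem in my script lies in the fact that I am using a list of keywords. I have tried some of the suggested solutions, to no avail.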

Answer 1

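1) External library

If you are willing to rely on an external library (and execution time matters more than the one-off cost of installing it), you may gain some speed by loading each file into a simple Pandas DataFrame and running the keyword search as a vectorized operation. To get the matching tweets, you could do something like this: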

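import pandas as pd
# Each line is one JSON object, so read the file as JSON Lines rather than CSV.
dataframe_from_text = pd.read_json("/path/to/file.txt", lines=True)
# str.contains searches anywhere in the string; str.match only anchors at the start.
matched_tweets_index = dataframe_from_text["text"].str.contains("keyword_a|keyword_b", case=False, na=False)
matched_tweets = dataframe_from_text[matched_tweets_index]  # Uses the boolean mask above to filter the full dataframe
# You'd then have a mini dataframe of matching tweets in `matched_tweets`.
# You could loop through these to save them out to a file using the `.to_dict(orient="records")` format.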

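DataFrame operations in Pandas are very fast, so this may be worth investigating.

2) Group the regular expressions

It looks like you are not recording which keyword matched. If so, you can group the keywords into a single regular expression query, like this: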

# Build and compile the combined pattern once, outside the loop.
keywords_combined = re.compile("|".join(keywords), re.IGNORECASE)
for line in f:
    if keywords_combined.search(line):
        db.write(line)

I have not tested this, but reducing the number of loop iterations per line should save some time.
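As a slightly fuller sketch of the same idea, the snippet below compiles the combined pattern once and writes each matching line only once, even if several keywords match. The helper names `build_pattern` and `filter_lines` and the sample data are made up for illustration. It also assumes the keywords are plain words rather than regular expressions, which is why it calls `re.escape`; drop that call if your keywords are meant to be patterns.

```python
import re

def build_pattern(keywords):
    """Combine the keywords into one case-insensitive pattern, compiled once."""
    # re.escape treats each keyword as literal text (an assumption; remove it
    # if the keywords are themselves regular expressions).
    return re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)

def filter_lines(lines, pattern):
    """Yield each line that contains at least one keyword, exactly once."""
    for line in lines:
        if pattern.search(line):
            yield line

# Hypothetical keywords and sample lines, for illustration only.
keywords = ["hadoop", "big data"]
pattern = build_pattern(keywords)
sample = ["I love Hadoop\n", "nothing here\n", "Big Data and hadoop\n"]
matches = list(filter_lines(sample, pattern))
```

In the real script, `lines` would be the open file object and each yielded line would be passed to `db.write`.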

Answer 2

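Why is it slow?

You are running a regex search over a JSON dump, which is not always a good idea. For example, if your keywords include words such as user, time, profile and image, every line will produce a match, because the JSON format of a tweet has all of those words as dictionary keys.

Besides that, the raw JSON is large: each tweet is more than 1KB (this one is 2.1KB), but the only relevant part of your sample is: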

"text":"me with Dibyabhumi Multiple College students https:\/\/t.co\/MqmDwbCDAF",

That is less than 100 bytes, and despite the recent changes to the API, a typical tweet is still under 140 characters.
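To illustrate the point about dictionary keys, here is a minimal sketch using a shortened, made-up tweet in the same shape as the sample. The keywords `profile` and `image` are hypothetical. Searching the raw line matches on the JSON keys, while parsing the line and searching only the `text` value does not.

```python
import json
import re

# Shortened, hypothetical tweet line in the same shape as the sample above.
raw_line = json.dumps({
    "text": "me with Dibyabhumi Multiple College students",
    "user": {"screen_name": "lkafle",
             "profile_image_url": "http://example.com/a.jpg"},
})

pattern = re.compile("profile|image", re.IGNORECASE)

# Searching the raw line hits the JSON keys, not the tweet itself.
raw_hit = bool(pattern.search(raw_line))

# Parsing first and searching only the "text" value avoids that.
tweet = json.loads(raw_line)
text_hit = bool(pattern.search(tweet.get("text", "")))

print(raw_hit, text_hit)  # prints: True False
```

Parsing every line has a cost of its own, so this is mainly about getting correct matches and a much smaller search space. Whether it is faster overall is something to measure on your data.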

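Things to try:

Precompile the regular expressions, as Padraic Cunningham suggested.

Option 1. Load this data into a PostgreSQL JSONB field. JSONB fields are indexable and can be searched quickly.

Option 2. Load it into any old database, with the content of the text field in its own column, so that this column can be searched easily.

Option 3. Last but not least, extract only the text field into its own file. You could have a CSV file where the first column is the screen name and the second is the text of the tweet. Your 15GB would shrink to about 1GB.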
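Option 3 could be sketched roughly as follows, using only the standard library. The function name `extract_text` and the file paths are placeholders. It assumes one JSON object per line and skips any line that does not parse or has no `text` field.

```python
import csv
import json

def extract_text(in_path, out_path):
    """Write one CSV row (screen_name, text) per tweet found in in_path."""
    written = 0
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for line in src:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or malformed lines
            if not isinstance(tweet, dict) or "text" not in tweet:
                continue  # skip records that are not tweets
            user = tweet.get("user") or {}  # "user" may be missing or null
            writer.writerow([user.get("screen_name", ""), tweet["text"]])
            written += 1
    return written
```

You would run this once per input file and then point the keyword search at the much smaller CSV files instead of the raw JSON.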

In short, what you are doing now is searching the whole farm when you only need to search the haystack.
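For option 2, here is a rough sketch that uses SQLite through Python's built-in `sqlite3` module as a stand-in for "any old database". The table name `tweets` and the functions `load_tweets` and `search` are invented for this example. Note that a `LIKE '%word%'` query still scans every row, so the gain comes from scanning a small text column rather than the full JSON; a full-text index would be the next step if your database offers one.

```python
import json
import sqlite3

def load_tweets(conn, lines):
    """Insert the screen name and text of each tweet line into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets (screen_name TEXT, text TEXT)")

    def rows():
        for line in lines:
            try:
                tweet = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or malformed lines
            if isinstance(tweet, dict) and "text" in tweet:
                user = tweet.get("user") or {}
                yield (user.get("screen_name", ""), tweet["text"])

    # executemany accepts an iterator, so rows are streamed, not held in memory.
    conn.executemany("INSERT INTO tweets VALUES (?, ?)", rows())
    conn.commit()

def search(conn, keyword):
    """Return tweet texts containing keyword.

    SQLite's LIKE is case-insensitive for ASCII letters by default;
    '%' and '_' inside keyword act as wildcards.
    """
    cur = conn.execute(
        "SELECT text FROM tweets WHERE text LIKE ?", ("%" + keyword + "%",))
    return [row[0] for row in cur]
```

A typical use would be `conn = sqlite3.connect("tweets.db")`, then `load_tweets(conn, open(path, encoding="utf-8"))` for each file, followed by one `search` call per keyword.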

Using a list in Python to search large files - how to improve speed?