I have a Python script that parses email addresses out of large documents. The script uses all the RAM on my machine and locks it up to the point where I have to reboot. I would like to know if there is a way to limit it, or even have it pause for a moment after it finishes reading one file and produces some output. Any help would be greatly appreciated.
#!/usr/bin/env python
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io
import os
import os.path
import re

# Raw strings avoid the original's invalid '\/' and '\s' escape sequences.
regex = re.compile(r"([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`"
                   r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                   r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)")
def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:
        return f.read().lower()  # Case is lowered to prevent regex mismatches.
def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Matches starting with '//' are skipped because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in regex.findall(s) if not email[0].startswith('//'))
not_parsable_files = ['.txt', '.csv']

for root, dirs, files in os.walk('.'):  # Recursively search all subdirectories.
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the file's extension.
        file_path = os.path.join(root, file)
        if file_ext in not_parsable_files:  # Skip extensions on the banned list.
            print("File %s is not parsable" % file_path)
            continue  # Move on to the next file.
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)
Answer 0 (score: 2)
I think you should try the resource module, which lets a process set limits on its own memory use.
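A minimal sketch of what that could look like, assuming Linux and an arbitrary 2 GiB cap (tune the figure to your machine). With the limit in place, a runaway parse raises MemoryError instead of freezing the whole box:

```python
import resource  # Unix-only standard library module

# Cap this process's address space; the 2 GiB value is an assumption.
LIMIT_BYTES = 2 * 1024 ** 3
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (LIMIT_BYTES, hard))
```

You could then wrap the per-file work in a try/except MemoryError so one oversized file is skipped rather than killing the run.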
Answer 1 (score: 1)
It looks like you are using f.read() to read files of up to 8 GB into memory at once. Instead, you can try applying the regex to each line of the file, so the whole file never has to be held in memory.
def get_emails(filename):
    """Yields matched emails from filename, reading one line at a time."""
    with open(filename, encoding='utf-8') as f:  # encoding='utf-8' as in the original
        for line in f:
            for email in regex.findall(line.lower()):
                if not email[0].startswith('//'):
                    yield email[0]  # yielding keeps the file open while iterating

(Note that the function has to yield rather than return a generator expression: a generator returned from inside the with block would try to read from an already-closed file.)
Even so, this will still take quite a while. I also have not checked your regex for any problems of its own.
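The question also asked about pausing after each file, which neither answer shows. A minimal sketch of a per-file loop with a delay between files; the simplified email pattern and the delay value are assumptions for illustration, not the question's full regex:

```python
import re
import time

# Deliberately simplified pattern, only for this sketch.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}")

def emails_in_file(path):
    """Yield email-like strings from one file, line by line."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            yield from EMAIL_RE.findall(line.lower())

def process_files(paths, delay=0.1):
    """Handle files one at a time, sleeping between them so the
    machine gets a breather; the delay value is an assumption."""
    results = []
    for path in paths:
        results.extend(emails_in_file(path))
        time.sleep(delay)  # brief pause before the next file
    return results
```

Combined with line-by-line reading, this keeps both memory use and CPU pressure bounded per file.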