Question

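I'm a new programmer, and we're working on a graduate English project in which we're trying to parse a gigantic dictionary text file (500 MB). The file is set up with HTML-like tags. I have 179 author tags, e.g. "[A>]Shakes.[/A]" for Shakespeare, and what I need to do is find each occurrence of every tag and then write that tag and what follows it until I get to "[/W]".

My problem is that readlines() gives me a memory error (I assume because the file is too large). I have been able to find a match (but only one), and I can't get the search to look past the first match. Any help that anyone could give would be greatly appreciated.

There are no newlines in the text file, which I think is causing the problem. This problem has been solved. I thought I would include the code that worked: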

with open('/Users/Desktop/Poetrylist.txt', 'w') as output_file, \
        open('/Users/Desktop/2e.txt', 'r') as open_file:
    # Read the whole file as one string (the file has no line breaks).
    the_whole_file = open_file.read()
    start_position = 0
    while True:
        # Find the next author tag, starting where the last match ended.
        start_position = the_whole_file.find('<A>', start_position)
        if start_position < 0:
            break
        start_position += 3  # skip past '<A>'
        end_position = the_whole_file.find('</W>', start_position)
        if end_position < 0:
            break  # no closing tag left, so stop
        output_file.write(the_whole_file[start_position:end_position])
        output_file.write("\n")
        start_position = end_position + 4  # skip past '</W>'

Answer 1

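After opening the file, iterate over its lines like this:

with open('huge_file.txt', 'r') as input_file:
    for input_line in input_file:
        # process the line however you need - consider learning some basic regular expressions
        pass

This lets you read the file one line at a time, as needed, instead of loading it all into memory at once, which makes a large file easy to handle. Note that this only helps if the file actually contains line breaks.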

Answer 2

I don't know regular expressions very well, but you can solve this problem with the string method find() and slicing of each line.

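answer = ''

with open('yourFile.txt', 'r') as open_file, open('output_file', 'w') as output_file:
    for each_line in open_file:
        start_position = each_line.find('[A>]')
        if start_position != -1:
            # Skip past the start tag itself.
            start_position = start_position + len('[A>]')
            # Look for the end tag only after the start tag.
            end_position = each_line.find('[/W]', start_position)
            if end_position == -1:
                continue  # no end tag on this line

            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)

Let me explain what is happening:

Create an empty string with = ''. This will hold each answer.
Use the with ... as statement. This lets you open your file under an alias (I chose open_file). It ensures that the file is closed automatically whether or not your program runs properly.
We use the 'for line in file:' idiom to process the file one line at a time. The 'line' variable can be named anything (e.g. for x in file, for pizza in file) and will always hold the current line as a string. When the loop reaches the end of the file, it stops automatically.
The 'if start_position != -1:' test checks whether the start tag is in the line at all. find() returns -1 when the tag is missing and 0 when the tag is at the very start of the line, so its result has to be compared against -1 rather than used directly as a truth value. If the tag is not there, none of the indented code that follows runs, and the loop moves on to the next line.
We use string slicing to cut out the part of the string we want. We search for the position of the start tag (which we already know is in this line) and then for the position of the stop tag. Once we have both, we can simply cut out the part between them.
I adjusted the positions in two ways. First, I added the length of the start tag to the start position so the slice skips over [A>]: instead of giving '[A>]this is my string...' it gives just 'this is my string...'. Second, I search for the end position by looking for its first occurrence after the [A>] tag, in case the [/W] tag occurs more than once on a line.
We set answer to the string slice plus a newline character ('\n'), so each string appears on its own line. We use the output file's .write('stringToWrite') method to write the strings, one at a time.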
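
The same find()-and-slice idea can be extended to collect every occurrence on a line rather than only the first, by restarting each search where the previous match ended. This is a minimal sketch, assuming the tags are the literal strings '[A>]' and '[/W]'; the helper name extract_all and the sample string are hypothetical and not part of the original answer:

```python
def extract_all(text, start_tag='[A>]', end_tag='[/W]'):
    """Return every substring found between start_tag and end_tag."""
    results = []
    position = 0
    while True:
        start = text.find(start_tag, position)
        if start == -1:
            break  # no more start tags
        start += len(start_tag)  # skip past the start tag itself
        end = text.find(end_tag, start)
        if end == -1:
            break  # start tag without a matching end tag
        results.append(text[start:end])
        position = end + len(end_tag)  # resume after this match
    return results


# Hypothetical sample data with two entries on one line.
sample = '[A>]Shakes.[/A] to be [/W] filler [A>]Milton[/A] lost [/W]'
print(extract_all(sample))
```

Because the search position always moves past the previous end tag, this also works on a file with no line breaks at all, as long as the text fits in memory.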

Answer 3

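You get a memory error with readlines() because, given the size of the file, you are probably reading in more data than your memory can reasonably hold. Since this file is an XML file, you should be able to read through it with iterparse(), which parses the XML lazily without taking up excess memory. Here is some code I use to parse Wikipedia dumps: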

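import xml.etree.ElementTree as ET

# Assumptions (not in the original snippet): the input path is a placeholder,
# and namespace is the '{uri}' prefix of the dump's tags ('' if there is none).
parser = ET.iterparse('huge_file.xml', events=('start', 'end'))
namespace = ''
root = None
count = 0

for event, elem in parser:
    if event == 'start' and root is None:
        # The first element seen is the document root.
        root = elem
    elif event == 'end' and elem.tag == namespace + 'title':
        page_title = elem.text
        # This clears bits of the tree we no longer use.
        elem.clear()
    elif event == 'end' and elem.tag == namespace + 'text':
        page_text = elem.text
        # Clear bits of the tree we no longer use.
        elem.clear()

        # Now let's grab all of the outgoing links and store them in a list.
        key_vals = []

        # Eliminate duplicate outgoing links.
        key_vals = set(key_vals)
        key_vals = list(key_vals)

        count += 1

        if count % 1000 == 0:
            print(str(count) + ' records processed.')
    elif event == 'end' and elem.tag == namespace + 'page':
        # A whole page is finished, so drop everything under the root.
        root.clear()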
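Here is roughly how it works:

We create the parser to move through the document.
As we loop over each element of the document, we look for elements with the tag you are after (in your example, 'A').
We store that data and process it. We clear every element we have finished processing: the tree stays in memory as we move through the document, so we want to remove anything we no longer need.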
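
To make those steps concrete for the author tags in the question, here is a minimal self-contained sketch. It assumes the dictionary file is well-formed XML, which the question does not confirm; the sample document, the enclosing 'E' entry tag, and the helper name iter_authors are all hypothetical:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file would have to be well-formed XML.
SAMPLE = (b"<dict>"
          b"<E><A>Shakes.</A><W>quoted text</W></E>"
          b"<E><A>Milton</A><W>more text</W></E>"
          b"</dict>")


def iter_authors(source):
    """Yield the text of each <A> element while parsing lazily."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == 'A':
            yield elem.text
        elif elem.tag == 'E':
            # The whole entry has been seen; drop its children and text.
            elem.clear()


print(list(iter_authors(io.BytesIO(SAMPLE))))
```

Passing a file path or an open binary file instead of io.BytesIO should work the same way, since iterparse() accepts either a filename or a file object.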

Answer 4

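You should look into a tool called "grep". You give it a pattern to match and a file, and it prints the occurrences in the file, along with line numbers if you want them. It is very useful and can probably be interfaced with Python.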
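
grep itself is a command-line tool. As a rough illustration of what it reports, here is a minimal pure-Python sketch of a grep-like search with line numbers. The function name grep_lines and the sample text are hypothetical, and this is not how grep is implemented:

```python
import io
import re


def grep_lines(pattern, lines):
    """Yield (line_number, line) for each line that matches pattern."""
    regex = re.compile(pattern)
    for line_number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield line_number, line.rstrip('\n')


# Hypothetical sample input standing in for an open text file.
sample = io.StringIO('first line\n<A>Shakes.</A> quote\nlast line\n')
for line_number, line in grep_lines(r'<A>.+?</A>', sample):
    print(line_number, line)
```

Like grep, this works line by line, so it is of limited use on a file that has no line breaks.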

Answer 5

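Instead of parsing the file by hand, why not parse it as XML to get better control over the data? You mentioned that the data is HTML-like, so I assume it can be parsed as an XML document.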

Answer 6

Please test the following code:

import re

regx = re.compile(r'<A>.+?</A>.*?<W>.*?</W>')

# Text mode is used so the str pattern and '\n' work on Python 3.
with open('/Users/Desktop/2e.txt', 'r') as open_file, \
        open('/Users/Desktop/Poetrylist.txt', 'w') as output_file:

    remain = ''

    while True:
        chunk = open_file.read(65536)  # 65536 == 16 x 16 x 16 x 16
        if not chunk:
            break
        text = remain + chunk
        last_end = 0
        for mat in regx.finditer(text):
            output_file.write(mat.group() + '\n')
            last_end = mat.end()
        # Keep the unmatched tail so a record split across two chunks
        # can be completed by the next read.
        remain = text[last_end:]

I couldn't test it, because I have no file to test it on.

Searching a TXT file in Python

6 answers: