Question

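I have a weird parsing problem with Python. I need to parse the following text.

I only need the part between the "pre" tag and the columns of numbers (the ones starting with 205 4 164), excluding both. I have several pages in this format.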

<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>

Answer 1

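Quazi, this calls for a regex, specifically <pre>(.+?)(?:\d+\s+){3} with the DOTALL flag enabled.

You can learn how to use regexes in Python at http://docs.python.org/library/re.html, and if you do a lot of this kind of string extraction, you'll be glad you did. Stepping through the regex I provided:

<pre> just directly matches the pre tag
(.+?) matches and captures any characters (lazily, so as few as possible)
(?:\d+\s+){3} matches some digits followed by some whitespace, three times in a row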
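
To make the steps above concrete, here is a minimal sketch that runs the pattern against the sample page from the question. The variable names (page, pattern, header) are just illustrative, and the final strip() is an addition to drop the blank lines around the captured text.

```python
import re

# Sample page copied from the question.
page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

# DOTALL lets "." match newlines, so the lazy group can span several lines.
pattern = re.compile(r"<pre>(.+?)(?:\d+\s+){3}", re.DOTALL)

match = pattern.search(page)
# Group 1 is everything between <pre> and the first row of three numbers.
header = match.group(1).strip() if match else None
print(header)
```

Because the group is lazy, it stops at the first place where three whitespace-separated numbers appear in a row, which in this sample is the "205 4   164" line.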

Answer 2

Here's a regex:

import re

# Raw string so that \d and \s reach the regex engine unchanged.
findData = re.compile(r'(?<=<pre>).+?(?=[\d\s]*</pre>)', re.S)

# ...

result = findData.search(data).group(0).strip()

Here's a demo.
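
For a self-contained run of this lookaround pattern, here is a small sketch against the sample page from the question. The names data, find_data and result are illustrative, and the None check is an addition for pages where nothing matches.

```python
import re

# Sample page copied from the question.
data = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

# Lookbehind anchors the match just after <pre>; the lookahead requires that
# only digits and whitespace remain before </pre>.
find_data = re.compile(r"(?<=<pre>).+?(?=[\d\s]*</pre>)", re.S)

found = find_data.search(data)
result = found.group(0).strip() if found else None
print(result)
```

One thing to watch for: since the lookahead accepts any run of digits and whitespace, a header whose last line ends in a number would lose that trailing number to the lookahead.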

Answer 3

I'd probably use lxml or BeautifulSoup. IMO, regexes are heavily overused, especially for parsing HTML.
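
As a rough illustration of the parser-based route, here is a sketch that uses the standard library's html.parser in place of lxml or BeautifulSoup, so nothing needs to be installed. The PreTextExtractor class is a hypothetical helper, and dropping rows made up only of numbers is an assumption about what counts as the numeric columns.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text found inside <pre> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        # Only keep text that appears between <pre> and </pre>.
        if self._in_pre:
            self.chunks.append(data)


page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

parser = PreTextExtractor()
parser.feed(page)
parser.close()
pre_text = "".join(parser.chunks)

# Keep the non-blank lines, then drop rows that consist only of numbers.
lines = [ln.strip() for ln in pre_text.splitlines() if ln.strip()]
header = [ln for ln in lines if not all(tok.isdigit() for tok in ln.split())]
print(header)
```

The parser takes care of finding the tag boundaries, so the only remaining string work is deciding which lines are data rows.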

Answer 4

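Others have provided regex solutions, which are fine but can sometimes behave unexpectedly.

If the pages are exactly as shown in your example, that is:

No other HTML tags are present, only the <html> and <pre> tags
The number of lines is always consistent
The spacing between lines is always consistent

Then a simple approach like this will do: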

my_text = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164
214 4   164
642 4   164
1   5   164

</pre>
</html>"""

lines = my_text.split("\n")

title   = lines[4]
journal = lines[6]
author  = lines[8]
date    = lines[10]

If you can't guarantee the spacing between lines, but you can guarantee that you only want the first four non-blank lines inside <html><pre>:

import pprint

max_extracted_lines = 4
extracted_lines = []
for line in lines:
    if line == "<html>" or line == "<pre>":
        continue
    if line:
        extracted_lines.append(line)
    if len(extracted_lines) >= max_extracted_lines:
        break

pprint.pprint(extracted_lines)

This gives the output:

['A Short Study of Notation Efficiency',
 'CACM August, 1960',
 'Smith Jr., H. J.',
 'CA600802 JB March 20, 1978  9:02 PM']
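
If the number of header lines can vary from page to page (an assumption, since the question doesn't say), the same plain string handling can stop at the first all-numeric row instead of counting to four. This is only a sketch, and extract_header_lines is a hypothetical name:

```python
def extract_header_lines(page_text):
    """Return the non-blank lines before the first all-numeric row."""
    extracted = []
    for line in page_text.split("\n"):
        stripped = line.strip()
        # Skip blank lines and the opening tags.
        if not stripped or stripped in ("<html>", "<pre>"):
            continue
        # Stop at the closing tag in case a page has no numeric rows.
        if stripped == "</pre>":
            break
        # A data row such as "205 4   164" consists only of digit tokens.
        if all(token.isdigit() for token in stripped.split()):
            break
        extracted.append(stripped)
    return extracted


sample_page = """<html>
<pre>


A Short Study of Notation Efficiency

CACM August, 1960

Smith Jr., H. J.

CA600802 JB March 20, 1978  9:02 PM

205 4   164
210 4   164

</pre>
</html>"""

print(extract_header_lines(sample_page))
```

With several pages, you would call the function once per page's text.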

Don't reach for a regex when simple string manipulation will do.

Complex parsing in Python

4 answers: