Question

I'm trying to extract the text from an HTML page using the usual Beautiful Soup approach. I followed the code from another SO answer.

import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

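This extracts the text correctly for most pages. But for some specific pages, such as the one above, I get newline characters between words within a paragraph.

Result: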

\nAt Orizon, we use our extensive consulting, management, technology and\nengineering capabilities to design, develop,\ntest, deploy, and sustain business and mission-critical solutions to government\nclients worldwide.\nBy using proven management and technology deployment\npractices, we enable our clients to respond faster to opportunities,\nachieve more from their operations, and ultimately exceed\ntheir mission requirements.\nWhere\nconverge\nTechnology & Innovation\n© Copyright 2019 Orizon Inc., All Rights Reserved.\n>'

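In the result there are newlines between technology and\nengineering, develop,\ntest, and so on.

All of this is text from within the same paragraph.

Whereas if we look at it in the HTML source, it looks fine: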

<p>
            At Orizon, we use our extensive consulting, management, technology and 
            engineering capabilities to design, develop, 
        test, deploy, and sustain business and mission-critical solutions to government 
            clients worldwide. 
    </p>
    <p>
            By using proven management and technology deployment 
            practices, we enable our clients to respond faster to opportunities, 
            achieve more from their operations, and ultimately exceed 
            their mission requirements.
    </p>

What is the reason for this? How can I extract the text accurately?

Answer 1

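Instead of splitting the text by lines, you should split it by HTML tags, because for each paragraph and heading you want the text inside it stripped of line breaks.

You can do that by iterating over all the elements of interest (I included p, h2 and h1, but you can extend the list), stripping any newlines from each element's text, and then appending a single newline to the end of the element so that there is a line break before the next element.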
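The newline-stripping step is plain string handling, so it can be looked at in isolation. Below is a minimal sketch using only the standard library; the `raw` string is the first paragraph's text copied from the HTML source in the question, and the variable names are mine.

```python
# Text of one <p> element, copied from the page source shown in the question.
raw = """
            At Orizon, we use our extensive consulting, management, technology and
            engineering capabilities to design, develop,
        test, deploy, and sustain business and mission-critical solutions to government
            clients worldwide.
    """

# Splitting by line keeps the hard wraps of the source as separate pieces,
# which is what produces the unwanted newlines inside a paragraph.
by_line = [line.strip() for line in raw.splitlines() if line.strip()]

# str.split() with no argument splits on any run of whitespace (spaces, tabs,
# newlines), so re-joining with one space gives the paragraph on a single line.
one_line = " ".join(raw.split())

print(len(by_line))
print(one_line)
```

The line-based split yields one piece per wrapped source line, while the whitespace-collapsing join yields a single string with no newline in it.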

Here's a working implementation:

import urllib.request
from bs4 import BeautifulSoup

url = "http://orizon-inc.com/about.aspx"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,'html.parser')

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# put text inside paragraphs and titles on a single line
for p in soup(['h1','h2','p']):
    p.string = " ".join(p.text.split()) + '\n'

text = soup.text
# remove duplicate newlines in the text
text = '\n\n'.join(x for x in text.splitlines() if x.strip())

print(text)

Output sample:

login

About Us

At Orizon, we use our extensive consulting, management, technology and engineering capabilities to design, develop, test, deploy, and sustain business and mission-critical solutions to government clients worldwide.

By using proven management and technology deployment practices, we enable our clients to respond faster to opportunities, achieve more from their operations, and ultimately exceed their mission requirements.

If you don't want a blank line between paragraphs/titles, use:

text = '\n'.join(x for x in text.splitlines() if x.strip())
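To try this approach without fetching the page, here is a minimal sketch that wraps the same steps in a function and runs it on an inline snippet. The function name `html_to_text`, its parameters, and the `sample` HTML are assumptions for illustration and not part of the code above; unlike the code above, it also strips leading indentation from each output line.

```python
from bs4 import BeautifulSoup


def html_to_text(html, block_tags=("h1", "h2", "p"), separator="\n\n"):
    """Return the page text with each block element on a single line."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove script and style elements so their contents are not extracted.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Collapse all whitespace inside each block element, then end it with
    # exactly one newline. Assigning .string replaces the element's children,
    # so inline tags such as <a> or <b> are flattened to plain text.
    for element in soup(list(block_tags)):
        element.string = " ".join(element.get_text().split()) + "\n"

    # Keep the non-blank lines and join them with the chosen separator.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return separator.join(line for line in lines if line)


# Hypothetical sample modelled on the page source from the question.
sample = """
<html><head><style>p { color: red; }</style></head>
<body>
  <h1>About Us</h1>
  <p>
      At Orizon, we use our extensive consulting, management, technology and
      engineering capabilities to design, develop,
  test, deploy, and sustain business and mission-critical solutions.
  </p>
  <p>
      By using proven management and technology deployment
      practices, we enable our clients to respond faster.
  </p>
</body></html>
"""

print(html_to_text(sample))
```

Passing `separator="\n"` gives the compact variant without blank lines between blocks.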

Answer 2

If you only want the content of the paragraph tags, try

paragraph = soup.find('p').getText()
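Two caveats apply to that one-liner: `find` returns only the first matching element, and the text it returns still contains the line breaks of the HTML source. A minimal sketch of one way to handle both, assuming every `<p>` is wanted on a single line (the inline `html` string and the variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with hard-wrapped paragraphs, as in the question.
html = """
<p>
    At Orizon, we use our consulting, management, technology and
    engineering capabilities.
</p>
<p>
    By using proven management and technology deployment
    practices, we enable our clients to respond faster.
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# find() gives the first <p> only, and its text keeps the source newlines.
first = soup.find("p").get_text()

# find_all() gives every <p>; split() plus join() collapses each paragraph's
# whitespace, including the embedded newlines, into single spaces.
paragraphs = [" ".join(p.get_text().split()) for p in soup.find_all("p")]

for paragraph in paragraphs:
    print(paragraph)
```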

