Question

I'm using this solution from elsewhere on Stack Overflow (BeautifulSoup Grab Visible Webpage Text) to get the text of a web page with Beautiful Soup:

import requests
from bs4 import BeautifulSoup

# suppress the InsecureRequestWarning that verify=False would otherwise trigger

from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

# settings

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

url = "http://imfuna.com"

response = requests.get(url, headers=headers, verify=False)

soup = BeautifulSoup(response.text, "lxml")

for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)

front_text_count = len(text.split(" "))
print(front_text_count)
print(text)

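For most websites this works well, but for the example URL above (imfuna.com) it retrieves only 6 words, even though the page contains many more (for example, "digital inspections for residential or commercial property surveyors").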
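One thing worth ruling out is the word count itself. The code joins the chunks with newlines and then splits on a single space, so words separated only by a newline are counted as one. A minimal sketch of the difference, using a made-up sample string and hypothetical helper names (`count_words_by_space`, `count_words`) that are not part of the original code:

```python
def count_words_by_space(text):
    # Mirrors the counting in the question: split on a single space only.
    return len(text.split(" "))


def count_words(text):
    # split() with no argument splits on any run of whitespace,
    # including the newlines used to join the chunks.
    return len(text.split())


# Made-up sample that imitates the newline-joined output of the question's code.
sample = "Home\nAbout us\nContact"

print(count_words_by_space(sample))
print(count_words(sample))
```

This would not explain text that is missing from the output altogether, but it can make the reported count lower than the number of words actually extracted.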

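The example words above are not included in the text output of this code, yet in the page source they sit inside p / h1 tags, so I can't understand why the code doesn't pick them up.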
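A possible way to narrow this down is to check where the phrase goes missing: in the raw HTML that `requests` received, or only after parsing. The sketch below is illustrative only. The helper names (`visible_text`, `locate_phrase`) and the inline sample HTML are assumptions, and it uses Python's built-in `html.parser` backend instead of `lxml` so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup


def visible_text(html, parser="html.parser"):
    # Parse the markup, drop script/style elements, and return the remaining text.
    soup = BeautifulSoup(html, parser)
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def locate_phrase(html, phrase, parser="html.parser"):
    # Report whether the phrase is in the downloaded markup and in the extracted text.
    needle = phrase.lower()
    return {
        "in_raw_html": needle in html.lower(),
        "in_visible_text": needle in visible_text(html, parser).lower(),
    }


# Made-up markup standing in for response.text.
sample_html = """
<html>
  <head><style>p { color: red; }</style></head>
  <body>
    <h1>Digital inspections</h1>
    <p>for property surveyors</p>
    <script>var x = 1;</script>
  </body>
</html>
"""

report = locate_phrase(sample_html, "Digital inspections")
print(report)
```

If `in_raw_html` is false for the real page, the text is probably added by JavaScript after the page loads, and no parser will find it in `response.text`. If it is true but `in_visible_text` is false, it may be worth trying a different parser argument (for example `"html.parser"` or `"html5lib"` in place of `"lxml"`), since the parsers repair malformed markup in different ways.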

Can anyone suggest a way to simply read the plain text from a web page so that this text is picked up correctly?

Thanks!

Get all visible text from a webpage with Beautiful Soup and Python

0 answers: