Question

Edit: I can't believe BeautifulSoup actually can't parse HTML properly. I'm probably doing something wrong, but if I'm not, this is a really amateur module.

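I'm trying to get text from the web, but I can't, because I always get some weird characters in most sentences. I never get a sentence containing a word such as "isn't" rendered correctly.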

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = myreq.read()

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.find_all('p')
mytext = ""
for par in textList:
    if len(str(par))<2000:
        print par
        mytext +=" " +  str(par)

print "the text is ", mytext

The result contains some weird characters:

The plural of â€œcomedoâ€? is comedomesâ€?.</p>
Surprisingly, the visible black head isnâ€™t caused by dirt

Obviously I want to get isn't instead of isnâ€™t. What should I do?

Answer 1

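I think the problem is your system's output encoding: it cannot render the encoded characters correctly because they fall outside the character range it can display.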

BeautifulSoup4 is designed to fully support HTML entities.

Notice the strange behaviour of these commands:

>python temp.py
...
ed a blackhead. The plural of ÔÇ£comedoÔÇØ is comedomesÔÇØ.</p>
...

>python temp.py > temp.txt

>cat temp.txt
....
ed a blackhead. The plural of "comedo" is comedomes".</p> <p> </p> <p>Blackheads is an open and wide
....

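I suggest writing your output to a text file, or using a different terminal, or changing your terminal settings so that it supports a wider range of characters.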
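
To illustrate why the same bytes can look fine in a file and broken in a console, here is a minimal sketch. It assumes the page is UTF-8 and that the garbled output comes from reading those bytes with a legacy code page; cp850 and cp1252 are my guesses based on the garbage shown in this thread, not something confirmed for your machine. The helper name `misread` is made up for the example, and the snippet uses Python 3 syntax.

```python
def misread(text, wrong_encoding):
    """Encode text as UTF-8, then decode the bytes with the wrong code page."""
    return text.encode("utf-8").decode(wrong_encoding)


left_quote = "\u201c"   # left curly double quote
apostrophe = "\u2019"   # right curly single quote, as in "isn't"

# Each character is three bytes in UTF-8, so a single-byte code page
# turns it into three unrelated characters.
as_cp850 = misread(left_quote, "cp850")     # assumed console code page
as_cp1252 = misread(apostrophe, "cp1252")   # assumed Windows code page

# Decoding with the encoding the bytes were written in restores the text.
round_trip = left_quote.encode("utf-8").decode("utf-8")
```

Under these assumptions the bytes themselves are never damaged; only the code page used to display them differs.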

Answer 2

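Since this is Python 2, the urllib.urlopen().read() call returns a byte string, most likely encoded in UTF-8. You can look at the HTTP headers to see which encoding is actually declared. I'm assuming UTF-8.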

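You don't decode this external representation before you start processing the content, and that will only end in tears. General rule: decode input immediately, and encode only on output.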
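
As a rough sketch of that rule, the snippet below reads the charset from a Content-Type header value, decodes the body once at the boundary, works on text, and encodes only at the end. The function names `charset_from_content_type` and `decode_body` are hypothetical helpers, the byte string stands in for what `myreq.read()` would return, and the UTF-8 fallback is an assumption. It is written in Python 3 syntax and makes no network call.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode at the input boundary so later code only ever sees text."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Stand-ins for the response body and its Content-Type header.
raw = b"the visible black head isn\xe2\x80\x99t caused by dirt"
text = decode_body(raw, "text/html; charset=UTF-8")

# Process text, not bytes, in the middle of the program...
shouted = text.upper()

# ...and encode only when the data leaves the program.
encoded_out = shouted.encode("utf-8")
```

If the header names no charset, this sketch falls back to UTF-8, which is the same guess made above and may be wrong for a given site.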

Here is your code in working form, with only two modifications:

import urllib2
from BeautifulSoup import BeautifulSoup

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = unicode(myreq.read(), "UTF-8")

#get paragraphs
soup = BeautifulSoup(html)
textList = soup.findAll('p')
mytext = ""
for par in textList:
    if len(str(par))<2000:
        print par
        mytext +=" " +  str(par)

print "the text is ", mytext

All I did was add the unicode decoding of html and use soup.findAll() instead of soup.find_all().

Answer 3

Here is a solution based on the answers people gave here and on my own research.

import html2text
import urllib2
import re
import nltk

useragent = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'}
request = urllib2.Request('SomeURL',None,useragent)
myreq = urllib2.urlopen(request, timeout = 5)
html = myreq.read()
html = html.decode("utf-8")


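# Grab the contents of every bare <p>...</p> pair.
textList = re.findall(r'(?<=<p>).*?(?=</p>)',html, re.MULTILINE|re.DOTALL)
mytext = ""
for par in textList:
    # par is already unicode here, so measure it directly; str(par) would
    # raise UnicodeEncodeError on non-ASCII characters in Python 2.
    if len(par)<2000:
        par = re.sub('<[^<]+?>', '', par)  # strip any nested tags
        mytext +=" " +  html2text.html2text(par)

print "the text is ", mytext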
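
For anyone who wants to see what the two regular expressions do without fetching a page, here is a small self-contained sketch run on an inline sample. The sample HTML, the function name `paragraphs_as_text` and the `max_len` parameter are made up for illustration, html2text is left out so that only the standard library is needed, and the snippet uses Python 3 syntax.

```python
import re

# Same patterns as in the answer: text between a bare <p> and the next </p>,
# and any single tag.
PARAGRAPH = re.compile(r'(?<=<p>).*?(?=</p>)', re.MULTILINE | re.DOTALL)
TAG = re.compile(r'<[^<]+?>')


def paragraphs_as_text(html, max_len=2000):
    """Return the tag-free text of each <p>...</p> shorter than max_len."""
    texts = []
    for par in PARAGRAPH.findall(html):
        if len(par) < max_len:
            texts.append(TAG.sub('', par).strip())
    return texts


# Decode the sample bytes first, then match on text.
sample = (
    b"<p>The plural of \xe2\x80\x9ccomedo\xe2\x80\x9d is <b>comedomes</b>.</p>\n"
    b"<p>It isn\xe2\x80\x99t caused by dirt.</p>"
).decode("utf-8")

result = paragraphs_as_text(sample)
```

Note that the lookbehind only matches a literal `<p>`, so paragraphs written as `<p class="...">` are skipped by this approach.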

Can't correctly convert HTML from a site to text
