I'm nearly done writing my first scraper!
I've hit a snag, though: I can't seem to grab the contents of posts that contain a table (in other words, posts that quote another post).
This is the code that pulls post contents out of the soup object. It works just fine:
def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class': 'post_content'}, recursive=True):
            post_contents.append(content.text.strip())
    ...  # Error management
    return post_contents
Here is an example of what I need to scrape (highlighted in yellow):
(The URL, just in case: http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906)
How do I grab the content I highlighted, and why doesn't my current getPost_contents function work in this particular instance? As far as I can tell, the strings still sit under div class="post_content".
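A quick sanity check here is to look at what the parser actually produced for those divs before blaming the extraction logic; a minimal diagnostic sketch, assuming soup0bj is the soup returned by the getHTMLsoup function shown below:

# Diagnostic sketch, not part of the scraper: preview every post_content div
# the parser produced, to see whether the quoted post is present at all.
for div in soup0bj.findAll('div', {'class': 'post_content'}):
    print(len(div.text), repr(div.text[:80]))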
EDIT EDIT EDIT
This is how I'm getting the BeautifulSoup object:
from bs4 import BeautifulSoup as Soup

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    ...  # Error management
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    ...  # Error management
    return soup0bj
EDIT2 EDIT2 EDIT2
These are the relevant parts of the scraper (sorry for the dump!):
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError, URLError
import time, re

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        print('The server hosting {} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        return getHTMLsoup(url)  # retry once the server is back
    except URLError as e:
        print('The webpage found at {} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        return getHTMLsoup(url)  # retry once the page is reachable
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    except AttributeError as e:
        print("Ooops, {}'s HTML structure wasn't detected.".format(url), '\n')
        return None
    return soup0bj
def getMessagetable(soup0bj):
    try:
        soup0bj = (soup0bj)
        messagetable = []
        for data in soup0bj.findAll('tr', {'class': re.compile('message.*')}, recursive=True):
            messagetable.append(data)  # collect each message row
    except AttributeError as e:
        print(' ')
    return messagetable
def getTime_stamps(soup0bj):
    try:
        soup0bj = (soup0bj)
        time_stamps = []
        for stamp in soup0bj.findAll('span', {'class': 'topic_posted'}):
            time_stamps.append(re.search(r'../../20..', stamp.text).group(0))
    except AttributeError as e:
        print('No time-stamps found. Moving on.', '\n')
    return time_stamps
def getHandles(soup0bj):
    try:
        soup0bj = (soup0bj)
        handles = []
        for handle in soup0bj.findAll('span', {'data-id_user': re.compile('.*')}, limit=1):
            handles.append(handle.text)
    except AttributeError as e:
        print("")
    return handles
def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class': 'post_content'}, recursive=True):
            post_contents.append(content.text.strip())
    except AttributeError as e:
        print('Ooops, something has gone wrong!')
    return post_contents
url = 'http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm'
soup = getHTMLsoup(url)
for messagetable in getMessagetable(soup):
    print(getTime_stamps(messagetable), '\n')
    print(getHandles(messagetable), '\n')
    print(getPost_contents(messagetable), '\n')
Answer (score: 2):
The problem is your decoding: it isn't utf-8. If you remove "replace",
your code will error with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 253835: invalid continuation byte
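A minimal way to reproduce that, as a sketch using the same URL as the question:

from urllib.request import urlopen

url = 'http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm'
raw = urlopen(url).read()
html = raw.decode('utf-8')  # raises UnicodeDecodeError on the stray non-utf-8 byte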
The data seems to be latin-1 encoded: decoding as latin-1 raises no errors, but the output does look off in certain parts. Using
html = urlopen(r).read().decode("latin-1")
(where r is the request for the forum URL, as in the snippets below) will work, but, as mentioned, you get strange output like:
"diabète en cas d'accident de la route ou malaise isolÊ ou autre ???"
Another option is to pass an accept-charset header:
from urllib.request import Request, urlopen
headers = {"accept-charset":"utf-8"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
html = urlopen(r).read()
I get the exact same encoding issue using requests and letting it handle the encoding; it is as if the data has mixed encodings, some utf-8 and some latin-1. The headers returned from requests show the content encoding as gzip:
'Content-Encoding': 'gzip'
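For reference, the requests attempt mentioned above looks roughly like this (a sketch; requests is a third-party package and not part of the original code):

import requests

r = requests.get('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm')
print(r.encoding)    # the encoding requests inferred from the response headers
print(r.text[:200])  # decoded with that guess; shows the same odd characters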
If we explicitly ask for gzip and decompress it ourselves:
from urllib.request import Request, urlopen
headers = {"Accept-Encoding":"gzip"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
r = urlopen(r)
import gzip
gzipFile = gzip.GzipFile(fileobj=r)
print(gzipFile.read().decode("latin-1"))
we get the same errors decoding as utf-8 and the same strange output decoding as latin-1. Interestingly, in Python 2 both requests and urllib work fine.
Using chardet:
r = urlopen(r)
import chardet
print(chardet.detect(r.read()))
{'confidence': 0.711104254322944, 'encoding': 'ISO-8859-2'}
it estimates, with roughly 71% confidence, that the data is ISO-8859-2 encoded, but decoding as ISO-8859-2 again produces the same bad output.
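One more detector worth trying in cases like this is BeautifulSoup's own UnicodeDammit, which bs4 uses internally to guess encodings; a minimal sketch, not part of the original answer:

from bs4 import UnicodeDammit
from urllib.request import urlopen

raw = urlopen('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm').read()
dammit = UnicodeDammit(raw)
print(dammit.original_encoding)  # bs4's best guess at the encoding
text = dammit.unicode_markup     # the raw bytes decoded with that guess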