How do I scrape Traditional Chinese text with BeautifulSoup?

Asked: 2015-06-24 13:20:04

Tags: python python-3.x utf-8 beautifulsoup

I'm using BeautifulSoup to scrape Chinese text from this site.

Sometimes it works:

http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13976&Absolute=1
Tsaio上山採藥 輕油水感全效UV防曬精華

Sometimes it doesn't:

http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13996&Absolute=1
MAYBELLINE´A¤ñµY ³z¥Õ¼á²bªø®Ä¢ã¢ä¯»»æ

When I try encoding to UTF-8:

title1 = tds.find("span", attrs={"class": "style1", "itemprop": "brand"})
title2 = tds.find("span", attrs={"class": "style1", "itemprop": "name"})
print((title1.text + title2.text).encode('utf-8'))

I get:

b'MAYBELLINE\xc2\xb4A\xc2\xa4\xc3\xb1\xc2\xb5Y \xc2\xb3z\xc2\xa5\xc3\x95\xc2\xbc\xc3\xa1\xc2\xb2b\xc2\xaa\xc3\xb8\xc2\xae\xc3\x84\xc2\xa2\xc3\xa3\xc2\xa2\xc3\xa4\xc2\xaf\xc2\xbb\xc2\xbb\xc3\xa6'
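Those bytes are a classic mojibake pattern: the page's Big5 bytes were mis-decoded as Latin-1, and the resulting wrong characters were then re-encoded as UTF-8. If a string has already been garbled this way, the mis-step can sometimes be reversed. A sketch, assuming the garbling followed exactly that Latin-1 chain:

```python
# The scraped title is really Big5 bytes that were decoded as Latin-1.
# Encoding back to Latin-1 restores the raw bytes; decoding those as
# Big5 then yields the intended Traditional Chinese text.
garbled = 'MAYBELLINE´A¤ñµY ³z¥Õ¼á²bªø®Ä¢ã¢ä¯»»æ'
repaired = garbled.encode('latin-1').decode('big5')
print(repaired)
```

This is only a salvage step; fixing the decoding at fetch time, so the text never gets garbled in the first place, is the cleaner approach.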

How can I get the correct Chinese text?

EDIT I just switched to Python 3, so I may be making some mistakes. This is how I fetch the HTML:

import urllib.request

contentb = urllib.request.urlopen(urlb).read()
soupb = BeautifulSoup(contentb)
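When the page's charset is known up front, BeautifulSoup can be told about it directly via its `from_encoding` parameter instead of being left to guess. A minimal sketch on a locally constructed Big5 document (the markup here is made up for illustration, not taken from the site):

```python
from bs4 import BeautifulSoup

# A tiny Big5-encoded document standing in for the fetched page.
raw = '<html><body><span class="style1">媚比琳</span></body></html>'.encode('big5')

# from_encoding pins the charset, so the bytes are decoded as Big5
# rather than whatever BeautifulSoup's encoding detection guesses.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='big5')
print(soup.find('span').text)  # 媚比琳
```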

1 Answer:

Answer 0 (score: 0)

As you correctly noticed, the default BeautifulSoup parser does not work in this case; neither does explicitly specifying Big5 (the charset declared in your HTML).

But lxml + BeautifulSoup gets the job done. Note that the soup is initialized with bytes, not unicode.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

http://docs.python-requests.org/en/latest/api/#requests.Response.content

from bs4 import BeautifulSoup
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content  # returns bytes
    soup = BeautifulSoup(page_content, 'lxml')
    title1 = soup.find("span", attrs={"class": "style1", "itemprop": "brand"})
    title2 = soup.find("span", attrs={"class": "style1", "itemprop": "name"})
    print(title1.text + title2.text)

Here is the same solution using XPath, which I prefer :-)

from lxml import html
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

xp1 = "//*[@itemprop='brand']/text()"
xp2 = "//*[@itemprop='brand']/following-sibling::span[1]/text()"

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content
    tree = html.fromstring(page_content)
    title1 = tree.xpath(xp1)  # returns a list!
    title2 = tree.xpath(xp2)
    title = " ".join(title1 + title2)
    print(title)
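As a variation, requests can also do the decoding itself. For `text/html` responses with no explicit charset header, requests falls back to ISO-8859-1 (Latin-1), which is exactly the mis-decode seen in the question, so overriding `Response.encoding` before reading `.text` avoids it. A sketch (the `encoding` override is the documented requests API; whether this page needs it depends on its headers):

```python
import requests

url = ('http://www.fashionguide.com.tw/Beauty/08/MsgL.asp'
       '?LinkTo=TopicL&TopicNum=13996&Absolute=1')
resp = requests.get(url)
resp.encoding = 'big5'   # override requests' Latin-1 fallback with the page's charset
title_html = resp.text   # now a correctly decoded str, not mojibake
```

Passing `resp.content` (bytes) to BeautifulSoup, as the answer does, sidesteps the issue entirely; overriding `resp.encoding` is only needed when you want requests to hand you a decoded string.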