Question

[EDITED]

I am using Google App Engine and trying to parse HTML content to extract some information. The code I am using is:

from google.appengine.ext import webapp
from google.appengine.ext.webapp import util
from google.appengine.api import urlfetch
import BeautifulSoup

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = 'http://ascodevida.com/ultimos'
        result = urlfetch.fetch(url=url)
        # ADVs on this page.
        res = BeautifulSoup.BeautifulSoup(result.content).findAll('div', {'class' : 'box story'})
        ADVList = []
        for i in res:
            story = i.find('a', {'class' : 'advlink'}).string
            link = i.find('a', {'class' : 'advlink'})['href']
            ADVData = {
                'adv' : story,
                'link' : link
            }
            ADVList.append(ADVData)

        self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'
        self.response.out.write(ADVList)

This code produces a response with strange characters. I have tried the prettify() and renderContents() methods of the BeautifulSoup library, but they did not help.

Any solutions? Thanks again.

Answer 1

I am a Java developer and I use jsoup for HTML parsing. I found a similar library for Python. It may help you and save you some time.

http://www.crummy.com/software/BeautifulSoup/

Food for thought: Python regular expression for HTML parsing (BeautifulSoup)

Answer 2

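I think you are printing the list directly. Doing so calls repr on each item, and repr escapes non-ASCII characters in hexadecimal form (such as \xe1) by default.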

You can try this:

>>> s = u"Leer más"
>>> repr(s)
"u'Leer m\\xe1s'"

But the print statement outputs the string itself rather than its repr:

>>> print s
Leer más

If you want the correct result, just avoid the list's default behavior and handle each item yourself.
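
As a rough sketch of what "handle each item yourself" could look like: build the response body from the individual strings and encode it as UTF-8 once at the end, rather than writing the list object. The names adv_list and render_items and the example.com links are made up for illustration; adv_list stands in for the ADVList built in the question. No HTML escaping is done here, which real code would probably want.

```python
# Hypothetical sample data standing in for the ADVList from the question.
# u'Leer m\xe1s' is the unicode string "Leer más".
adv_list = [
    {'adv': u'Leer m\xe1s', 'link': u'http://example.com/1'},
    {'adv': u'Otra historia', 'link': u'http://example.com/2'},
]


def render_items(items):
    """Format each item explicitly instead of relying on the list's repr."""
    lines = []
    for item in items:
        # Plain string formatting keeps the real characters; repr() is never
        # called on the text, so no \x.. escapes end up in the output.
        lines.append(u'<p><a href="%s">%s</a></p>' % (item['link'], item['adv']))
    # Encode once, matching the charset=UTF-8 declared in the Content-Type.
    return u'\n'.join(lines).encode('utf-8')


body = render_items(adv_list)
```

In the handler from the question, the last line would then presumably become self.response.out.write(body) in place of self.response.out.write(ADVList).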

Losing encoding when splitting a string
