Properly unescape HTML characters

Date: 2015-12-11 02:47:22

Tags: python html lxml

I'm trying to scrape a movie synopsis, but I'm having trouble unescaping some troublesome characters:

import requests
from lxml import html

res = requests.get('https://play.google.com/store/tv/show?id=lXH-sW6govE')
node = html.fromstring(res.content)
synopsis = node.xpath("//div[contains(@class, 'details-section') and contains(@class, 'description')]/meta")[0].attrib['content']

u'"Work Out New York" invites viewers to break a sweat with some of New York City\xe2\x80\x99s hottest personal trainers. They may be friends, but these high-end fitness experts compete against each other to earn the business of wealthy patrons and celebrity clientele. With training techniques and fitness regimens constantly evolving, these trainers better shape up or risk losing their clients to their competitors. Romances, jealousies, and bitter rivalries provide the ultimate test of endurance for these fitness fanatics.'

How can I get the correctly encoded synopsis from https://play.google.com/store/tv/show?id=lXH-sW6govE, i.e. ""Work Out New York" invites viewers to break a sweat with some of New York City’s hottest personal trainers. They may be friends, but these high-end fitness experts compete against each other to earn the business of wealthy patrons and celebrity clientele. With training techniques and fitness regimens constantly evolving, these trainers better shape up or risk losing their clients to their competitors. Romances, jealousies, and bitter rivalries provide the ultimate test of endurance for these fitness fanatics."

1 Answer:

Answer 0 (score: 0)

You can use HTMLParser to solve this. Something like:

import HTMLParser  # Python 2 module; on Python 3 use html.unescape instead
print HTMLParser.HTMLParser().unescape(synopsis)
"Work Out New York" invites viewers to break a sweat with some of New York Cityâs hottest personal trainers. They may be friends, but these high-end fitness experts compete against each other to earn the business of wealthy patrons and celebrity clientele. With training techniques and fitness regimens constantly evolving, these trainers better shape up or risk losing their clients to their competitors. Romances, jealousies, and bitter rivalries provide the ultimate test of endurance for these fitness fanatics.
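
For readers on Python 3, the standard library exposes the same entity decoding as `html.unescape` (available since Python 3.4). The following is only a minimal sketch; the `sample` string is made up for illustration and is not the actual page content:

```python
from html import unescape

# Hypothetical input containing named and numeric HTML entities
sample = "&quot;Work Out New York&quot; invites viewers to New York City&#8217;s gyms &amp; studios."

# unescape() converts entities such as &quot;, &amp; and &#8217; to characters
decoded = unescape(sample)
print(decoded)
```

Note that this only decodes HTML entities. It does not change characters that were already mis-decoded before the string reached it.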

More details here: How do I unescape HTML entities in a string in Python 3.1?
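
The stray "â" left in the output above suggests the remaining problem is an encoding one rather than an entity one: the sequence `\xe2\x80\x99` in the question is the UTF-8 encoding of the right single quotation mark (U+2019), read as if each byte were a Latin-1 character. A minimal sketch of one way to undo that, written for Python 3 and assuming the text really was mis-decoded as Latin-1 (the helper name `repair_mojibake` is hypothetical):

```python
def repair_mojibake(text):
    """Re-interpret text whose UTF-8 bytes were wrongly decoded as Latin-1."""
    try:
        # Recover the original bytes, then decode them as UTF-8
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit the Latin-1 mojibake pattern; leave it alone
        return text

# Fragment of the synopsis as shown in the question
broken = "New York City\xe2\x80\x99s hottest personal trainers"
fixed = repair_mojibake(broken)
print(fixed)
```

It may be cleaner to avoid the problem at the source, for example by decoding the response yourself (`res.content.decode('utf-8')`) before passing it to `html.fromstring`, assuming the page is in fact served as UTF-8.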