I have a very simple task: print all the text inside the anchors on the web page http://subscribe.ru/catalog?rss. Here is my code:
# encoding: utf-8
from lxml import etree
import urllib2
from lxml.html import document_fromstring
data = urllib2.urlopen('http://subscribe.ru/catalog?rss')
S=data.read()
oHTML = document_fromstring(S)
loLinks = oHTML.xpath("//a")
for oLink in loLinks:
    print etree.tostring(oLink)
    sLink = oLink.xpath('string()')[0]
The output is as follows:
C:\Development\Python27\python.exe "D:/Topic Modeling/Playground/delme3.py"
Traceback (most recent call last):
File "D:/Topic Modeling/Playground/delme3.py", line 15, in <module>
<a onclick="rgNav('js_tab_auth');return false;" href="">÷ÈÏÄ ÎÁ ÓÁÊÔ</a>
sLink = oLink.xpath('string()')[0]
<a onclick="rgNav('js_tab_reg');return false;" href="">òÅÇÉÓÔÒÁÃÉÑ </a>
IndexError: string index out of range
<a class="forgot_pass" href="/member/totalrecall">úÁÂÙÌÉ ÐÁÒÏÌØ?</a>
<a class="button_blue_2" id="js_loginFormBut" href="#">÷ÏÊÔÉ</a>
<a class="font_gray link_txd" href="/faq/vereinbarung.html">ÕÓÌÏ×ÉÑ ÐÏÌØÚÏ×ÁÎÉÑ ÓÅÒ×ÉÓÏÍ Subscribe.ru</a>
<a class="button_blue_2" id="js_regFormBut" href="#">îÁÞÁÔØ ÒÅÇÉÓÔÒÁÃÉÀ</a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="auth_email" href="#"><span><i/>Email</span></a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="auth_openid" href="#"><span><i/>OpenID</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="auth_vkontakte" href="#"><span><i/>÷ËÏÎÔÁËÔÅ</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="auth_mailru" href="#"><span><i/>Mail.Ru</span></a>
{#/if}
{#if $P.login_register_tab == 2}
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="reg_email" href="#"><span><i/>Email</span></a>
<a class="rg_btn_soc rg_bs_01 js_tap_panel_selector" action="reg_openid" href="#"><span><i/>OpenID</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="reg_vkontakte" href="#"><span><i/>÷ËÏÎÔÁËÔÅ</span></a>
<a class="rg_btn_soc rg_bs_02 js_tap_panel_selector" action="reg_mailru" href="#"><span><i/>Mail.Ru</span></a>
{#/if}
<a href="" onclick="return false;">òÅÇÉÓÔÒÁÃÉÑ</a>
<a href="" onclick="ajax_recall_code();return false">÷ÙÓÌÁÔØ ÅÝÅ ÒÁÚ</a>
<a href="#" class="button_blue_2" id="js_confirmFormBut">çÏÔÏ×Ï</a>
<a class="green" href="http://subs.link.subscribe.ru/422433"><strong>òÅÚÕÌØÔÁÔÙ ÏÎÌÁÊÎ ÏÐÒÏÓÁ: "óÐÁÍ ÉÌÉ ÎÅ ÓÐÁÍ? ÷ÏÔ × ÞÅÍ ×ÏÐÒÏÓ!"</strong></a>
<a title="Subscribe.Ru" href="/" class="logo"><dfn class="logokanal"/></a>
Process finished with exit code 1
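So the links are extracted, but for some reason the link text cannot be. The output hints at an encoding problem (the anchor contents should consist of human-readable text only). How can I fix this?
Converting the page to UTF-8 first does not help either: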
# encoding: utf-8
from lxml import etree
import urllib2
import chardet
from lxml import html
data = urllib2.urlopen('http://subscribe.ru/catalog?rss')
S=data.read()
encoding = chardet.detect(S)['encoding']
print encoding
if encoding != 'utf-8':
    S = S.decode(encoding,'replace').encode('utf-8')
oHTML = html.fromstring(S)
loLinks = oHTML.xpath("//a")
for oLink in loLinks:
    print etree.tostring(oLink)
    sLink = oLink.xpath('string()')[0]
It fails with the same error.
Thanks in advance for your help!
Answer 0 (score: 1)
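You are getting an IndexError (the problem has nothing to do with the encoding).
If an <a> element has no text (and some of the anchors on that web page do not), oLink.xpath('string()') in your code returns an empty string. Indexing it with oLink.xpath('string()')[0] then raises IndexError: string index out of range, which is the message in your traceback.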
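To make the failure concrete, here is a minimal sketch. It assumes Python 3 and uses plain strings to stand in for the values that `oLink.xpath('string()')` would return; the sample values and the helper name `safe_text` are made up for illustration.

```python
# Plain strings standing in for oLink.xpath('string()') results (assumed values).
samples = ["Email", ""]


def safe_text(value):
    """Return the whole string, or None when it is empty (hypothetical helper)."""
    return value if value else None


for value in samples:
    try:
        first = value[0]  # what the question's code does
        print("first character:", first)
    except IndexError as exc:
        # Reached for the empty string
        print("IndexError:", exc)
    print("safe_text:", safe_text(value))
```

Even for a non-empty anchor, `[0]` yields only the first character of the text, so dropping the index and checking for emptiness is probably closer to what was intended.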
The following code should give you what you want (I think). The HTML page is encoded in KOI8-R. Note that lxml can parse directly from a URL.
from lxml import html
URL = 'http://subscribe.ru/catalog?rss'
parser = html.HTMLParser(encoding="KOI8-R")
content = html.parse(URL, parser)
anchors = content.xpath("//a")
for anchor in anchors:
    text = anchor.text
    if text: # if the anchor is not empty
        print text.encode("utf-8")
The output of this program begins with:
Вход на сайт
Регистрация
Забыли пароль?
Войти
условия пользования сервисом Subscribe.ru
Начать регистрацию
Регистрация
Выслать еще раз
Готово
and ends with:
Спорт
Прогноз погоды
Новости и СМИ
Страны и Регионы
Общество
Дом и семья
Все разделы
ЗАО «Интернет-Проекты»
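As a side check on the KOI8-R claim, the garbled anchor text in the question's output looks like KOI8-R bytes displayed as Latin-1. The sketch below assumes Python 3 and that interpretation of the console output; the helper name `undo_latin1_view` is hypothetical.

```python
def undo_latin1_view(garbled, real_encoding="koi8-r"):
    """Recover text whose bytes were displayed as Latin-1 (hypothetical helper)."""
    raw_bytes = garbled.encode("latin-1")  # get back the original byte values
    return raw_bytes.decode(real_encoding)  # decode them with the real encoding


# Garbled strings copied from the question's output.
print(undo_latin1_view("÷ÈÏÄ ÎÁ ÓÁÊÔ"))  # prints: Вход на сайт
print(undo_latin1_view("÷ÏÊÔÉ"))         # prints: Войти
```

The recovered strings match the first lines of the answer's output, which is consistent with the page being served as KOI8-R.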