Using Python lxml + XPath to get a video from a page: I get a list but can't print the result?

Time: 2016-05-05 20:14:02

Tags: python xpath lxml

I'm new to Python and want to use lxml + XPath to get a video link from a web page. What I have so far is:


import urllib2
from lxml import etree

url = u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath = u"//script[contains(.,'label:\"360p\"')]"
html = urllib2.urlopen(url).read()
selector = etree.HTML(html)
get = selector.xpath(xpath)
print get

type(get) tells me it is a list, but when I print get, it unexpectedly shows [<Element script at 0x2a34b88>]. What does this mean, and how do I extract the actual video URL instead of the Element script?

I finally figured out why I ran into this problem, thanks @unutbu.

xpath=u"//script[contains(.,'label:\"360p\"')]"

should be

xpath=u"//script[contains(.,'label:\"360p\"')]//text()"

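Adding text() makes the query return only the text under the selected element, not the element itself. Note the //: it selects every descendant text node, so when the selected element has many child elements the result can contain many strings, and the code that consumes it has to cope with that.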
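The difference between the two queries can be seen on a small inline document. This is only a sketch: SAMPLE_HTML and the example.com URL are made up for illustration and are not the real page's markup.

```python
from lxml import etree

# Hypothetical stand-in for the real page; only the structure matters here.
SAMPLE_HTML = """
<html><body>
<script>var a = 1;</script>
<script>sources: [{label:"360p", file:"http://example.com/v360.mp4"}]</script>
</body></html>
"""

selector = etree.HTML(SAMPLE_HTML)

# Without text(): a list of Element objects.
elements = selector.xpath("""//script[contains(., 'label:"360p"')]""")

# With //text(): a list of strings (the text nodes under the element).
texts = selector.xpath("""//script[contains(., 'label:"360p"')]//text()""")

print(elements)
print(texts)
```

The first print shows Element objects by their repr, while the second shows the script source as ordinary strings that can be searched further.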

1 Answer:

Answer 0: (score: 0)

selector.xpath(xpath) returns a list of tags (or, more accurately, Elements). When you print a list of objects, Python shows the repr of each Element. <Element script at 0x2a34b88> is the repr of the script Element.
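The same printing behavior can be reproduced without lxml. In this minimal sketch, Tag is a hypothetical stand-in class for an Element, not part of any library:

```python
class Tag(object):
    """Hypothetical stand-in for an lxml Element."""

    def __init__(self, name, text):
        self.name = name
        self.text = text

    def __repr__(self):
        return "<Tag %s>" % self.name


tags = [Tag("script", "var x = 1;")]

# Printing a list shows the repr of each item, not its contents.
list_display = str(tags)
print(list_display)

# To see the content, read an attribute of the item itself.
print(tags[0].text)
```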

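If elt is the script Element, then elt.text returns the text inside the <script> tag, but you need something other than lxml to extract the URL from that text. For example, you could use the regular expression pattern r'"(http[^"]+)"' to search for text that begins with "http and continues until another double quote " is found: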

import re
import lxml.html as LH

url = u"http://hkdramas.se/fashion-war-%E6%BD%AE%E6%B5%81%E6%95%99%E4%B8%BB-episode-20/"
xpath = u"""//script[contains(.,'label:"360p"')]"""
root = LH.parse(url)
for elt in root.xpath(xpath):
    for url in re.findall(r'"(http[^"]+)"', elt.text):
        print(url)

This prints:

http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NS71jbj8NVNANTN7N0Nq7Y7FjeN0NojTN47HNcN77_Nhjh7INm7ONLNijCNc7-7UN_NXNCjcNYjeNwNF7uNQNA7dNvNm7-Nr7vNW7-NtjN72N4jVNCN8NfN-NANm7l7rNP7ff5aa877861da31d8cc9dd087d6ce2417fb1308a676a771b787adbffbaa4a0bffNfNHjtj-N6NDNg7HjLND7F7fjMj.jVjKN1N-jMj7NXj7jNNyjTNwjgjmji7INANtNONsN2NvN6jMNaNTNdNlNON8j7N~NEjO7lNyN.jQNaNuN1NYNjjzNnNENUNmNm7Z707dNaNTNFN0N6N8N.NRNuN_7dNtjhjJN-jmNZNpjjNo7fNHjTNNNSNLjMNqNUjN7IN7NPNfNENKN3jT7dNs&link2=
http://hkdramas.se/wp-content/plugins/BSplugin-version-1.2/lib/grab.php?link1=NvNeNVN4N276Nz7JNSjz7lNLNvNV7Ij3Nx7FNn7.Ni7FNU76NDNMN.NqNkNo7QNKNINiNhjPNJjmNKjPNGN.No7B7BNC7Y7B7B7lN67tjb7JNJNT7rNANrNBN7N6Nt7lN1ND0ba06b7bac4bab5fbb42dbff6c27647ea71b4f725a0c73f175eadf3b459424edN0NBNvNZj77wNL7Wj_j_71NnN0jpNfjPNqNvjDN.jEN4NRNDjijejmjXNINqNijEjENKNfNdN3jiNDNOjcNyN4NwNzN4NqNlNqNAjDNQNBN0Nk7a7Rj8NXN_NiN6NFNmNmNLNwNm7YN7j77vNfNpNljw7HjENRjmNMjVNLNEjq7BN0NON57JNyNyjpN8Nbjz7lN-NfNYNMN.7IjD7.NQ&link2=

Note that you don't need to import urllib2. You can pass the URL directly to LH.parse.

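To get only the URL that follows the string '360p', you could use a narrower regular expression that includes that string in the pattern.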
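One possible way to narrow the match to the 360p entry is sketched below. The script_text value and the example.com URLs are assumptions made for illustration; the real page's script may be laid out differently (for instance, with the URL before the label), in which case the pattern would need adjusting.

```python
import re

# Hypothetical script text; the real page's layout may differ.
script_text = (
    'sources: ['
    '{label:"360p", file:"http://example.com/video-360.mp4"}, '
    '{label:"720p", file:"http://example.com/video-720.mp4"}]'
)

# After the literal "360p", skip lazily (without crossing a closing brace)
# to the next quoted string that starts with http, and capture it.
pattern = re.compile(r'"360p"[^}]*?"(http[^"]+)"')

match = pattern.search(script_text)
url_360p = match.group(1) if match else None
print(url_360p)
```

Excluding } in the skipped part keeps the match inside the same entry, so a 360p entry without a URL would not pick up the URL of the next entry.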