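I am trying to extract movie information from IMDb using requests and etree. I checked response.status_code and it returned 200. However, when I use the XPath copied from Chrome's developer tools, nothing is returned. Can someone help me find out what is wrong?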
import requests
from lxml import etree
base_url = 'https://www.imdb.com/'
movie = 'Ralph Breaks the Internet'
user_agent = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
              'like Gecko) Chrome/36.0.1985.143 Safari/537.36')
headers = {'User-Agent': user_agent}
response = requests.get(base_url,movie,headers=headers)
response.status_code
# returns: 200
selector = etree.HTML(response.content)
selector.text
# returns: '\n '
Answer 0 (score: 1)
There are a couple of problems in your code. First, you should check which URL you are actually requesting.
url = 'https://www.imdb.com/'
movie = 'Ralph Breaks the Internet'
response = requests.get(url,movie,headers=headers)
print(response.url)
# https://www.imdb.com/?Ralph%20Breaks%20the%20Internet
Judging by the variable movie and your description, I think the URL you want is https://www.imdb.com/find?ref=nv_sr_fn&q=Ralph+Breaks+the+Internet&s=all.
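The difference between the two URLs comes down to how the query string is built. As a rough sketch, using only the standard library's urllib.parse rather than requests itself, the snippet below reproduces both URLs: a bare string ends up as the entire query with spaces percent-encoded, while a dict is encoded as key=value pairs. The variable names string_url and dict_url are only for illustration.

```python
from urllib.parse import quote, urlencode

movie = "Ralph Breaks the Internet"

# A bare string becomes the whole query, with spaces percent-encoded.
string_url = "https://www.imdb.com/" + "?" + quote(movie)

# A dict is encoded as key=value pairs, which is what the /find page expects.
params = {"ref": "nv_sr_fn", "q": movie, "s": "all"}
dict_url = "https://www.imdb.com/find" + "?" + urlencode(params)

print(string_url)
print(dict_url)
```

With requests, passing the dict through the params keyword argument should give the second form.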
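Second, your selector does not select anything. selector.text only gives the text directly inside the root element, so you need an XPath query to pick out the nodes you want.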
import requests
from lxml import etree
base_url = 'https://www.imdb.com/find'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36',
}
movie = 'Ralph Breaks the Internet'
params = {
    "ref": "nv_sr_fn",
    "q": movie,
    "s": "all",
}
response = requests.get(base_url,params=params,headers=headers)
print(response.url)
selector = etree.HTML(response.content)
td_a = selector.xpath("//td[@class='result_text']/a")
print(len(td_a))
for ele in td_a:
    print("Movie:{} Year:{} Link:{}".format(ele.text,ele.tail,ele.get("href")))
Output:
https://www.imdb.com/find?ref=nv_sr_fn&q=Ralph+Breaks+the+Internet&s=all
10
Movie:Ralph Breaks the Internet Year: (2018) Link:/title/tt5848272/?ref_=fn_al_tt_1
Movie:Ralph Breaks the Internet Year: (2018) (TV Episode) Link:/title/tt9319312/?ref_=fn_al_tt_2
Movie:Ralph Breaks the Internet Year: (2018) (TV Episode) Link:/title/tt9167490/?ref_=fn_al_tt_3
Movie:Ralph Breaks The Internet: Into The Internet With Ralph and Vanellope Year: (2018) (Short) Link:/title/tt9274902/?ref_=fn_al_tt_4
Movie:Ralph Breaks The Internet: NCM Piece Year: (2018) (Short) Link:/title/tt9274886/?ref_=fn_al_tt_5
Movie:Ralph Breaks The Internet: Slaughter Race Year: (2018) (Short) Link:/title/tt9274952/?ref_=fn_al_tt_6
Movie:Ralph Breaks the Internet Year: (2018) (TV Episode) Link:/title/tt9335792/?ref_=fn_al_tt_7
Movie:Ralph Breaks the Internet Year: (2018) (TV Episode) Link:/title/tt9335920/?ref_=fn_al_tt_8
Movie:Review de "Ralph Breaks the Internet" Year: (2018) (TV Episode) Link:/title/tt9239874/?ref_=fn_al_tt_9
Movie:The Girl in the Spider's Web/Ralph Breaks the Internet Year: (2018) (TV Episode) Link:/title/tt9324796/?ref_=fn_al_tt_10
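To see why the title comes from .text while the year comes from .tail, here is a small offline sketch. It uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra installs; its elements expose the same .text, .tail and .get members. The sample markup is a made-up, simplified stand-in for IMDb's result cells, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the result_text cells.
sample = """<table>
<tr><td class="result_text"> <a href="/title/tt5848272/">Ralph Breaks the Internet</a> (2018) </td></tr>
<tr><td class="result_text"> <a href="/title/tt9274952/">Ralph Breaks The Internet: Slaughter Race</a> (2018) (Short) </td></tr>
</table>"""

root = ET.fromstring(sample)

# .text on the root is only the whitespace before its first child,
# which is why selector.text looked empty in the question.
print(repr(root.text))

# .text is the text inside <a>; .tail is the text that follows </a>.
links = root.findall(".//td[@class='result_text']/a")
rows = [(a.text, a.tail.strip(), a.get("href")) for a in links]
for title, year, href in rows:
    print("Movie:{} Year:{} Link:{}".format(title, year, href))
```

Stripping .tail removes the surrounding spaces that show up after "Year:" in the output above.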
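Answer 1 (score: 0)
Try using etree.parse instead of etree.HTML. Note that etree.parse expects a file name or file-like object rather than raw bytes, so wrap response.content in BytesIO and pass an HTML parser:
import requests
from io import BytesIO
from lxml import etree
# base_url, movie and headers are defined as in the question
response = requests.get(base_url,movie,headers=headers)
if response.status_code == 200:
    selector = etree.parse(BytesIO(response.content), etree.HTMLParser())
    print(selector.getroot().text)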