Python: scraping the href of a link by its class

Time: 2019-12-11 16:49:42

Tags: python web-scraping

I want to scrape an href link using Python 3.

Existing code:

re.findall()

from this code:

import lxml.html
import requests

dom = lxml.html.fromstring(requests.get('https://www.tripadvisor.co.uk/Search?singleSearchBox=true&geo=191&pid=3825&redirect=&startTime=1576072392277&uiOrigin=MASTHEAD&q=the%20grilled%20cheese%20truck&supportedSearchTypes=find_near_stand_alone_query&enableNearPage=true&returnTo=https%253A__2F____2F__www__2E__tripadvisor__2E__co__2E__uk__2F__&searchSessionId=AF4BFA0308CF336B90FD9602FA122CD11576072382852ssid&social_typeahead_2018_feature=true&sid=AF4BFA0308CF336B90FD9602FA122CD11576072410521&blockRedirect=true&ssrc=a&rf=1').content)

# Select the href attribute of every <a class="review_count"> element
result = dom.xpath("//a[@class='review_count']/@href")

print(result)

With my existing code, the printed result is empty.

I have found the link here, in the page source:

<a class="review_count" href="/Restaurant_Review-g54774-d10073153-Reviews-The_Grilled_Cheese_Truck-Rapid_City_South_Dakota.html#REVIEWS" onclick="return false;" data-clicksource="ReviewCount">3 reviews</a>

So I need some help. In this case it would be even better to get the locationId and selectedId and print them.

Any ideas?

1 answer:

Answer 0: (score: 0)

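The problem you are seeing occurs because the data is loaded via JavaScript, so it is not present in the HTML that requests downloads. Try viewing the page with JavaScript disabled to confirm this.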

You could try a tool that can execute JavaScript, e.g. Selenium: https://selenium-python.readthedocs.io/
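
As a rough illustration of the Selenium route, here is a minimal sketch. It assumes Selenium 4 with Chrome and a matching driver installed, and that the links still carry the class review_count once the page has rendered. The function names collect_hrefs and scrape_with_selenium are made up for this example.

```python
def collect_hrefs(driver, css_selector="a.review_count"):
    """Return the href of every element matching css_selector in the rendered page."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css_selector)
    return [element.get_attribute("href") for element in elements]


def scrape_with_selenium(url):
    """Open url in a real browser so its JavaScript runs, then collect the links."""
    # Imported here so collect_hrefs can be used without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        # Let element lookups wait up to 10 seconds for the scripts to add the links.
        driver.implicitly_wait(10)
        driver.get(url)
        return collect_hrefs(driver)
    finally:
        driver.quit()  # always close the browser, even if something fails
```

Calling scrape_with_selenium with the search URL from the question should then return a list of href strings, provided the rendered page really contains those links.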

Alternatively, try to trace where the JavaScript loads the data from, and then request that URL directly with Python requests.
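
A sketch of that second option follows, under several assumptions. DATA_URL is a placeholder, not a real TripAdvisor endpoint; the actual request URL has to be read from the browser's developer tools (Network tab). It also assumes that the response contains the same <a class="review_count"> markup shown in the question, and that the numbers after -g and -d in the href are the ids of interest. How they map to locationId and selectedId is not confirmed here.

```python
import re

# Hypothetical placeholder: replace with the request URL seen in the Network tab.
DATA_URL = "https://example.com/replace-with-the-real-data-request"

# Matches hrefs such as /Restaurant_Review-g54774-d10073153-Reviews-...
# Note: this looks at any href of that shape, not only class="review_count" links.
HREF_IDS = re.compile(r'href="(/[^"]*?-g(\d+)-d(\d+)-[^"]*)"')


def extract_ids(markup):
    """Return (href, g_id, d_id) tuples for every matching link in markup."""
    return [match.groups() for match in HREF_IDS.finditer(markup)]


def fetch_ids(url=DATA_URL):
    """Request the data URL directly and extract the ids from the response text."""
    import requests  # third-party; imported lazily so extract_ids works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_ids(response.text)
```

Applied to the link quoted in the question, extract_ids yields the full href together with the two numbers 54774 and 10073153.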