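I've written a script in Python using Selenium to scrape the website address found under Contact details on a web page. The problem is that no URL is attached to that link (although I can click it).
How can I parse the website link under Contact details?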
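from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.truelocal.com.au/business/vitfit/sydney'

def get_website_link(driver, link):
    driver.get(link)
    # find_element(By.CSS_SELECTOR, ...) is used here because the older
    # find_element_by_css_selector helper was removed in recent Selenium 4 releases
    website = driver.find_element(By.CSS_SELECTOR, "[ng-class*='getHaveSecondaryWebsites'] > span").text
    print(website)

if __name__ == '__main__':
    driver = webdriver.Chrome()
    try:
        get_website_link(driver, URL)
    finally:
        driver.quit()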
When I run the script, all I get is the link's visible text, Visit website.
Answer 0: (score: 1)
The element with the "Visit website" text is a span that carries the JavaScript call vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank') rather than an actual href.
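If your goal is scraping rather than testing, my suggestion is to use the solution below, which relies on the requests package to fetch the data as JSON so you can extract whatever information you need.
The other option is to actually click the link, as you did.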
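Before the requests code, here is a rough sketch of what the click route could look like. It assumes the click opens a new tab (the call passes '_blank') and simply polls the window handles instead of using Selenium's own wait helpers. The function name, the timeout values, and the polling approach are my own assumptions, not something taken from the site or from the question.

```python
import time

# Locator strategy string; it has the same value as selenium's By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def website_via_click(driver, selector, timeout=10.0, poll=0.2):
    """Click the element matched by `selector` and return the URL of the
    tab that opens. `driver` is a selenium WebDriver with the page loaded.
    (Hypothetical helper, written for illustration only.)"""
    known = set(driver.window_handles)
    driver.find_element(CSS_SELECTOR, selector).click()

    deadline = time.monotonic() + timeout
    switched = False
    while time.monotonic() < deadline:
        if not switched:
            # Look for a window handle that did not exist before the click.
            opened = [h for h in driver.window_handles if h not in known]
            if opened:
                driver.switch_to.window(opened[0])
                switched = True
        if switched:
            url = driver.current_url
            # A fresh tab can report about:blank until navigation starts.
            if url and url != "about:blank":
                return url
        time.sleep(poll)
    raise TimeoutError("clicking did not open a tab with a real URL")
```

You would call it after driver.get(link), passing the selector from the question. Keep in mind that the URL you get back is whatever the new tab ended up on, so it may differ from the stored address if the target site redirects.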
import requests
import re
headers = {
    'Referer': 'https://www.truelocal.com.au/business/vitfit/sydney',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/73.0.3683.75 Safari/537.36',
    'DNT': '1',
}
response = requests.get('https://www.truelocal.com.au/www-js/configuration.constant.js?v=1552032205066',
                        headers=headers)
assert response.ok
# extract token from response text
token = re.search("token:\\s'(.*)'", response.text)[1]
headers['Accept'] = 'application/json, text/plain, */*'
headers['Origin'] = 'https://www.truelocal.com.au'
response = requests.get(f'https://api.truelocal.com.au/rest/listings/vitfit/sydney?&passToken={token}', headers=headers)
assert response.ok
# use response.text to get full json as text and see what information can be extracted.
contact = response.json()["data"]["listing"][0]["contacts"]["contact"]
website = list(filter(lambda x: x["type"] == "website", contact))[0]["value"]
print(website)
print("the end")
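The two extraction steps above (the token regex and the contact filter) raise an exception if the token is missing or the listing has no website entry. A slightly more defensive sketch is shown here. The helper names and the sample inputs are made up for illustration, and the JSON shape is only what the code above already reads. The regex is also a small variation: it matches non-greedily and allows any amount of whitespace after `token:`.

```python
import re

# Non-greedy capture so the match stops at the first closing quote.
TOKEN_RE = re.compile(r"token:\s*'(.*?)'")


def extract_token(js_text):
    """Return the token found in the configuration script text, or None."""
    match = TOKEN_RE.search(js_text)
    return match.group(1) if match else None


def extract_website(payload):
    """Return the first contact value of type 'website', or None."""
    try:
        contacts = payload["data"]["listing"][0]["contacts"]["contact"]
    except (KeyError, IndexError, TypeError):
        return None
    for entry in contacts:
        if entry.get("type") == "website":
            return entry.get("value")
    return None


# Made-up sample inputs shaped like the fields the code above reads.
sample_js = "constant('config', {token: 'abc123', env: 'prod'});"
sample_payload = {
    "data": {
        "listing": [
            {
                "contacts": {
                    "contact": [
                        {"type": "phone", "value": "0000 000 000"},
                        {"type": "website", "value": "http://www.example.com"},
                    ]
                }
            }
        ]
    }
}

print(extract_token(sample_js))         # abc123
print(extract_website(sample_payload))  # http://www.example.com
```

In the script above you would pass response.text to extract_token and response.json() to extract_website, then check each result for None before using it.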