I am trying to extract the email address of each restaurant on TripAdvisor.
I tried the following, but it keeps returning []:
response.xpath('//*[@class= "restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--89flT6"]')
The relevant snippet from the TripAdvisor page is:
<div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6"><span><a href="mailto:info@canopylounge.my?subject=?"><span class="ui_icon email restaurants-detail-overview-cards-LocationOverviewCard__detailLinkIcon--T_k32"></span><span class="restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText--co3ei">Email</span><span class="ui_icon external-link-no-box restaurants-detail-overview-cards-LocationOverviewCard__upLinkIcon--1oVn1"></span></a></span></div>
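Answer 0 (score: 1)
First: the class name in your XPath is wrong. You wrote contactItem--89flT6, but the page has contactItem--1flT6.
Second: the class is on the <div>, but the @href attribute is on the <a>. The <a> is not a direct child of the <div> (there is a <span> in between), so you need // before it:
'//*[@class="..."]//a/@href'
(I left out the class name because it is too long to show here.)
Instead of such a long class name, you could try:
'//a[contains(@href, "mailto")]/@href'
I tested both XPath expressions with lxml: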
text = '''<div class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6">
<span><a href="mailto:info@canopylounge.my?subject=?">
<span class="ui_icon email restaurants-detail-overview-cards-LocationOverviewCard__detailLinkIcon--T_k32"></span>
<span class="restaurants-detail-overview-cards-LocationOverviewCard__detailLinkText--co3ei">Email</span>
<span class="ui_icon external-link-no-box restaurants-detail-overview-cards-LocationOverviewCard__upLinkIcon--1oVn1"></span>
</a></span>
</div>'''
import lxml.html
soup = lxml.html.fromstring(text)
print(soup.xpath('//*[@class="restaurants-detail-overview-cards-LocationOverviewCard__detailLink--iyzJI restaurants-detail-overview-cards-LocationOverviewCard__contactItem--1flT6"]//a/@href'))
print(soup.xpath('//a[contains(@href, "mailto")]/@href'))
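The difference between a direct child and a descendant can also be seen without lxml. The following is only a minimal sketch: it assumes the standard library's xml.etree.ElementTree (which supports a small XPath subset) and uses a shortened copy of the snippet with made-up short class names, so it illustrates the idea rather than reproducing the test above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the snippet from the question; the class names here are
# hypothetical abbreviations, not the real TripAdvisor ones.
snippet = '''<div class="contactItem">
<span><a href="mailto:info@canopylounge.my?subject=?">
<span class="detailLinkText">Email</span>
</a></span>
</div>'''

div = ET.fromstring(snippet)

# "a" only looks at direct children of <div>; the <a> is wrapped in a <span>,
# so nothing matches.
direct_children = div.findall("a")

# ".//a" searches all descendants, which is what "//a" does in the full XPath.
descendants = div.findall(".//a")
hrefs = [a.get("href") for a in descendants]

print(direct_children)
print(hrefs)
```

The first list is empty and the second contains the mailto link, which is why the expression needs // between the <div> and the <a>.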
Answer 1 (score: 0)
Here is one way to do it:
import requests
from scrapy import Selector
site_link = 'https://www.tripadvisor.com/Restaurant_Review-g60713-d11882449-Reviews-Coin_Op_Game_Room-San_Francisco_California.html'
res = requests.get(site_link)
sel = Selector(text=res.text)  # build the selector from the HTML text; res is a requests response, not a Scrapy one
email = sel.xpath("//*[contains(@class,'LocationOverviewCard__contactItem--')]//a[contains(@href,'mailto:')]/@href").get()
email = email.split("mailto:")[1].split("?")[0] if email else ""
print(email)
Output:
info@coinopsf.com
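As an alternative to the two split() calls, the mailto link can be parsed as a URL. This is only a sketch using the standard library's urllib.parse; the helper name email_from_mailto is hypothetical and not part of the answer above.

```python
from urllib.parse import urlsplit, unquote


def email_from_mailto(href):
    """Return the address part of a mailto: link, or "" if href is not one."""
    parts = urlsplit(href or "")
    if parts.scheme != "mailto":
        return ""
    # The address is the path; anything after "?" (subject, body, ...)
    # ends up in parts.query and is ignored here.
    return unquote(parts.path)


print(email_from_mailto("mailto:info@coinopsf.com?subject=?"))
```

This also returns an empty string for None or for links that are not mailto links, which matches the fallback in the code above.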
Answer 2 (score: 0)
Selector also has a .re() method for extracting data with a regular expression.
In [2]: response.xpath('//a[contains(@href, "mailto")]/@href')
Out[2]: [<Selector xpath='//a[contains(@href, "mailto")]/@href' data='mailto:info@coinopsf.com?subject=?'>]
In [3]: response.xpath('//a[contains(@href, "mailto")]/@href').get()
Out[3]: 'mailto:info@coinopsf.com?subject=?'
In [4]: response.xpath('//a[contains(@href, "mailto")]/@href').re('mailto:(.*)\?\w')
Out[4]: ['info@coinopsf.com']
In [5]: response.xpath('//a[contains(@href, "mailto")]/@href').re('mailto:([^?]*)')
Out[5]: ['info@coinopsf.com']
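The two patterns above behave differently when a link has no query string. The sketch below checks this with Python's re module directly rather than through Scrapy, and the second link is a hypothetical example that does not come from the page.

```python
import re

with_query = "mailto:info@coinopsf.com?subject=?"
without_query = "mailto:info@coinopsf.com"  # hypothetical link with no "?subject=..." part

first_pattern = r"mailto:(.*)\?\w"    # requires a "?" followed by a word character
second_pattern = r"mailto:([^?]*)"    # takes everything up to the first "?", if any

results = {
    href: (re.findall(first_pattern, href), re.findall(second_pattern, href))
    for href in (with_query, without_query)
}

for href, (first, second) in results.items():
    print(href, first, second)
```

For the link without a query string, the first pattern finds nothing while the second still returns the address, so the second pattern is the safer choice if some pages omit the subject.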