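I don't think I did anything wrong while writing my scraper, but when I run it, it neither fetches any data nor raises any error. I'm after three fields: phone, website, and email. The email and website links appear to be hidden, so the XPaths for those two fields are rather messy. Any ideas would be highly appreciated. Here is what I have tried so far: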
import requests
from lxml import html

def startpoint():
    url = "https://www.truelocal.com.au/business/strata-report-sydney/sydney"
    page = requests.get(url, headers={"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"})
    tree = html.fromstring(page.text)
    titles = tree.xpath('//div[@class="column"]')
    for title in titles:
        Phone = title.xpath(".//span[contains(concat(' ', @class, ' '), ' ng-binding ')]/text()")[0]
        Web = title.xpath('.//span[@class="text-frame"]')[0]
        Email = title.xpath('.//a[@class="iconed-text"]/@href')[0]
        print(Phone, Web, Email)

startpoint()
The elements for the fields above:
<div class="column" ng-class="vm.getTabletClass()">
<bdp-details-contact-website listing="vm.listing" contacts="vm.listing.contacts" class="ng-isolate-scope"><!-- ngIf: vm.getHavePrimaryWebsite()==true --><a class="iconed-text link-color-white-bck ng-scope" ng-if="vm.getHavePrimaryWebsite()==true" rel="nofollow" ng-click="vm.bdpEventTracking();">
<span class="icon-holder">
<i class="icon icon-computer-notebook-1"></i>
</span>
<span class="text-frame" ng-class="(vm.getHaveSecondaryWebsites()==true) ? 'with-aditional-item':''">
<span ng-click="vm.openLink(vm.getReadableUrl(vm.getPrimaryWebsite()),'_blank')" role="button" tabindex="0">Visit website</span>
</span>
</a><!-- end ngIf: vm.getHavePrimaryWebsite()==true --> <!-- iconed-text-->
<!-- ngRepeat: contact in vm.getSecondaryWebsites() --> <!-- iconed-text-->
</bdp-details-contact-website>
<a href="" class="iconed-text" ng-show="vm.isContactEmail" aria-hidden="false">
<span class="icon-holder">
<i class="icon icon-email"></i>
</span>
<span class="text-frame emailBusiness">
<span ng-click="vm.emailABusiness($event);" role="button" tabindex="0">Email this business</span>
</span>
</a> <!-- iconed-text-->
<div>
<bdp-details-contact-phone contacts="vm.listing.contacts" priority-number="vm.listing.preferences" class="ng-isolate-scope"><!-- ngRepeat: number in vm.getNumbers() --><!-- ngIf: vm.haveNumbers --><span class="iconed-text ng-scope" ng-if="vm.haveNumbers" ng-repeat="number in vm.getNumbers()">
<span class="icon-holder">
<!-- ngIf: $index==0 --><i class="icon-phone-call-2 ng-scope" ng-if="$index==0"></i><!-- end ngIf: $index==0 -->
</span>
<span class="text-frame">
<!-- ngIf: vm.isMobile -->
<!-- ngIf: !vm.isMobile --><span ng-if="!vm.isMobile" class="ng-binding ng-scope">0421 298 888</span><!-- end ngIf: !vm.isMobile -->
</span>
</span><!-- end ngIf: vm.haveNumbers --><!-- end ngRepeat: number in vm.getNumbers() --><!-- ngIf: vm.haveNumbers --><span class="iconed-text ng-scope" ng-if="vm.haveNumbers" ng-repeat="number in vm.getNumbers()">
<span class="icon-holder">
<!-- ngIf: $index==0 -->
</span>
<span class="text-frame">
<!-- ngIf: vm.isMobile -->
<!-- ngIf: !vm.isMobile --><span ng-if="!vm.isMobile" class="ng-binding ng-scope">0478 151 999</span><!-- end ngIf: !vm.isMobile -->
</span>
</span><!-- end ngIf: vm.haveNumbers --><!-- end ngRepeat: number in vm.getNumbers() --> <!-- iconed-text-->
</bdp-details-contact-phone>
</div>
<div>
<bdp-details-contact-fax contacts="vm.listing.contacts" class="ng-isolate-scope"><!-- ngIf: vm.getHaveFax()==true --> <!-- iconed-text-->
</bdp-details-contact-fax>
</div>
<div>
<bdp-details-abn-acn listing="vm.listing" class="ng-isolate-scope"><!-- ngIf: vm.haveAbn() -->
<!-- ngIf: vm.haveAcn() --></bdp-details-abn-acn>
</div>
</div>
Answer 0 (score: 1)
Analysis:
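If you look at the page source, the body is very simple and contains no <div class="column">. The problem is that the site runs some JavaScript that rewrites the HTML elements, and the content you are looking for is written by that JS. That is why, when you use requests, the response does not contain the fully rendered source: the element <div class="column" ng-class="vm.getTabletClass()"> is not there, so the XPath query returns nothing (an empty list), the loop body never runs, and you get neither output nor an error.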
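The silent failure described above can be reproduced without any network access. The following is a minimal sketch, assuming a made-up shell page (`raw_source` is hypothetical and only stands in for what the server might send before JavaScript runs); it is not the site's actual markup.

```python
from lxml import html

# Hypothetical stand-in for the raw response body: a JavaScript shell page
# whose contact markup has not been rendered yet.
raw_source = """
<html>
  <body>
    <div ng-view></div>
    <script src="app.js"></script>
  </body>
</html>
"""

tree = html.fromstring(raw_source)
titles = tree.xpath('//div[@class="column"]')

visited = []
for title in titles:  # never entered, because titles is an empty list
    visited.append(title)

print(titles)   # the XPath query matched nothing
print(visited)  # so the loop body never ran
```

Both lists come out empty, which matches the symptom in the question: nothing is printed by the loop and no exception is raised.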
Solution:
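1. If you inspect the site in Chrome, you can find div class="column", just like the element in the question, so you may be able to scrape that part from the rendered page. However, your for loop visits every div class="column", and the [0] lookups raise list index out of range whenever a particular child element is missing. You probably only need the first div class="column" to get the phone, website, and email: titles[0].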
2. Alternatively, try a webdriver tool such as selenium, which simulates a real browser and renders the page with JavaScript.
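Option 2 might look roughly like the sketch below. It assumes Selenium 4 with Chrome and a matching driver available on the machine; the function name `fetch_rendered_html`, the 15-second timeout, and the choice to wait for `div.column` are illustrative assumptions, not something taken from the site or the original answer.

```python
def fetch_rendered_html(url, timeout=15):
    """Load url in a real browser and return the HTML after JavaScript has run."""
    # Imported here so the function can be defined even where selenium
    # is not installed; it is only needed when the function is called.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until the JavaScript-rendered contact column exists in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.column"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if the wait times out
```

The returned string can then be passed to `html.fromstring(...)` in place of `page.text`, so the existing XPath queries run against the rendered markup instead of the bare page source.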
BTW: for the website and email, the spans carry ng-click handlers rather than real links, so your code will not work for that part.
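Because some fields may be missing or empty, the bare `[0]` lookups from the question are fragile. One possible guard is a tiny helper like the one below; `first_or_default` is a hypothetical name, and the sample lists only stand in for what `title.xpath(...)` could return.

```python
def first_or_default(results, default=None):
    """Return the first XPath result, or default when the result list is empty."""
    return results[0] if results else default

# Hypothetical stand-ins for the lists returned by title.xpath(...).
phone_hits = ["0421 298 888", "0478 151 999"]
web_hits = []  # nothing matched

phone = first_or_default(phone_hits)
web = first_or_default(web_hits, default="")
print(phone, repr(web))
```

In the scraper this would wrap each query, for example `first_or_default(title.xpath(...))`, so a missing field yields a default value instead of `list index out of range`.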