I'm trying to extract a URL from inside a div. In this case it's the link with class "company_url" that I'm after.
<div class="links standard">
  <span class="link">
    <a href="https://twitter.com/abacus" class="twitter_url icon_link fontello-twitter" rel="nofollow" target="_blank"></a>
  </span>
  <span class="link">
    <a href="http://www.facebook.com/abacuslabs" class="facebook_url icon_link fontello-facebook" rel="nofollow" target="_blank"></a>
  </span>
  <span class="link">
    <a href="https://www.linkedin.com/company/abacus-labs" class="linkedin_url icon_link fontello-linkedin" rel="nofollow" target="_blank"></a>
  </span>
  <span class="link">
    <a href="http://blog.abacus.com/" class="blog_url icon_link fontello-rss" rel="nofollow" target="_blank"></a>
  </span>
  <span class="link">
    <a href="http://abacus.com" class="company_url" rel="nofollow" target="_blank">abacus.com</a>
  </span>
</div>
I have tested my XPath expressions for finding both the div in the page and the link within the div, so I'm fairly confident they are correct (I used http://www.freeformatter.com/xpath-tester.html#ad-output).
But when I run the code, nothing gets scraped. What am I doing wrong?
from scrapy import Spider
from scrapy.selector import Selector
import datetime
from saas.items import StartupItem


class StackSpider(Spider):
    name = "abacus"
    allowed_domains = ["angel.co"]
    start_urls = [
        "https://angel.co/abacus",
    ]

    def parse(self, response):
        questions = Selector(response).xpath('//div[contains(@class, "links standard")]')
        for question in questions:
            item = StartupItem()
            item['startupurl'] = question.xpath('/span[@class="link"]/a[@class="company_url"]/@href').extract()[0]
            item['source'] = 'angel.co'
            item['datetime'] = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            yield item
Answer 0 (score: 1):
You need to start the expression with a dot and two slashes, so the XPath is evaluated relative to the current node (a leading "/" is an absolute path from the document root):
'.//span[@class="link"]/a[@class="company_url"]/@href'
Once you do that, you will get your URL:
In [2]: from lxml import html
In [3]: x = html.fromstring(h)
In [4]: d = x.xpath('//div[@class="links standard"]')[0]
In [5]: d
Out[5]: <Element div at 0x7f13c0a00208>
In [6]: d.xpath('/span[@class="link"]/a[@class="company_url"]/@href')
Out[6]: []
In [7]: d.xpath('.//span[@class="link"]/a[@class="company_url"]/@href')
Out[7]: ['http://abacus.com']
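For reference, here is a self-contained version of that session, with the relevant part of the question's HTML inlined as the string `h` (trimmed to just the company link):

```python
from lxml import html

# Minimal fragment of the question's markup.
h = '''
<div class="links standard">
  <span class="link">
    <a href="http://abacus.com" class="company_url" rel="nofollow" target="_blank">abacus.com</a>
  </span>
</div>
'''

d = html.fromstring(h).xpath('//div[@class="links standard"]')[0]

# "/span/..." is absolute, so it looks for a <span> at the document root
# and matches nothing.
print(d.xpath('/span[@class="link"]/a[@class="company_url"]/@href'))    # []

# ".//span/..." searches relative to the current <div>.
print(d.xpath('.//span[@class="link"]/a[@class="company_url"]/@href'))  # ['http://abacus.com']
```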
That is the correct XPath, but you also need to set a user agent; if you run view(response) in the scrapy shell,
you will see why.
Add a user agent:
~$ scrapy shell -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36" https://angel.co/abacus
Then run the code above:
In [7]: d = response.xpath('//div[@class="links standard"]')[0]
In [8]: d.xpath('/span[@class="link"]/a[@class="company_url"]/@href').extract_first()
In [9]: d.xpath('.//span[@class="link"]/a[@class="company_url"]/@href').extract_first()
Out[9]: u'http://abacus.com'