I am trying to use the scrapy shell to scrape only the contact information from a database...
<div class="info-section">
<h3>State(s) Served:</h3>
<p>Nationwide (US)</p> </div>
<div class="info-section">
<h3>Year Founded:</h3>
<p>1985</p> </div>
<div class="info-section">
<h3>Description:</h3>
<p>Corporate tax accounting/consulting. Specialties: 280E Compliance/Planning, Research & Development Tax Credits, Cost Segregation, IRS Representation, Certified Financial Auditing.</p> </div>
<div class="info-section">
<h3>Contact:</h3>
<p><a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="93f1e1eaf2fdd3f0e3f2fef7bdf0fcfe">[email protected]</a> | 847-382-1166 X28</p>
</div>
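I selected the info sections with sel = response.css('.info-section') and can then iterate over the <p> elements, but how do I select only the <h3> tag that contains "Contact" and then get the text of its <p>?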
Answer 0 (score: 1)
If you need the text inside the <p> that comes after the email <a>, you can try the following:
>>> txt = """<div class="info-section">
... <h3>State(s) Served:</h3>
... <p>Nationwide (US)</p> </div>
... <div class="info-section">
... <h3>Year Founded:</h3>
... <p>1985</p> </div>
...
... <div class="info-section">
... <h3>Description:</h3>
... <p>Corporate tax accounting/consulting. Specialties: 280E Compliance/Planning, Research & Development Tax Credits, Cost Segregation, IRS Representation, Certified Financial Auditing.</p> </div>
... <div class="info-section">
... <h3>Contact:</h3>
... <p><a href="/cdn-cgi/l/email-protection" class="__cf_email__" data-cfemail="93f1e1eaf2fdd3f0e3f2fef7bdf0fcfe">[email protected]</a> | 847-382-1166 X28</p>
... </div>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.xpath('//h3[contains(text(), "Contact")]/following-sibling::p/a/following-sibling::text()').get()
u' | 847-382-1166 X28'
Or, shorter, as @Jack Fleeting pointed out:
>>> sel.xpath('//h3[contains(text(), "Contact")]/following-sibling::p/text()').get()
u' | 847-382-1166 X28'