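I'm new to Scrapy, and what I'm attempting has turned out to be a bit too ambitious. I'm hoping someone can help guide me on how to collect and parse the information I'm after from this site.
I need to get the following: LABEL1 4810 (this is dynamically generated), Business name, Name, Address1, Address2, Address3, Address4, Postcode, 0800 111111, me@domain.com
Is this even possible with Scrapy?
Many thanks in advance.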
<div class="mbg">
<a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span>
</a>
<span class="mbg-l">
<a href="http://www.domain.com/1" title="FBS">4810</a>
<img
alt="4810"
title="4810"
src="http://www.domain.com/image1"></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
<div class="bsi-cnt">
<div class="bsi-ttl section-ttl">
<h2>Info</h2>
<div class="rd-sep"></div>
</div>
<div class="bsi-bn">Business name</div>
<div class="bsi-cic">
<div id="bsi-ec" class="u-flL">
<span class="bsi-arw"><a href="javascript:;"></a></span>
<span class="bsi-cdt"><a href="javascript:;">Contact details</a></span>
</div>
<div id="e8" class="u-flL bsi-ci">
<div class="bsi-c1">
<div>Name</div>
<div>Address1</div>
<div>Address2</div>
<div>Address3</div>
<div>Address4</div>
<div>Postcode</div>
</div>
<div class="bsi-c2">
<br></br>
<div>
<span class="bsi-lbl">Phone:</span>
<span>0800 111111</span>
</div>
<div>
<span class="bsi-lbl">Email:</span>
<span>me@domain.com</span>
</div>
</div>
</div>
</div>
Answer 0 (score: 1)
An example of parsing a page you have already received might look like this:
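import lxml.html
# placeholder: substitute the HTML of the page you received
page="""<div><span> . . .</span></div> """
doc = lxml.html.document_fromstring(page)
# get the number (4810) from the link inside .mbg-l
label = doc.cssselect('.mbg .mbg-l a')[0].text_content()
# get the name and address lines, one entry per inner <div>
address = [div.text_content() for div in doc.cssselect('.u-flL .bsi-c1 div')]
# get phone: the <span> that follows the first label ("Phone:")
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()
# get mail: the <span> that follows the second label ("Email:")
mail = doc.cssselect('.bsi-c2 .bsi-lbl + span')[1].text_content()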
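To make the extraction concrete, here is a minimal sketch that runs against a trimmed copy of the markup from the question. Note that I've swapped in lxml's XPath support instead of `cssselect` (which needs the separate `cssselect` package installed); `SAMPLE` and `extract` are illustrative names of my own, and the sketch assumes the real page has the same class names as the snippet in the question.

```python
import lxml.html

# Trimmed copy of the markup from the question (assumed to match the real page).
SAMPLE = """
<div class="mbg">
  <a href="http://www.domain.com" aria-label="label1"><span class="nw1">Label13345</span></a>
  <span class="mbg-l"><a href="http://www.domain.com/1" title="FBS">4810</a></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
  <div class="bsi-bn">Business name</div>
  <div class="bsi-c1">
    <div>Name</div>
    <div>Address1</div>
    <div>Address2</div>
    <div>Address3</div>
    <div>Address4</div>
    <div>Postcode</div>
  </div>
  <div class="bsi-c2">
    <div><span class="bsi-lbl">Phone:</span> <span>0800 111111</span></div>
    <div><span class="bsi-lbl">Email:</span> <span>me@domain.com</span></div>
  </div>
</div>
"""


def extract(html):
    """Hypothetical helper: pull the requested fields out of one page."""
    doc = lxml.html.document_fromstring(html)

    # Map each label ("Phone", "Email") to the text of the <span> right after it.
    contact = {}
    for label in doc.xpath('//div[@class="bsi-c2"]//span[@class="bsi-lbl"]'):
        value = label.getnext()  # next sibling element, i.e. the value <span>
        if value is not None:
            key = label.text_content().strip().rstrip(":")
            contact[key] = value.text_content().strip()

    return {
        "number": doc.xpath('string(//span[@class="mbg-l"]/a)').strip(),
        "business_name": doc.xpath('string(//div[@class="bsi-bn"])').strip(),
        # One list entry per line of the name/address block.
        "address": [d.text_content().strip()
                    for d in doc.xpath('//div[@class="bsi-c1"]/div')],
        "phone": contact.get("Phone"),
        "email": contact.get("Email"),
    }


if __name__ == "__main__":
    print(extract(SAMPLE))
```

Keying the contact values by their label text, rather than by position, means the lookup should still work if the page shows only one of the two rows.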
If the page has to be retrieved from the web, you can do it like this:
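import requests
import lxml.html
# 'site_.com' is a placeholder; requests needs the scheme (http:// or https://) in the URL
page = requests.get('http://site_.com')
doc = lxml.html.document_fromstring(page.text)
# the phone number is the <span> that follows the "Phone:" label
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()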