In a Python script

Time: 2017-08-23 10:05:38

Tags: python python-3.x web-scraping css-selectors

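I've written some code in Python to fetch company details and contact names from a webpage, using CSS selectors to collect the items. However, when I run it, I get only the first portion of "Contact Details" and "Contact", up to the first "br" tag, rather than the full string. How can I get the remaining parts as well?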

The script I tried:

import requests
from lxml import html

tree = html.fromstring(requests.get("https://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG8000000314&folderid=1736").text)
for title in tree.cssselect("div.contact-details"):
    cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text
    cContact = title.cssselect("h4:contains('Contact')+p")[0].text
    print(cDetails, cContact)

The element that contains the results:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: Distance Learning Australia Pty Ltd<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:rto@dla.com.au">rto@dla.com.au</a><br>Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p><h4>Address</h4><p>Suite 108A, 49 Phillip Avenue<br>Watson<br>ACT<br>2602</p><h4>Contact</h4><p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:chris.jarrett@dla.com.au">chris.jarrett@dla.com.au</a></p>
            </div>

The result I got:

Company Name: Distance Learning Australia Pty Ltd Name: Christine Jarrett

The result I'm after:

Company Name: Distance Learning Australia Pty Ltd
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: rto@dla.com.au

Name: Christine Jarrett
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: chris.jarrett@dla.com.au

However, my intention is to do the above using selectors only, not xpath. Thanks in advance.

2 Answers:

Answer 0: (score: 1)

Just replace the text attribute with the text_content() method, as below, to get the full text of each element:

cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text_content()
cContact = title.cssselect("h4:contains('Contact')+p")[0].text_content()
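
To illustrate the difference, here is a minimal sketch that parses a shortened copy of the "Contact" paragraph from the question (the inline string and variable names are just for illustration). It suggests that text_content() does return everything, but since "br" elements carry no text, the lines run together without separators:

```python
from lxml import html

# Shortened copy of the <p> that follows <h4>Contact</h4> in the question.
p = html.fromstring(
    "<p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>"
    "Email: <a href='mailto:chris.jarrett@dla.com.au'>chris.jarrett@dla.com.au</a></p>"
)

# .text holds only the text before the first child element (the first <br>).
first_text = p.text
# .text_content() concatenates the text of the element and all its descendants.
full_text = p.text_content()

print(repr(first_text))  # 'Name: Christine Jarrett'
print(repr(full_text))   # 'Name: Christine JarrettPhone: +61 2 6262 2964Email: chris.jarrett@dla.com.au'
```

If the one-line-per-field layout matters, the "br" elements have to be turned into newlines separately.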

Answer 1: (score: 1)

text returns only the first text node. If you want to iterate over all the child nodes and grab the text nodes along the way, use xpath, like:

company_details = title.cssselect("h3:contains('Contact Details')+p")[0]
for node in company_details.xpath("child::node()"):
    print(node)

Result:

Company Name: Distance Learning Australia Pty Ltd
<Element br at 0x7f625419eaa0>
Phone: +61 2 6262 2964
<Element br at 0x7f625419ed08>
Fax: +61 2 6169 3168
<Element br at 0x7f625419e940>
Email: 
<Element a at 0x7f625419e8e8>
<Element br at 0x7f625419eba8>
Web: 
<Element a at 0x7f6254155af8>