Scraping the values of certain search results using selectors

Time: 2017-08-22 22:03:42

Tags: python python-3.x web-scraping css-selectors

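Running the script I've written in Python, I can get the names perfectly. For the phone number and email address, however, I get the labels "Ph." and "Email" as results, as shown below, instead of their values. How can I get the values that follow "Ph." and "Email" using selectors?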

The result I'm getting:

arkLAB Architecture Ph. Email
Conrad Gargett Ph. Email
MONDO ARCHITECTS Ph. Email

The script I tried:

import requests 
from lxml import html

main_url = "http://www.findanarchitect.com.au/index.php"

def get_content(link):

    payload = {'action':'show_search_result','action_spam':'dDfgEr','txtSearchType':5,'txtPracName':'','optSstate':3,'optRegions':23,'txtPcode':'','txtShowBuildingType':0,'optBuildingType':1,'optHomeType':1,'optBudget':''}
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36'}
    tree = html.fromstring(requests.post(link, data = payload, headers = headers).text)

    for title in tree.cssselect("div#searchresultaplus"):
        names = title.cssselect("h2")[0].text
        phone = title.cssselect("div p > strong:contains('ph.')")[0].text
        email = title.cssselect("div p > strong:contains('Email')")[0].text
        print(names, phone, email)

get_content(main_url)

The element in which the values are located:

<div id="searchresultsapluscont">    
        <h2>Hugh Gordon Architect P/L</h2>
            <div id="archdetails">
            <div style="float:left">
                <p>
                    Unit 5/6 Lonsdale Street <br>
                    BRADDON ACT 2612
                </p>
                <p>
                    <strong>Ph.</strong> 02 6253 4448<br>
                     <strong>Email</strong> info@hughgordon.com.au
                </p>
            </div>
            <div style="float:right" class="yogi_v"><div class="img_box">
    <img src="/img/aplusprofile.png" alt="aplus logo">
</div></div>    
            <div class="clearboth">
                        <div><img src="/img/fe_img/resultline.png"></div>
            <p><br>Our company has been designing homes, apartments &amp; townhouses for the past two decades in the A.C.T. and N.S.W. This experience has allowed us to become a leading architecture firm, with great focus on the Multi-Residential sector. Due to our diverse team of designers, town planners, lawyers and Architects we are able to provide sophisticated and complex design solutions for all sectors of the Built Environment. With our head office based in Canberra, A.C.T. we are centrally located and conveniently placed to service both the Sydney, South Coast and Victorian regions.</p></div>

        </div>
        <div style="float:right">
        <a href="javascript:void(0);" onclick="js_show_profile('3796')"><img src="/files/profile_img/3796/4342_4_preview.jpg" alt="Feature Image"></a>
        </div>  
        <div class="clearboth">
            <div style="float:left;"><input type="image" src="/img/fe_img/btn_profileaplus.png" value="View profile" onclick="return js_show_profile('3796')" class="nopad">&nbsp;&nbsp;&nbsp;</div>
            <div style="float:left;"><input type="image" src="/img/fe_img/btn_awardsaplus.png" value="Awards" onclick="return js_show_awards('3796')" class="nopad">&nbsp;&nbsp;&nbsp;</div>
            <div id="idFavBtn_3796" style="padding-top:1px;"><a href="javascript: void(0)" onclick="js_addto_fav('3796','Hugh Gordon Architect P/L','1')"><img src="/img/addtofavaplus.png"></a></div>
        </div>
    </div>

By the way, I don't want to use XPath here. Thanks in advance.

1 answer:

Answer 0 (score: 1)

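Use the tail attribute. It holds the text that directly follows an element's closing tag, up to the next element. Here the .text of each <strong> is only the label ("Ph." or "Email"), while the value you want sits in its .tail.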

names = title.cssselect("h2")[0].text
phone = title.cssselect("div p > strong:contains('ph.')")[0].tail.strip()
email = title.cssselect("div p > strong:contains('Email')")[0].tail.strip()
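
To see the difference between .text and .tail in isolation, here is a minimal sketch that does not hit the website. It makes two assumptions that are not part of the original answer: it uses the standard library's xml.etree.ElementTree instead of lxml (lxml's element API follows the same ElementTree model, so .text and .tail behave the same way), and it writes the `<br>` from the question as `<br/>` because ElementTree needs well-formed XML. The `details` dictionary is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the <p> block from the question (<br> written as <br/>).
snippet = (
    "<p>"
    "<strong>Ph.</strong> 02 6253 4448<br/>"
    "<strong>Email</strong> info@hughgordon.com.au"
    "</p>"
)

p = ET.fromstring(snippet)

# Map each label (.text) to the value that follows it (.tail).
details = {}
for strong in p.iter("strong"):
    label = strong.text                  # text inside <strong>...</strong>
    value = (strong.tail or "").strip()  # text after </strong>, up to the next element
    details[label] = value

print(details)
# {'Ph.': '02 6253 4448', 'Email': 'info@hughgordon.com.au'}
```

The `or ""` guard is there because .tail is None when nothing follows the element, in which case calling .strip() directly would raise an AttributeError.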