Scraping p tags with jsoup

Date: 2018-01-14 16:22:16

Tags: jsoup screen-scraping

<div class="mcsColumnsTwoOne">
    <h1>                     I B D Distribution Ltd                            </h1>
    <p>Certificate Number: NAP 28766</p>
    <p>Date Certified: 08/10/2010</p>

    <p>Consumer Code: RECC</p>
    <p>Membership Number: 00038340</p>

    <h2>Company Address</h2>
    <p>Unit 11 Enterprise Park,Black Moor Road,Verwood,Dorset, BH31 6YS</p>

    <h2>Contact Details</h2>
    <p>Telephone: 01202 825682</p>
        <p>Website: <a href="http://www.ibd-distribution.com" title="                     I B D Distribution Ltd                            ">www.ibd-distribution.com</a></p>
        <p>Email: <span id="cloakc973703bbc5107b52e9fd9a2faf77e96"><a href="mailto:darren@ibd-distribution.com">darren@ibd-distribution.com</a></span><script type="text/javascript">
                document.getElementById('cloakc973703bbc5107b52e9fd9a2faf77e96').innerHTML = '';
                var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
                var path = 'hr' + 'ef' + '=';
                var addyc973703bbc5107b52e9fd9a2faf77e96 = 'd&#97;rr&#101;n' + '&#64;';
                addyc973703bbc5107b52e9fd9a2faf77e96 = addyc973703bbc5107b52e9fd9a2faf77e96 + '&#105;bd-d&#105;str&#105;b&#117;t&#105;&#111;n' + '&#46;' + 'c&#111;m';
                var addy_textc973703bbc5107b52e9fd9a2faf77e96 = 'd&#97;rr&#101;n' + '&#64;' + '&#105;bd-d&#105;str&#105;b&#117;t&#105;&#111;n' + '&#46;' + 'c&#111;m';document.getElementById('cloakc973703bbc5107b52e9fd9a2faf77e96').innerHTML += '<a ' + path + '\'' + prefix + ':' + addyc973703bbc5107b52e9fd9a2faf77e96 + '\'>'+addy_textc973703bbc5107b52e9fd9a2faf77e96+'<\/a>';
        </script></p>
        <p>Contact: Darren Johnson</p>
    <p>Contact Position: Director</p>
    <hr>
    <h2 style="margin: 10px 0 0 0">Contact Installer</h2>

    <form name="contact" action="" method="post" class="formstyle" style="width: 100%">
        <fieldset>
            <p>(<em>*</em>) Denotes required field </p>
            <label for="name">Name <em>*</em></label>
            <input type="text" name="name" id="name" class="text" required="">
            <br class="clear">
            <div class="thepot">
                <label for="emailaddress">Email</label>
                <input type="text" name="emailaddress" id="emailaddress">
            </div>
            <label for="email">Email <em>*</em></label>
            <input type="email" name="email" id="email" class="text" required="">
            <br class="clear">
            <label for="telephone">Telephone</label>
            <input type="tel" name="telephone" id="telephone" class="text">
            <br class="clear">
            <label for="enquiry">Enquiry <em>*</em></label>
            <textarea name="enquiry" id="enquiry" rows="10" cols="10" required=""></textarea>
            <br class="clear">
            <input type="hidden" name="loadtime" value="1515943155">
            <input id="submitbutton" name="submitbutton" value="Submit" type="submit">
            <div class="thepot">
                    <label for="submitForm">submitForm</label>
                    <input type="text" name="submitForm" id="submitForm" value="">
            </div>
        </fieldset>
    </form>

    </div>

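I am trying to extract data from the HTML above with the Java jsoup library. However, when the 'Contact' p tag is empty I get an error: the 'Contact Position' text shows up in the 'Contact' column. How can I leave the Contact column blank when there is no contact, and keep the Contact Position p text in the last column? Your help is greatly appreciated.

     for (Element d : data) {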

               idrow++;

         String Consumers = d.select("h1").text();
         String CertificateNumberall = d.select("p:eq(1)").text();
         String CertificateNumber = CertificateNumberall.substring(CertificateNumberall.lastIndexOf(":") + 1);
         String DateCertifiedall = d.select("p:eq(2)").text();
         String DateCertified = DateCertifiedall.substring(DateCertifiedall.lastIndexOf(":") + 1);
         String ConsumerCodeAll = d.select("p:eq(3)").text();
         String ConsumerCode = ConsumerCodeAll.substring(ConsumerCodeAll.lastIndexOf(":") + 1);
         String MembershipNumberAll = d.select("p:eq(4)").text();
         String MembershipNumber = MembershipNumberAll.substring(MembershipNumberAll.lastIndexOf(":") + 1);
         String CompanyAddressAll = d.select("p:eq(6)").text();
         String CompanyAddress = CompanyAddressAll.substring(CompanyAddressAll.lastIndexOf(":") + 1);
         String TelephoneAll = d.select("p:eq(8)").text();
         String Telephone = TelephoneAll.substring(TelephoneAll.lastIndexOf(":") + 1);
         String WebsiteAll = d.select("p:eq(9) :not(span)").text();
         String Website = WebsiteAll.substring(WebsiteAll.lastIndexOf(":") + 1);
         String EmailAll = d.select("p:eq(10) span").text();
         String Email = EmailAll.substring(EmailAll.lastIndexOf(":") + 1);
         String ContactAll = d.select("p:eq(11)").text();
         String Contact = ContactAll.substring(ContactAll.lastIndexOf(":") + 1);
         String ContactPositionAll = d.select("p:eq(12)").next("hr").text();
         String ContactPosition = ContactPositionAll.substring(ContactPositionAll.lastIndexOf(":") + 1);
     }

1 Answer:

Answer 0 (score: 0)

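You want a blank value (String contact = null or String contact = "") when the 'Contact' p tag is empty. Is that right?

On the page you are scraping, the p elements have no class or id, so you cannot reliably identify a specific field such as 'Contact' by its position.

My suggestion is to "select elements that contain the specified text" (see the jsoup API documentation for selectors). If nothing matches, select() returns an empty Elements collection, so first() on it yields null and text() on it yields an empty string.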

Element contact = d.select("p:contains(Contact:)").first();
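The idea of looking a field up by its label rather than by its position can be sketched on plain strings, without jsoup. This is only an illustration under assumptions: the class name `FieldLookup` and the method `valueFor` are hypothetical, and the paragraph texts are assumed to have been collected already (for example from the p elements of the div).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: finds a field by its label instead of by its position.
class FieldLookup {

    // Returns the text after "label:" from the first paragraph that starts with
    // that label, or "" when no paragraph carries the label.
    static String valueFor(List<String> paragraphs, String label) {
        String prefix = label + ":";
        for (String p : paragraphs) {
            String text = p.trim();
            if (text.startsWith(prefix)) {
                return text.substring(prefix.length()).trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Paragraph texts as they might look when the "Contact" line is absent.
        List<String> paragraphs = Arrays.asList(
                "Telephone: 01202 825682",
                "Contact Position: Director");

        System.out.println("[" + valueFor(paragraphs, "Contact") + "]");          // prints "[]"
        System.out.println("[" + valueFor(paragraphs, "Contact Position") + "]"); // prints "[Director]"
    }
}
```

Because the prefix includes the colon, "Contact" does not accidentally match the "Contact Position" line, so a missing contact stays blank and the position keeps its own column.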

Alternatively, you can use :matches(regex):

Element contact = d.select("p:matches(^Contact:.*)").first();
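A regex can also split every "Label: value" paragraph in one pass, so each field is keyed by its label and a missing one simply has no entry. The sketch below works on plain strings and makes assumptions: `LabelParser` and `parse` are hypothetical names, and the input list stands in for the paragraph texts taken from the page.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns "Label: value" lines into a label -> value map.
class LabelParser {

    // Group 1 is the label (everything before the first colon), group 2 the value.
    private static final Pattern FIELD =
            Pattern.compile("^\\s*([^:]+?)\\s*:\\s*(.*?)\\s*$");

    static Map<String, String> parse(List<String> paragraphs) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String p : paragraphs) {
            Matcher m = FIELD.matcher(p);
            if (m.matches()) {
                fields.put(m.group(1), m.group(2));
            }
            // Lines without a colon (for example the address) are skipped here.
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> fields = parse(Arrays.asList(
                "Telephone: 01202 825682",
                "Contact Position: Director"));

        // A missing label falls back to an empty string.
        System.out.println("[" + fields.getOrDefault("Contact", "") + "]");
        System.out.println("[" + fields.getOrDefault("Contact Position", "") + "]");
    }
}
```

Splitting on the first colon, rather than on the last one as with lastIndexOf(":"), keeps values that themselves contain a colon intact.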

Also, by Java convention, variable names should start with a lowercase letter, e.g. 'contactAll' (this style is called lower camel case). These articles may help: The Java™ Tutorials - Variables - Naming / Using Java Naming Conventions

Happy coding!