How to parse data correctly using JSoup (Java)

Time: 2014-07-18 08:23:46

Tags: java parsing jsoup

I want to use JSoup (Java) to parse data (CompanyName, Location, jobDescription, ...) out of this HTML. I am stuck when trying to iterate over the job listings.

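The HTML excerpt below is one of many "joblisting" divs that I want to iterate over and extract data from. I just can't figure out how to iterate over these specific div elements. Sorry for the newbie question, but maybe someone can tell me which function to use. Is it select?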

<div class="between_listings"><!-- local.spacer --></div>

<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">


<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
     <a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
       <img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
     </a>
</div>


<div class="job_info">


<div class="h3 job_title">
   <a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
      <span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
   </a>
</div>

<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">

    <span itemprop="name">Delivery Hero Holding GmbH</span>

</div>

</div>




<div class="job_location_date">

    <div class="job_location target-location">
         <div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">


            <div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
                  <span itemprop="addressLocality"> Berlin</span>
            </div>


            <span class="location_actions">
                <a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
                   <span class="location-icon"><!-- --></span>
                   <span class="location-label">Google Maps</span>
                </a>
            </span>

          </div>
       </div>

       <div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>


<div class="job_actions">


</div>

</div>
<div class="between_listings"><!-- local.spacer --></div>


Java code:

File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1       
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");                          
Elements jobListingElements = ParseResult.select(".joblisting");        
for (Element jobListingElement: jobListingElements) {         
    jobListingElement.select(".companyName span[itemprop=\"name\"]");         
    // other element properties         
    System.out.println(jobListingElements);
}

Thanks!

1 answer:

Answer 0 (score: 2)

So you already have your Jsoup Document? If the CSS class joblisting does not appear anywhere else, this looks fairly easy.

Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
Elements elements = document.select(".joblisting");
for (Element element : elements) {
    Elements jobTitleElement = element.select(".job_title span");
    Elements companyNameElement = element.select(".company_name span[itemprop=name]");
    String companyName = companyNameElement.text();
    String jobTitle = jobTitleElement.text();

    System.out.println(companyName);
    System.out.println(jobTitle);
}

I don't know why the attribute selector [itemprop*=\"name\"] does not find the span (further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax).

Got it: span[itemprop=name], without any quotes or escaping. Other attributes or values should work the same way to make the selection more specific.
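To pull the remaining fields from the question (location and posting date) in the same loop, something along these lines might work. This is only a sketch against the HTML excerpt in the question: the class name JobListingParser, the helper extract, and the " | " output format are made up for illustration, and the selectors assume every listing has the same structure as the excerpt.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class JobListingParser {

    // Hypothetical helper: builds one "title | company | location | date"
    // line for each div that carries the "joblisting" class.
    static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element listing : document.select("div.joblisting")) {
            // Every select() call is scoped to the current listing only.
            String title = listing.select(".job_title span[itemprop=title]").text().trim();
            String company = listing.select(".company_name span[itemprop=name]").text().trim();
            String location = listing.select("span[itemprop=addressLocality]").text().trim();
            // The machine-readable date sits in the datetime attribute of <time>.
            String date = listing.select(".job_date_added time").attr("datetime");
            rows.add(title + " | " + company + " | " + location + " | " + date);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened stand-in for one listing from the question's HTML.
        String html =
            "<div id=\"joblisting-2944914\" class=\"joblisting listing-even\">"
          + "<div class=\"h3 job_title\"><a href=\"#\">"
          + "<span itemprop=\"title\">Junior Business Intelligence Analyst / CRM (m/f)</span>"
          + "</a></div>"
          + "<div class=\"h3 company_name\">"
          + "<span itemprop=\"name\">Delivery Hero Holding GmbH</span></div>"
          + "<div class=\"h3 locality\"><span itemprop=\"addressLocality\"> Berlin</span></div>"
          + "<div class=\"job_date_added\"><time datetime=\"2014-07-04\">04.07.14</time></div>"
          + "</div>";
        for (String row : extract(html)) {
            System.out.println(row);
        }
    }
}
```

Calling select on each listing element rather than on the whole document keeps the fields of one job together. If a field is missing from a listing, text() and attr() return an empty string instead of throwing, so the loop keeps going.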