Jsoup: getting text from a website

Time: 2015-12-16 01:07:21

Tags: java jsoup

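I can already navigate the website and collect all the links I want, but my main goal is to get the hotel reviews. The site I am using is http://www.booking.com/hotel/pt/park-italia-flat.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=637e7af0c3009aa9ea132a960e2d2d40;dcid=4;ucfs=1;room1=A,A;srfid=b8260a1c264a3873291a9061733a43536a4d35c2X979#tab-reviews

Getting this far with jsoup was no problem, but now I don't know how to get the text. I have already tried getElementsByTag, getText, and other solutions. Can this be done with jsoup, or do I need another library?

This is how I am trying to get the text, but the text that comes out is not what I want.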

        Document doc ;
        try {
            doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
            Elements scriptElements = doc.getElementsMatchingText("span");
            for (Element link : scriptElements ) {
                System.out.printf(" Text: <%s> \n", link.text());
            }

        } catch (IOException ex) {
            Logger.getLogger(GetComentsThread.class.getName()).log(Level.SEVERE, null, ex);
        }

To get the URLs I use something like this.

        Pattern pattern = Pattern.compile("src=destinationfinder");
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            Matcher matcher = pattern.matcher(link.attr("abs:href"));
            if (matcher.find()) {
                dest = link.attr("abs:href");
                break;
            }
        }

Now I can get some reviews, but only the positive ones, and I don't know why.

        doc = Jsoup.connect(pair.getValue().toString() + "#tab-reviews").get();
        //doc = Jsoup.connect("http://www.booking.com/hotel/pt/pestanaportohotel.pt-pt.html?label=gen173nr-17CAEoggJCAlhYSDNiBW5vcmVmaLsBiAEBmAEvuAEEyAEE2AEB6AEB-AEL;sid=cff2dddd95e71c0768847a554584c888;dcid=4;dist=0;group_adults=2;room1=A%2CA;sb_price_type=total;srfid=798bd6b01ea1dba53ee6b6b945dda1f623859730X2;type=total;ucfs=1&#tab-reviews").get();
        String teste = "p.trackit";

        Elements scriptElements = doc.select(teste);
        for (Element link : scriptElements) {
            //System.out.printf(" Text: <%s> ...%s\n", link.text(),link.attr("class=\"review_pos\""));
            System.out.printf(" Text: <> ...%s\n", link.text());
        }

2 answers:

Answer 0 (score: 1)

The reviews are loaded with an AJAX request to another url.

There you can get all the information you need.
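As a side note on why appending "#tab-reviews" to the hotel URL makes no difference: the fragment (everything after `#`) is interpreted by the browser and is not part of what an HTTP client requests from the server. The sketch below only illustrates that split with the standard `java.net.URI` class; the class name `FragmentDemo`, the helper `withoutFragment`, and the shortened URL are assumptions made for the example.

```java
import java.net.URI;

public class FragmentDemo {
    // Returns the URL without its "#fragment" part, i.e. the part that
    // identifies the resource actually requested from the server.
    static String withoutFragment(String url) {
        String fragment = URI.create(url).getRawFragment();
        if (fragment == null) {
            return url; // nothing after '#', nothing to strip
        }
        // Drop the fragment and the '#' character in front of it.
        return url.substring(0, url.length() - fragment.length() - 1);
    }

    public static void main(String[] args) {
        // Shortened version of the hotel URL from the question (assumption).
        String url = "http://www.booking.com/hotel/pt/park-italia-flat.pt-pt.html#tab-reviews";
        System.out.println("fragment:  " + URI.create(url).getFragment());
        System.out.println("requested: " + withoutFragment(url));
    }
}
```

So the hotel page fetched with and without "#tab-reviews" is the same document, and the review markup has to come from the separate request.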

The response from that url:

<li class="
  review_item
  clearfix
  ">
  <p class="review_item_date">
    16 de Setembro de 2015
  </p>
  <div class="review_item_reviewer">
    <h4>
      Beatriz
    </h4>
    <span class="reviewer_country">
    <span class="reviewer_country_flag sflag slang-br">
    </span>
    Brasil
    </span>
  </div>
  <!-- .review_item_reviewer -->
  <div class="review_item_review">
    <div class="
      review_item_review_container
      lang_ltr
      seo_reviews_item
      ">
      <div class="review_item_review_header">
        <div class="
          review_item_header_score_container
          ">
          <div class="review_item_review_score jq_tooltip high_score_tooltip" title="
            Excepcional
            ">
            9,6
          </div>
        </div>
        <div class="review_item_header_content_container">
          <div class="review_item_header_content seo_review_title">
            Excepcional
          </div>
        </div>
      </div>
      <ul class="review_item_info_tags">
        <li class="review_info_tag"><span class="bullet">&bull;</span> Viagem de lazer</li>
        <li class="review_info_tag"><span class="bullet">&bull;</span> Família</li>
        <li class="review_info_tag"><span class="bullet">&bull;</span> Apartamento com Varanda</li>
        <li class="review_info_tag"><span class="bullet">&bull;</span> Ficou 5 noites</li>
        <li class="review_info_tag"><span class="bullet">&bull;</span> Submetido através de dispositivo móvel</li>
      </ul>
      <div class="review_item_review_content">
        <p class="review_pos"><i class="review_item_icon">&#45575;</i>Conforto, perto do centro, perto de um lindo mercado, bem decorado, com todo material necessário para fazer as refeições, Wi-Fi excelente</p>
      </div>
    </div>
  </div>
</li>
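A minimal sketch of how markup like the response above could be read with jsoup, assuming it has already been downloaded into a string. The class name `ReviewItemParser` and the method `summarize` are made up for the example, and the embedded HTML is a trimmed copy of the response; the selectors use only class names that appear in it.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ReviewItemParser {
    // Builds one "date | reviewer | score | title" line per li.review_item.
    static List<String> summarize(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element item : doc.select("li.review_item")) {
            // text() collapses the newlines and indentation inside each element.
            String date = item.select("p.review_item_date").text();
            String reviewer = item.select("div.review_item_reviewer h4").text();
            String score = item.select("div.review_item_review_score").text();
            String title = item.select("div.review_item_header_content").text();
            lines.add(date + " | " + reviewer + " | " + score + " | " + title);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Trimmed copy of the response shown above.
        String html = "<ul><li class=\"review_item clearfix\">"
                + "<p class=\"review_item_date\"> 16 de Setembro de 2015 </p>"
                + "<div class=\"review_item_reviewer\"><h4> Beatriz </h4></div>"
                + "<div class=\"review_item_review_score\"> 9,6 </div>"
                + "<div class=\"review_item_header_content seo_review_title\"> Excepcional </div>"
                + "</li></ul>";
        for (String line : summarize(html)) {
            System.out.println(line);
        }
    }
}
```

Selecting relative to each `li.review_item` keeps the fields of one review together when the response contains several items.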

Answer 1 (score: 0)

It looks like you just need to use jsoup to get the content of the elements with class="review_pos" and class="review_neg".
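A rough sketch of that suggestion, assuming the review HTML is already available as a string and that negative reviews use a `p` element like the positive one shown in the other answer. The class name `ReviewTextExtractor`, the method `extract`, and the negative sample text are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class ReviewTextExtractor {
    // Collects the text of every p.review_pos and p.review_neg element,
    // prefixed with a label that says which of the two classes matched.
    static List<String> extract(String html) {
        List<String> reviews = new ArrayList<>();
        // The comma means "either selector", so one query covers both classes.
        for (Element p : Jsoup.parse(html).select("p.review_pos, p.review_neg")) {
            String label = p.hasClass("review_pos") ? "POSITIVE" : "NEGATIVE";
            // ownText() skips the text of child elements such as the <i> icon.
            reviews.add(label + ": " + p.ownText().trim());
        }
        return reviews;
    }

    public static void main(String[] args) {
        // Sample markup; the review_neg paragraph is made up for illustration.
        String html = "<div class=\"review_item_review_content\">"
                + "<p class=\"review_pos\"><i class=\"review_item_icon\">+</i>Conforto, perto do centro</p>"
                + "<p class=\"review_neg\"><i class=\"review_item_icon\">-</i>Barulho</p>"
                + "</div>";
        for (String review : extract(html)) {
            System.out.println(review);
        }
    }
}
```

Selecting on both classes, rather than on a single class such as `p.trackit`, is what lets the loop see negative reviews as well as positive ones.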