Extracting text with jsoup from a "form" class whose page data varies

Time: 2015-12-16 06:46:13

Tags: java android html jsoup

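First post here, so I'll try to keep this to the point. I have been using Jsoup to extract data from a series of web pages to bring into an application. I've run into a page that updates its data dynamically based on the user's selection in a drop-down box. When I inspect the HTML in Chrome I can see the data, but I can't seem to extract it. I can extract all the text elements around it, but nothing that is generated dynamically comes out.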

The page I'm looking at has the following form class. Apologies for the wrapping, I couldn't get rid of it.



<form class="variations_form cart" method="post" enctype="multipart/form-data" data-product_id="8044" data-product_variations="[{&quot;variation_id&quot;:8047,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:19.70,&quot;display_regular_price&quot;:19.70,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;500g&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png&quot;,&quot;image_title&quot;:&quot;LABELS_500g-FOOD Vann&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/08\/LABELS_500g-FOOD-Vann.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>$19.70<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-500&quot;,&quot;weight&quot;:&quot;.5 
kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>500g<\/p>\n&quot;},{&quot;variation_id&quot;:8045,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:13.50,&quot;display_regular_price&quot;:13.50,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;1kg&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_1kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_1kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>$13.50<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-1kg&quot;,&quot;weight&quot;:&quot;1 
kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>1kg<\/p>\n&quot;},{&quot;variation_id&quot;:8046,&quot;variation_is_visible&quot;:true,&quot;variation_is_active&quot;:true,&quot;is_purchasable&quot;:true,&quot;display_price&quot;:199.95,&quot;display_regular_price&quot;:199.95,&quot;attributes&quot;:{&quot;attribute_size&quot;:&quot;3kg&quot;},&quot;image_src&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png&quot;,&quot;image_link&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png&quot;,&quot;image_title&quot;:&quot;LABELS_3kg-FOOD Van&quot;,&quot;image_alt&quot;:&quot;&quot;,&quot;image_srcset&quot;:&quot;http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-746x1024.png 746w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van-475x652.png 475w, http:\/\/www.sourcewebsite.com\/wp-content\/uploads\/2014\/09\/LABELS_3kg-FOOD-Van.png 1063w&quot;,&quot;image_sizes&quot;:&quot;(max-width: 475px) 100vw, 475px&quot;,&quot;price_html&quot;:&quot;<span class=\&quot;price\&quot;><span class=\&quot;amount\&quot;>$199.95<\/span><\/span>&quot;,&quot;availability_html&quot;:&quot;&quot;,&quot;sku&quot;:&quot;FOOD-Vanilla-3kg&quot;,&quot;weight&quot;:&quot;3 kg&quot;,&quot;dimensions&quot;:&quot;&quot;,&quot;min_qty&quot;:1,&quot;max_qty&quot;:&quot;&quot;,&quot;backorders_allowed&quot;:false,&quot;is_in_stock&quot;:true,&quot;is_downloadable&quot;:false,&quot;is_virtual&quot;:false,&quot;is_sold_individually&quot;:&quot;no&quot;,&quot;variation_description&quot;:&quot;<p>3kg<\/p>\n&quot;}]">

  <table class="variations" cellspacing="0">
    <tbody>
      <tr>
        <td class="label">
          <label for="size">Size</label>
        </td>
        <td class="value">
          <select id="size" class="" name="attribute_size" data-attribute_name="attribute_size">
            <option value="">Choose an option</option>
            <option value="500g">500g</option>
            <option value="1kg" selected="selected">1kg</option>
            <option value="3kg">3kg</option>
          </select><a class="reset_variations" href="#" style="visibility: visible; display: block;">Clear selection</a>	
        </td>
      </tr>
    </tbody>
  </table>

  <div class="angelleye_buton_box_relative" style="position: relative;">

    <div class="single_variation_wrap">
      <div class="woocommerce-variation-description" style="border: 1px solid transparent;">
        <p>1kg</p>
      </div>
      <div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span>
      </div>
      <div class="variations_button">
        <div class="quantity">
          <input type="number" step="1" name="quantity" value="1" title="Qty" class="input-text qty text" size="4" min="1">
        </div>
        <button type="submit" class="single_add_to_cart_button button alt">Add to basket</button>
        <input type="hidden" name="add-to-cart" value="8044">
        <input type="hidden" name="product_id" value="8044">
        <input type="hidden" name="variation_id" class="variation_id" value="8045">
      </div>
    </div>

    <div class="blockUI blockOverlay angelleyeOverlay" style="display:none;z-index: 1000; border: none; margin: 0px; padding: 0px; width: 100%; height: 100%; top: 0px; left: 0px; opacity: 0.6; cursor: default; position: absolute; background: url(http://www.sourcewebsite.com/wp-content/plugins/woocommerce/assets/images/select2-spinner.gif) 50% 50% / 16px 16px no-repeat rgb(255, 255, 255);"></div>
  </div>

</form>

I am trying to extract the price "13.50" from the div below.

<div class="single_variation"><span class="price"><span class="amount selectorgadget_selected">$13.50</span></span>
</div>

My code is as follows:

    private class ParseFoodPriceURL extends AsyncTask<String, Void, String> {

        @Override
        protected String doInBackground(String... strings) {
            StringBuffer buffer = new StringBuffer();
            try {
                // Fetch and parse the HTML exactly as the server returns it.
                Document doc = Jsoup.connect(strings[0]).get();
                Elements foodPrice = doc.select("div.single_variation_wrap > div.single_variation");
                String priceTextSelection = foodPrice.text();
                buffer.append("Price: $" + priceTextSelection);
            }
            catch (Throwable t) {
                t.printStackTrace();
            }
            return buffer.toString();
        }
    }

1 Answer:

Answer 0 (score: 1)

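JSoup is not a browser, so it does not interpret or execute JavaScript. If a site's content is generated dynamically, you cannot get at it directly with JSoup. Two options come to mind: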

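  1. Identify the AJAX calls directly and fetch the information through them. The response is usually JSON rather than HTML, so you may need an additional parsing library. This option is fast, but you have to investigate and understand how the web page works.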

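  2. Use Selenium WebDriver with a real browser engine (for example PhantomJS). This loads the site the way a real browser does, while giving you access to the content in a way similar to JSoup. It is relatively easy to program, but it is slow and uses a lot of resources. If you are running on Android, it may be too heavy. In any case, the right tool for Android appears to be Selendroid.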
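
A related observation on option 1, offered as a sketch rather than a definitive fix: in the markup quoted in the question, the `form` tag's `data-product_variations` attribute already carries every variation's size and `display_price` as JSON. Assuming that attribute is also present in the raw HTML the server sends (the question only shows Chrome's inspected DOM, so this is unverified), JSoup could read it with `doc.select("form.variations_form").attr("data-product_variations")`, which returns the entity-decoded string. The hypothetical `VariationPrices` helper below then maps each size to its price using only the standard library. Its regex assumes `display_price` appears before `attributes` inside each variation, as in the quoted markup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JSoup or of the asker's code.
final class VariationPrices {

    // Captures display_price, then the attribute_size that follows it.
    // [^{}]*? keeps the match inside one variation object's top-level fields.
    private static final Pattern VARIATION = Pattern.compile(
            "\"display_price\":([0-9.]+),[^{}]*?"
            + "\"attributes\":\\{\"attribute_size\":\"([^\"]+)\"\\}");

    // Returns size -> price (as written in the JSON), in document order.
    static Map<String, String> pricesBySize(String variationsJson) {
        Map<String, String> prices = new LinkedHashMap<>();
        Matcher m = VARIATION.matcher(variationsJson);
        while (m.find()) {
            prices.put(m.group(2), m.group(1));
        }
        return prices;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the decoded attribute value from the question.
        String json = "[{\"variation_id\":8047,\"display_price\":19.70,"
                + "\"display_regular_price\":19.70,"
                + "\"attributes\":{\"attribute_size\":\"500g\"}},"
                + "{\"variation_id\":8045,\"display_price\":13.50,"
                + "\"display_regular_price\":13.50,"
                + "\"attributes\":{\"attribute_size\":\"1kg\"}}]";
        System.out.println("Price: $" + pricesBySize(json).get("1kg"));
    }
}
```

A regex keeps the sketch dependency-free, but it is brittle; a real JSON parser (for instance the `org.json` classes bundled with Android) would be the more robust choice for production code.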