Parsing an HTML table that has no ID

Time: 2018-08-07 20:00:08

Tags: java html parsing jsoup

I'm trying to get a value from http://www.dolarhoy.com/ using the following code:

  // Accumulates the page content; declared before the try block
  // because it is also used after it
  StringBuilder resultado = new StringBuilder();

  try {
     URL urlPagina = new URL(url);
     URLConnection urlConexion = urlPagina.openConnection();
     urlConexion.connect();

     // Create the reader used to read the response
     BufferedReader lector = new BufferedReader(new InputStreamReader(
           urlConexion.getInputStream(), "UTF-8"));
     String linea;

     while ((linea = lector.readLine()) != null) {
        resultado.append(linea);
        resultado.append("\n");
     }

  } catch (Exception e) {
     e.printStackTrace();
  }

  System.out.println("Contenido : \n\n" + resultado.toString());
  return resultado.toString();

}

Among the rest of the markup, I get this:

<td width='113' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#00ff00' size='2'>ACTUALIZADO</font>

  </div>

</td>

<td width='179' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#00ff00' size='2'><b>7/08/2018&nbsp;

    14:53 AR</b></font>

  </div>

</td>

<td width='82' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#00ff00' size='2'>COMPRA</font>

  </div>

</td>

<td width='110' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#000000' size='2'><b><font face='Courier New, Courier, mono' color='#FFCC00' size='4'>$&nbsp;

    26.93</font></b></font>

  </div>

</td>

<td width='85' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#00ff00' size='2'>VENTA</font>

  </div>

</td>

<td width='110' height='25'>

  <div align='center'>

    <font face='Verdana, Arial, Helvetica, sans-serif' color='#000000' size='2'><b><font face='Courier New, Courier, mono' color='#FFCC00' size='4'>$&nbsp;

    27.93</font></b></font>

  </div>

</td>

But I see that the HTML table has no id.

The value I need is the one highlighted in the screenshot from the original post: the value shown as "27.93" in the HTML above. (This value changes, so I need whatever is between the tags.)

I would really appreciate any help or solution. Thanks!

3 answers:

Answer 0: (score: 0)

Firefox can give you an XPath or a CSS selector for that element. Here is the XPath for the value:

/html/body/div[5]/center/table/tbody/tr/td[6]/div/font/b/font

Extract the value using an XPath library of your choice.
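As a minimal sketch of the XPath route, the example below uses the JDK's built-in `javax.xml.xpath` API. Instead of the absolute path above, it anchors on the "VENTA" label and takes the next cell, which may hold up better if the page layout shifts. The `VentaXPath` class and the `FRAGMENT` string are assumptions made for illustration: the fragment is a simplified, well-formed copy of the cells quoted in the question. The real page is probably not well-formed XML (for example, `&nbsp;` is not a predefined XML entity), so it would likely need an HTML-aware parser or a clean-up step before this works.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class VentaXPath {
    // Hypothetical well-formed fragment modeled on the cells in the question.
    // \u00a0 is the non-breaking space that &nbsp; stands for in HTML.
    static final String FRAGMENT =
          "<table><tr>"
        + "<td><div><font>COMPRA</font></div></td>"
        + "<td><div><font><b><font>$\u00a0 26.93</font></b></font></div></td>"
        + "<td><div><font>VENTA</font></div></td>"
        + "<td><div><font><b><font>$\u00a0 27.93</font></b></font></div></td>"
        + "</tr></table>";

    // Returns the text of the cell that immediately follows the "VENTA" cell.
    static String ventaText(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        String expr = "//td[normalize-space(.)='VENTA']/following-sibling::td[1]";
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // Prints the raw cell text, currency sign included
        System.out.println(ventaText(FRAGMENT));
    }
}
```

The returned text still contains the currency sign and the non-breaking space, so it needs a clean-up step before it can be used as a number.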

And here is a CSS selector that can be used with jsoup:

body > div:nth-child(7) > center:nth-child(1) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(1) > td:nth-child(6) > div:nth-child(1) > font:nth-child(1) > b:nth-child(1) > font:nth-child(1)

Answer 1: (score: 0)

Using jsoup pseudo-selectors, you can do the following:

    Document doc = Jsoup.connect("http://www.dolarhoy.com/").get();
    //select div element that contains specific text and is direct descendant of body
    Element title = doc.select("body > div:contains(PROMEDIO DE COTIZACIONES DE PIZARRAS AL PÚBLICO RELEVADAS POR)").first();
    //select next sibling element with summary
    Element summary = title.nextElementSibling();
    //select last cell with data needed
    String amount = summary.select("td").last().text();
    System.out.println(amount);


    //same as above - one-liner
    System.out.println(doc.select("body > div:contains(PROMEDIO DE COTIZACIONES DE PIZARRAS AL PÚBLICO RELEVADAS POR) + div td:last-child").text());

More information can be found here: https://jsoup.org/cookbook/extracting-data/selector-syntax
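The cell text in the question looks like `$&nbsp; 27.93`, so the string returned by a selector still has to be cleaned up before it can be used as a number. The following is a minimal sketch of that step using only the JDK. The `AmountParser` class and the `parseAmount` method are hypothetical names, and the code assumes a dot as the decimal separator with no thousands separator, as in the HTML shown in the question.

```java
import java.math.BigDecimal;

class AmountParser {
    // Converts cell text such as "$\u00a0 27.93" into a BigDecimal.
    // Assumption: dot as decimal separator and no thousands separator.
    static BigDecimal parseAmount(String cellText) {
        // Drop everything except digits and the dot: currency sign,
        // regular spaces and the non-breaking space (\u00a0) all go away
        String digits = cellText.replaceAll("[^0-9.]", "");
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("No amount found in: " + cellText);
        }
        return new BigDecimal(digits);
    }

    public static void main(String[] args) {
        System.out.println(parseAmount("$\u00a0 27.93")); // prints 27.93
    }
}
```

`BigDecimal` is used rather than `double` so the quoted price keeps its exact decimal digits.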

Answer 2: (score: 0)

With univocity-html-parser you can get everything from this page.

Just grab the element you need without worrying too much about its full path:

HtmlElement e = HtmlParser.parseTree(new UrlReaderProvider("http://www.dolarhoy.com/"));

String value = e.query()
            .match("td").withText("$*") //match a <td> with any text starting with a $
            .precededImmediatelyBy("td").withText("VENTA") //if found, it must have a <td> on its left, with text "VENTA"
            .getText().getValue(); // if found, get the text of the <td> and return the value as a String

This gave me a value of $ 28.17.

Now, to get all the values of all the tables as records:

HtmlEntityList entityList = new HtmlEntityList();
HtmlEntitySettings currency = entityList.configureEntity("currency");
// removes rows with unwanted data
currency.addRecordFilter((record, context) -> isValidRecord(record));


//the group enables the matching rules to run only on tables that have text
//"compra" and "venta". We add fields to the group.
Group currencyTable = currency.newGroup().startAt("table").containing("tr").withText("*Compra ", "*Venta ").endAtClosing("table");

//the currency name and time are in the same table cell. The matching rule is the same for both "currency" and "timestamp" fields
addIdentifierField(currencyTable, "currency", 0);
addIdentifierField(currencyTable, "timestamp", 1);

//captures the currency exchange business name
currencyTable.addPersistentField("exchange").match("td").underHeaderAtRow("td", 3).withExactText("EN $").getText();
//captures the currency purchase and sale price
currencyTable.addField("buy").match("td").withText("?*").underHeaderAtRow("td", 3).withExactText("Compra").getText();
currencyTable.addField("sell").match("td").withText("?*").underHeaderAtRow("td", 3).withExactText("Venta").getText();

//additional matching rules to get the dollar prices listed in the first table (it has id = "table2")
currencyTable.addPersistentField("exchange").match("table").id("table2").match("tr").matchFirst("td").withText("?*").getText();
currencyTable.addField("buy").match("table").id("table2").match("td").withText("?*").underHeader("td").withExactText("Compra").getText();
currencyTable.addField("sell").match("table").id("table2").match("td").withText("?*").underHeader("td").withExactText("Venta").getText();

HtmlParser parser = new HtmlParser(entityList);
Results<HtmlParserResult> results = parser.parse(new UrlReaderProvider("http://www.dolarhoy.com/"));

HtmlParserResult result = results.get("currency");
for (HtmlRecord record : result.iterateRecords()) {
    System.out.println(record.fillFieldMap(new LinkedHashMap<String, String>()));
}

The method addIdentifierField is defined as:

private void addIdentifierField(Group table, String field, final int pos) {
    //matches any <td> where the colspan attribute is 4, 5 or 6, then gets the text of the <b> element inside the <td>
    table.addPersistentField(field).match("td").attribute("colspan", 4, 5, 6).match("b").getText().transform(s -> splitCurrencyAndTime(s)[pos]);
}

The method splitCurrencyAndTime:

// splits the currency and timestamp at the top of each table. Finds the first
// non-letter character after counting multiple whitespaces and splits the string in two
private String[] splitCurrencyAndTime(String value) {
    int spaceCount = 0;
    for (int i = 0; i < value.length(); i++) {
        char ch = value.charAt(i);
        if (ch == ' ') {
            spaceCount++;
        } else if (spaceCount > 0 && !Character.isLetter(ch) && ch != '$') {
            String currency = value.substring(0, i).trim();
            String timestamp = value.substring(i).trim();
            return new String[]{currency, timestamp};
        }
    }

    //if no match then just return nulls
    return new String[2];
}
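To see how this split behaves, the sketch below copies the method into a small class (made static for convenience) and runs it on a few sample header strings. The class name `SplitDemo` and the inputs are hypothetical: the inputs are modeled on the currency and timestamp values in the output of this answer, not captured from the live page.

```java
import java.util.Arrays;

class SplitDemo {
    // Same logic as splitCurrencyAndTime above: split at the first character
    // that is neither a letter nor '$' and that appears after at least one space
    static String[] splitCurrencyAndTime(String value) {
        int spaceCount = 0;
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (ch == ' ') {
                spaceCount++;
            } else if (spaceCount > 0 && !Character.isLetter(ch) && ch != '$') {
                String currency = value.substring(0, i).trim();
                String timestamp = value.substring(i).trim();
                return new String[]{currency, timestamp};
            }
        }
        // No split point found: both positions stay null
        return new String[2];
    }

    public static void main(String[] args) {
        // Hypothetical header strings (currency name followed by a timestamp)
        String[] samples = {
            "EURO 15:35:39 HS. AR 10/08/18",
            "100 GRAMOS DE ORO 15:35:39 HS. AR 10/08/18",
            "EURO"
        };
        for (String sample : samples) {
            System.out.println(Arrays.toString(splitCurrencyAndTime(sample)));
        }
    }
}
```

Digits before the first space, as in "100 GRAMOS DE ORO", do not trigger the split because `spaceCount` is still zero at that point. A header without a timestamp yields two nulls.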

Finally, the method isValidRecord gets rid of results such as {currency=EURO, timestamp=15:35:39 HS. AR 10/08/18, exchange=MEJORES PRECIOS, buy=34.000, sell=34.702}:

private boolean isValidRecord(Record record){
    String exchange = record.getString("exchange");
    return exchange != null && !exchange.contains("MEJORES") && !exchange.contains("DolarHoy.com");
}

The output will be:

{currency=DÓLAR ESTADOUNIDENSE EN $, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe, buy=28.500, sell=29.500}
{currency=DÓLAR ESTADOUNIDENSE EN $, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Nación, buy=28.700, sell=29.700}
{currency=DÓLAR ESTADOUNIDENSE EN $, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio, buy=28.000, sell=29.300}
{currency=EURO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Nación, buy=34.000, sell=35.000}
{currency=EURO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=33.502, sell=34.702}
{currency=EURO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=31.400, sell=35.200}
{currency=REAL, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Nación, buy=7.0000, sell=8.0000}
{currency=REAL, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=6.8000, sell=7.4000}
{currency=REAL, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=6.6000, sell=7.5000}
{currency=PESO URUGUAYO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=0.89060, sell=1.01720}
{currency=PESO URUGUAYO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=0.75000, sell=1.00000}
{currency=PESO CHILENO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=0.04250, sell=0.05180}
{currency=PESO CHILENO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=0.03600, sell=0.04600}
{currency=GUARANÍ, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=0.00440, sell=0.00590}
{currency=GUARANÍ, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=0.00450, sell=0.00535}
{currency=FRANCO SUIZO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=24.7826, sell=31.0526}
{currency=FRANCO SUIZO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=21.2000, sell=29.4000}
{currency=LIBRA ESTERLINA, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=38.7228, sell=41.7729}
{currency=LIBRA ESTERLINA, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=35.4000, sell=44.3000}
{currency=YEN, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=0.2456, sell=0.2745}
{currency=DÓLAR CANADIENSE, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=18.950, sell=23.100}
{currency=PESO MEXICANO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=1.500, sell=1.930}
{currency=DÓLAR AUSTRALIANO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Montevideo Cambio S.A., buy=15.150, sell=21.900}
{currency=LIBRA ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=7267.50, sell=9292.50}
{currency=KRUGER RAND, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=30780.00, sell=39235.00}
{currency=CHILENO DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=18097.50, sell=22715.00}
{currency=100 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Ciudad, buy=null, sell=110636.00}
{currency=100 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=99180.00, sell=128325.00}
{currency=50 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Ciudad, buy=null, sell=55473.00}
{currency=50 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=49590.00, sell=65195.00}
{currency=20 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=19807.50, sell=25812.50}
{currency=10 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Banco Ciudad, buy=null, sell=11343.00}
{currency=10 GRAMOS DE ORO, timestamp=15:35:39 HS. AR 10/08/18, exchange=Cambio Alpe S.A., buy=9975.00, sell=13275.00}

Hope this helps.

Disclosure: I'm the author of this library. It is commercial and closed source, but it can save you a lot of development time.