Jsoup select does not return all elements

Asked: 2015-06-12 21:28:03

Tags: java scala web-scraping jsoup

I am new to the Jsoup library. I have HTML like this:

<tr class="srrowns"> 
 <td class="num"> <a name="y2015"> </a> 1 </td> 
 <td nowrap><a href="/cve/CVE-2015-4004/" title="CVE-2015-4004 security vulnerability details">CVE-2015-4004</a></td> 
 <td><a href="/cwe-details/119/cwe.html" title="CWE-119 - CWE definition">119</a></td> 
 <td class="num"> <b style="color:red"> </b> </td> 
 <td> DoS Overflow +Info </td> 
 <td>2015-06-07</td> 
 <td>2015-06-08</td> 
 <td>
  <div class="cvssbox" style="background-color:#ff8000">
   8.5
  </div></td> 
 <td align="center">None</td> 
 <td align="center">Remote</td> 
 <td align="center">Low</td> 
 <td align="center">Not required</td> 
 <td align="center">Partial</td> 
 <td align="center">None</td> 
 <td align="center">Complete</td> 
</tr>

When I run element.select("td"), it returns:

<td class="num"> <a name="y2015"> </a> 1 </td>
<td nowrap><a href="/cve/CVE-2015-4004/" title="CVE-2015-4004 security vulnerability details">CVE-2015-4004</a></td>
<td><a href="/cwe-details/119/cwe.html" title="CWE-119 - CWE definition">119</a></td>
<td class="num"> <b style="color:red"> </b> </td>
<td> DoS Overflow +Info </td>
<td>2015-06-07</td>
<td>2015-06-08</td>
<td>
 <div class="cvssbox" style="background-color:#ff8000">
  8.5
 </div></td>
<td align="center">None</td>
<td align="center">Remote</td>
<td align="center">Low</td>
<td align="center">Not required</td>
<td align="center">Partial</td>
<td align="center">Complete</td>

Clearly, the <td align="center">None</td> cell that should appear right before Complete has been dropped. Is there any way to get all of the items from the Jsoup selector?

My code in Scala looks like this:

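import org.jsoup.Jsoup

val connection = Jsoup.connect(url).get() // fetches and parses the page into a Document
val treelist = connection.select("tr.srrowns:contains(CVE-2015-4001)") // rows mentioning the CVE id
val tree = treelist.select("td") // cells of the matched rows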

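I just saw that Jsoup's select is implemented using a LinkedHashSet. My goal is to extract the text from each tag using Jsoup.text(). Is there a workaround, or do I have to write my own parser to get all nodes (including duplicates)?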
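The effect described above can be illustrated without Jsoup. The following is a minimal plain-Java sketch, assuming (as the output above suggests) that the selected elements are compared by content when they are collected into a LinkedHashSet. DedupSketch, viaSet and viaList are hypothetical names, and plain strings stand in for the td elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch, no Jsoup involved. Strings stand in for elements
// under the ASSUMPTION that elements are compared by their content.
class DedupSketch {

    // Collects cells into a LinkedHashSet: insertion order is kept,
    // but a cell equal to an earlier one is silently dropped.
    static List<String> viaSet(List<String> cells) {
        Set<String> unique = new LinkedHashSet<>(cells);
        return new ArrayList<>(unique);
    }

    // Collects cells into a list: every cell is kept, duplicates included.
    static List<String> viaList(List<String> cells) {
        return new ArrayList<>(cells);
    }

    public static void main(String[] args) {
        List<String> cells = Arrays.asList(
                "<td align=\"center\">None</td>",
                "<td align=\"center\">Remote</td>",
                "<td align=\"center\">None</td>");
        System.out.println(viaSet(cells).size());  // prints 2
        System.out.println(viaList(cells).size()); // prints 3
    }
}
```

If that assumption holds for the Jsoup version in use, the second None cell is lost at collection time, so any approach that gathers the cells into a list rather than a content-comparing set would keep both.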

Thank you very much.

1 Answer:

Answer 0: (score: 0)

Try this CSS selector:

tr.srrowns:has(td:contains(CVE-2015-4004)) > td

It matches each row with class srrowns that has a td containing the text CVE-2015-4004, then selects that row's direct td children.

Demo

http://try.jsoup.org/~vAgiHQY6TIJ5MSUzR-m_Y1GD5_U