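I'm sure the answer to this is simple, but after hours of research and testing I still haven't solved it.
Here is the problem. I recently started using Selenium to collect information from a website that builds its tables dynamically. During testing I noticed problems when reviewing the collected data. After some checking, I found that certain table fields have no text, which causes errors that show up in the second part of my code. I decided to skip these table entries in my code, but I still get the errors, so my code must be wrong.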
# I'm obtaining the <td> tags in the table
# with this.
td = row.find_elements_by_xpath(".//td")
# I slice out the desired items this way
# This outputs a <class 'str'>
td[3].text
# I found that this item has no text in some
# table rows, which causes issues. I have tried
# using the following to catch and bypass these
# rows
if not td[3].text:
    pass
else:
    # run some code
    # harvest the entire row

if len(td[3].text) != 0:
    # run some code
    # harvest the entire row
else:
    pass

if len(td[3].text) == 11:
    # run some code
    # harvest the entire row
else:
    pass

if td[3].text != '':
    # run some code
    # harvest the entire row
else:
    pass
# this element is the one that might be empty
td_time = row.find_element_by_xpath(".//td[4]/span/time")
if len(td_time.text) != 11:
    print('no')
elif len(td_time.text) == 11:
    print('yes')
The table I'm scraping has five columns. The last column contains a date, which is missing from some rows that hold older data.
# Example with date
<td headers="th-date th-4206951" class="td-date">
<b class="cell-label ng-binding">Publish Date</b>
<span class="cell-content"><time datetime="2019-06-05T00:00:00Z" class="ng-binding">04 Jun 2019</time></span>
</td>
# Example without date
<td headers="th-date th-2037023" class="td-date">
<b class="cell-label ng-binding">Publish Date</b>
<span class="cell-content"><time datetime="" class="ng-binding"></time></span>
</td>
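None of these code samples catches the empty text block, which causes problems when I post-process the collected data.
So my question is: how do I skip elements obtained with XPath that have no text?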
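One thing that may be worth checking, sketched offline here with Python's standard-library `xml.etree.ElementTree` rather than Selenium: in the snippet without a date, the `<td>` still contains the `Publish Date` label, so the text of the whole cell is not empty even though the `<time>` element is. This only illustrates the two snippets above. Selenium's `.text` returns rendered text, so whether the label shows up on the live page is an assumption that depends on the site's CSS. The helper names `cell_text` and `time_text` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The two <td> snippets from the question, copied as-is.
WITH_DATE = """
<td headers="th-date th-4206951" class="td-date">
<b class="cell-label ng-binding">Publish Date</b>
<span class="cell-content"><time datetime="2019-06-05T00:00:00Z" class="ng-binding">04 Jun 2019</time></span>
</td>
"""

WITHOUT_DATE = """
<td headers="th-date th-2037023" class="td-date">
<b class="cell-label ng-binding">Publish Date</b>
<span class="cell-content"><time datetime="" class="ng-binding"></time></span>
</td>
"""


def cell_text(td_markup):
    """Return all text inside the <td>, label included, whitespace-normalized."""
    td = ET.fromstring(td_markup.strip())
    return " ".join("".join(td.itertext()).split())


def time_text(td_markup):
    """Return only the text of the nested <time> element ('' if blank or absent)."""
    td = ET.fromstring(td_markup.strip())
    time_el = td.find(".//time")
    if time_el is None or time_el.text is None:
        return ""
    return time_el.text.strip()


for name, markup in (("with date", WITH_DATE), ("without date", WITHOUT_DATE)):
    print(name, "| cell:", repr(cell_text(markup)), "| time:", repr(time_text(markup)))
```

If the live page behaves the same way, a check on the whole cell would never see an empty string, while a check on the `<time>` element would.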
Answer 0 (score: 1)
I would simply check the elements like this.
rows = driver.find_elements_by_xpath("//table[starts-with(@id,'mytable')]/tbody/tr[not(td[string-length(normalize-space(text()))=0])]")
for r in rows:
    columns = r.find_elements_by_tag_name('td')
    for col in columns:
        print(col.text)
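For markup like the question's, where the date sits in a nested `<time>` element, a per-row check on that element may be more direct. In XPath 1.0, `normalize-space(text())` only looks at the first text node child of each `<td>`, so text inside nested elements is not considered. The sketch below is an alternative under stated assumptions, not the method above: `has_date` and `harvest_rows` are hypothetical helper names, the `td-date` class is taken from the question's snippet, and the calls use the Selenium 4 style `find_elements(by, value)` (the `find_elements_by_*` helpers were removed in Selenium 4.3). The plural `find_elements` returns an empty list instead of raising when nothing matches.

```python
# Same value as selenium.webdriver.common.by.By.XPATH; a plain string is used
# here so the sketch can be read and tried without importing Selenium.
XPATH = "xpath"

# Assumption: the date cell carries the class "td-date", as in the question.
TIME_XPATH = ".//td[contains(@class, 'td-date')]//time"


def has_date(row):
    """Return True if the row has a <time> element with non-blank text."""
    matches = row.find_elements(XPATH, TIME_XPATH)
    return any(el.text.strip() for el in matches)


def harvest_rows(rows):
    """Collect the cell texts of every row that has a date; skip the rest."""
    harvested = []
    for row in rows:
        if not has_date(row):
            continue  # no date in this row, so bypass it
        cells = row.find_elements(XPATH, "./td")
        harvested.append([cell.text.strip() for cell in cells])
    return harvested
```

With a live driver this might be called as `harvest_rows(driver.find_elements("xpath", "//table//tr"))`, where the table locator is a placeholder to adapt to the real page.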
Sample HTML:
<html><head></head><body><table border="1" id="mytable">
<tbody><tr>
<td>1</td>
<td></td>
<td>FR</td>
</tr>
<tr>
<td>2</td>
<td>SR</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>TR</td>
</tr>
<tr>
<td>4</td>
<td> </td>
<td>Checking cell with only space</td>
</tr>
<tr>
<td>5</td>
<td>All</td>
<td>Rows</td>
</tr>
</tbody></table>
</body></html>
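To see which rows the filter is meant to keep, here is an offline sketch that applies the same "no blank cells" rule to a condensed copy of the sample table, using Python's standard-library `xml.etree.ElementTree`. It does not run the XPath expression itself (ElementTree supports only a subset of XPath, without `string-length` or `normalize-space`); `rows_without_blank_cells` is a hypothetical helper that re-implements the rule in plain Python.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the sample table (same cells, fewer line breaks).
SAMPLE = """<table border="1" id="mytable">
<tbody>
<tr><td>1</td><td></td><td>FR</td></tr>
<tr><td>2</td><td>SR</td><td></td></tr>
<tr><td></td><td></td><td>TR</td></tr>
<tr><td>4</td><td> </td><td>Checking cell with only space</td></tr>
<tr><td>5</td><td>All</td><td>Rows</td></tr>
</tbody>
</table>"""


def rows_without_blank_cells(table_markup):
    """Return the cell texts of every row whose cells all contain non-blank text."""
    table = ET.fromstring(table_markup)
    kept = []
    for tr in table.iter("tr"):
        # Whitespace-only cells count as blank, mirroring normalize-space().
        texts = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if texts and all(texts):
            kept.append(texts)
    return kept


print(rows_without_blank_cells(SAMPLE))  # [['5', 'All', 'Rows']]
```

Only the last row survives: rows 1 to 3 each have an empty cell, and row 4 has a cell that contains only a space.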
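Here is the JavaScript that gets all rows with no empty cells. It uses `$x`, the XPath helper of the browser's DevTools console (not jQuery), so run it from the console.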
var list_of_cells = [];
$x("//table[starts-with(@id,'mytable')]/tbody/tr[not(td[string-length(normalize-space(text()))=0])]").forEach(function (row) {
    var colData = [];
    row.childNodes.forEach(function (col) {
        // nodeType 3 is a text node; skip the whitespace between <td> elements
        if (col.nodeType != 3) {
            colData.push(col.textContent.trim());
        }
    });
    list_of_cells.push(colData);
});
console.log(list_of_cells);