Question

I'm sure the answer to this question is simple, but after hours of research and testing I still haven't solved it.

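Here is the problem. I recently started using Selenium to harvest information from a website that builds its tables dynamically. During testing I noticed some issues when reviewing the harvested data. After some data checks, I found that certain table fields are missing text, which causes errors that surface in the second part of my code. I decided to bypass these table entries in my code, but I still get errors, so my code is incorrect.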

# I'm obtaining the <td> tags in the table
# with this.
td = row.find_elements_by_xpath(".//td")

# I slice out the desired items this way
# This outputs a <class 'str'>
td[3].text

# I found that this item has no text in some 
# table rows, which causes issues. I have tried 
# using the following to catch and bypass these
# rows

if not td[3].text:
   pass
else:
  # run some code
  # harvest the entire row


if len(td[3].text) != 0:
  # run some code
  # harvest the entire row
else:
  pass 


if len(td[3].text) == 11:
  # run some code
  # harvest the entire row
else:
  pass 


if td[3].text != '':
  # run some code
  # harvest the entire row
else:
  pass 

# this element is the one that might be empty
td_time = row.find_element_by_xpath(".//td[4]/span/time")
if (len(td_time.text)) != 11:
   print ('no')
elif (len(td_time.text)) == 11:
   print ('yes')

The table I'm scraping has five columns. The last column contains dates, which are missing in some rows that hold older data.

# Example with date
<td headers="th-date th-4206951" class="td-date">
   <b class="cell-label ng-binding">Publish Date</b>
   <span class="cell-content"><time datetime="2019-06-05T00:00:00Z" class="ng-binding">04 Jun 2019</time></span>
</td>

# Example without date
<td headers="th-date th-2037023" class="td-date">
  <b class="cell-label ng-binding">Publish Date</b>
  <span class="cell-content"><time datetime="" class="ng-binding"></time></span>
</td>

None of these code examples catch the empty text blocks, which causes problems when post-processing the harvested data.

So my question is: how do I bypass elements obtained with XPath that have no text?

Answer 1

I would simply check the elements as follows.

rows = driver.find_elements_by_xpath("//table[starts-with(@id,'mytable')]/tbody/tr[not(td[string-length(normalize-space(text()))=0])]")
for r in rows:
    columns = r.find_elements_by_tag_name('td')
    for col in columns:
        print (col.text)
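
For the markup shown in the question, a Python-side check on the nested time element may be more dependable. XPath 1.0 string functions applied to text() only consider the cell's first text node, while the date in the question sits inside <span><time>. Below is a minimal sketch, assuming each row exposes Selenium's find_elements(by, value) method. The names rows_with_date and DATE_XPATH are hypothetical and do not come from the original answer.

```python
# "xpath" is the string value of selenium.webdriver.common.by.By.XPATH;
# it is used directly so this sketch needs no Selenium import.
# DATE_XPATH is an assumed locator based on the question's sample markup.
DATE_XPATH = ".//td[contains(@class, 'td-date')]//time"


def rows_with_date(rows):
    """Return (row, date_text) pairs for rows whose <time> element has text."""
    kept = []
    for row in rows:
        # find_elements returns an empty list instead of raising when
        # nothing matches, so rows without a <time> element are skipped too.
        times = row.find_elements("xpath", DATE_XPATH)
        date_text = times[0].text.strip() if times else ""
        if date_text:
            kept.append((row, date_text))
    return kept
```

Something like rows_with_date(driver.find_elements("xpath", "//table//tbody/tr")) would then yield only the rows worth harvesting; the table locator here is likewise an assumption about the page.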

Sample HTML:

<html><head></head><body><table border="1" id="mytable">
	<tbody><tr>
		<td>1</td>
		<td></td>
		<td>FR</td>
	</tr>
	<tr>
		<td>2</td>
		<td>SR</td>
		<td></td>
	</tr>
	<tr>
		<td></td>
		<td></td>
		<td>TR</td>
	</tr>
	<tr>
		<td>4</td>
		<td> </td>
		<td>Checking cell with only space</td>
	</tr>
	<tr>
		<td>5</td>
		<td>All</td>
		<td>Rows</td>
	</tr>
</tbody></table>
</body></html>
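
To see which rows of this sample survive the filter without opening a browser, the same rule can be approximated offline. This is a rough sketch using Python's standard xml.etree.ElementTree, which does not support the not() or string-length() XPath functions, so the blank-cell test is done in Python instead. The function name complete_rows is hypothetical, and the sample is reduced to the table element.

```python
import xml.etree.ElementTree as ET

# The sample table from above, trimmed to the <table> element.
SAMPLE = """
<table border="1" id="mytable"><tbody>
<tr><td>1</td><td></td><td>FR</td></tr>
<tr><td>2</td><td>SR</td><td></td></tr>
<tr><td></td><td></td><td>TR</td></tr>
<tr><td>4</td><td> </td><td>Checking cell with only space</td></tr>
<tr><td>5</td><td>All</td><td>Rows</td></tr>
</tbody></table>
"""


def complete_rows(markup):
    """Return the cell texts of every row in which no cell is blank."""
    root = ET.fromstring(markup.strip())
    result = []
    for tr in root.iter("tr"):
        # itertext() gathers nested text as well; strip() treats a cell
        # holding only whitespace as empty, like normalize-space() does.
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        if cells and all(cells):
            result.append(cells)
    return result


print(complete_rows(SAMPLE))  # [['5', 'All', 'Rows']]
```

Only the fifth row is kept: rows 1 to 3 each contain an empty cell, and row 4 contains a cell with only a space.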

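Here is the JavaScript to get all rows that have no empty cells. It runs in the browser's DevTools console, where $x is the built-in XPath helper rather than a jQuery function.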

var list_of_cells =[];
$x("//table[starts-with(@id,'mytable')]/tbody/tr[not(td[string-length(normalize-space(text()))=0])]").forEach(function(row){
 var colData= [];
 row.childNodes.forEach(function(col){
 if(col.nodeType!=3){
    colData.push(col.textContent.trim())}
 })
list_of_cells.push(colData);
} );
console.log(list_of_cells);

Skipping table rows that contain cells with no text using Selenium and XPath
