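I'm writing a web scraper that collects data from a sports website. The page has several tables, and I want to write the text of each tr into an array. For some rows I can't get the full text.
When I debug at a breakpoint placed after t = ...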
element_table = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.XPATH, '//table//tbody//tr')))
for count, e in enumerate(element_table):
    if count > 3:
        line = e.text.splitlines()
        t = e.text
the debugger shows the text of e as
text= {str} 'Salzburg\n4-3-1-2\n57%\n2 1.42\n14/4\n28.57%\n594/489\n82.32%\n66.7\n130\n12/43/75\n108\n38/48/22\n210/85\n40.48%'
But when I look at t
t = {str} 'Salzburg\n4-3-1-2\n2 1.42\n14/4\n594/489\n66.7\n130\n108\n210/85',
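So does element.text not give me all the text in the tr? It also only happens for a few rows.
Here is a row that doesn't work, followed by a row that works: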
<tr>
<td>Salzburg</td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>4-3-1-2</em><small>57%</small></span></td>
<td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">2</span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.42</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/4</em><small> 28.57%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>594/489</em><small> 82.32%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">66.7</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>130</em><small>12/43/75</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>108</em><small>38/48/22</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>210/85</em><small> 40.48%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
</tr>
<tr>
<td>Sturm Graz</td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>3-4-3</em><small>80%</small></span></td>
<td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">3</span><div class="Index__video-cell-icon___3Pnub"></div></td><td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.73</span></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/7</em><small> 50%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>484/400</em><small> 82.64%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">49.41</span></td><td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>128</em><small>9/50/69</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>101</em><small>33/50/18</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
<td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>228/87</em><small> 38.16%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
</tr>
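Answer 0 (score: 0)
Well, I can't reproduce the problem you report using Python 2.7.10. If I had to speculate: you mentioned that you look at "t" later in the debugger... does any other code manipulate "t"?
I'd also suggest that, if you want to split out all the different components of each row, you treat those "em" and "small" elements as separate entities. Here is some demo code: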
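from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# The lines below are the body of a unittest.TestCase method;
# self.driver is assumed to be a WebDriver created in setUp().
self.driver.get('file://path_to_html_from_above/text_attribute_missing_td_content.html')
rows = WebDriverWait(self.driver, 5).until(
    EC.presence_of_all_elements_located((By.XPATH, '//table/tbody/tr')))
for count, e in enumerate(rows):
    line = e.text.splitlines()
    t = e.text
    # this demonstrates that they have the same content
    self.assertEqual(line, t.splitlines())
# storing a list of lists
# representing rows of text, but splitting the content of the td in two if it has em and small HTML elements
table_content = list()
for row in rows:
    # pull out each column
    # (find_element(s) with a By locator is used because the find_element_by_* helpers
    # are no longer available in recent Selenium 4 releases)
    cols = row.find_elements(By.XPATH, './td')
    r = list()
    for col in cols:
        # if the column has em and small elements grab those
        try:
            em = col.find_element(By.TAG_NAME, 'em')
            small = col.find_element(By.TAG_NAME, 'small')
        except NoSuchElementException:
            # otherwise, just get straight text
            r.append(col.text)
        else:
            # append only after both lookups succeeded, so a cell is never listed twice
            r.extend([em.text, small.text])
    table_content.append(r)
print(table_content)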
Note that I changed your XPath so that it matches only the trs found directly under table/tbody.
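To rule out the markup itself, the row from the question can also be inspected without a browser. The following is only a minimal sketch using the standard library's html.parser. The class name CellTextParser and the shortened SAMPLE_ROW (class attributes dropped, only the first three cells kept) are illustrative and not part of the original post.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the non-empty text nodes of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list per <tr>
        self._cell = None   # text nodes of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append(self._cell)
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._cell is not None:
            self._cell.append(text)


# Shortened, hypothetical copy of the "Salzburg" row from the question.
SAMPLE_ROW = (
    "<tr><td>Salzburg</td>"
    "<td><span><em>4-3-1-2</em><small>57%</small></span></td>"
    "<td><span>2</span><div></div></td></tr>"
)

parser = CellTextParser()
parser.feed(SAMPLE_ROW)
parsed_rows = parser.rows
print(parsed_rows)
```

If the small values show up here but not in e.text, the difference comes from how the browser renders the row rather than from the HTML.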
Answer 1 (score: 0)
For some reason, using Firefox instead of the Chrome driver solved the problem.
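One possible explanation, offered as an assumption because the site's stylesheet isn't shown: WebElement.text returns only the text that is rendered as visible, so if the small elements are hidden in one browser (for instance at a narrower window size), they would be dropped there. Reading the textContent property does not depend on visibility. In the sketch below, cell_parts and row_texts are hypothetical helper names, and the locator strings "xpath" and "css selector" are the values of By.XPATH and By.CSS_SELECTOR, written out so the snippet has no imports.

```python
def cell_parts(cell):
    """Return the text of a cell's <em>/<small> children, or of the cell itself.

    textContent is read instead of .text, so text hidden by CSS is included.
    """
    # "css selector" is the value of selenium's By.CSS_SELECTOR
    parts = cell.find_elements("css selector", "em, small")
    targets = parts or [cell]  # plain cells have no em/small children
    return [(el.get_attribute("textContent") or "").strip() for el in targets]


def row_texts(row):
    """Return one list of text fragments per <td> of the given <tr> element."""
    # "xpath" is the value of selenium's By.XPATH
    return [cell_parts(cell) for cell in row.find_elements("xpath", "./td")]
```

With the rows located as in the question, `table_content = [row_texts(row) for row in element_table]` would then build the nested list in either browser.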