Question

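I am writing a web scraper that collects data from a sports website. There are some tables, and I want to write the text of each tr into an array. For some rows, I cannot get the full text.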

When debugging at a breakpoint just after the t = ... line:

element_table = WebDriverWait(driver, 20).until(
                EC.presence_of_all_elements_located((By.XPATH, '//table//tbody//tr')))

for count, e in enumerate(element_table):
    if count > 3:
        line = e.text.splitlines()
        t = e.text

the text attribute of e shown in the debugger is

text= {str} 'Salzburg\n4-3-1-2\n57%\n2 1.42\n14/4\n28.57%\n594/489\n82.32%\n66.7\n130\n12/43/75\n108\n38/48/22\n210/85\n40.48%'

But when I look at t, I see

t = {str} 'Salzburg\n4-3-1-2\n2 1.42\n14/4\n594/489\n66.7\n130\n108\n210/85',

So does element.text not give me all the text in the tr? It also only happens for a few rows.

A row that fails, followed by a row that works:

<tr>
<td>Salzburg</td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>4-3-1-2</em><small>57%</small></span></td>
            <td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">2</span><div class="Index__video-cell-icon___3Pnub"></div></td>
            <td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.42</span></td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/4</em><small> 28.57%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>594/489</em><small> 82.32%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
            <td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">66.7</span></td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>130</em><small>12/43/75</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>108</em><small>38/48/22</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
            <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>210/85</em><small> 40.48%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
            </tr>

        <tr>
        <td>Sturm Graz</td>
        <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>3-4-3</em><small>80%</small></span></td>
        <td class="Index__video-cell___s1IHu"><span class="Index__stat-wrapper___n5jnZ">3</span><div class="Index__video-cell-icon___3Pnub"></div></td><td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">1.73</span></td>
        <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>14/7</em><small> 50%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
        <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>484/400</em><small> 82.64%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
        <td class="Index__simple-cell-widget___1BYWx"><span class="Index__stat-wrapper___n5jnZ">49.41</span></td><td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>128</em><small>9/50/69</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
        <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>101</em><small>33/50/18</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
        <td class="Index__video-cell-widget___3PDlg"><span class="Index__stat-wrapper___n5jnZ"><em>228/87</em><small> 38.16%</small></span><div class="Index__video-cell-icon___3Pnub"></div></td>
</tr>

Answer 1

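Well, I could not reproduce the problem you reported using Python 2.7.10. If I had to speculate: you mentioned that you look at "t" later in the debugger... is some other code manipulating "t"?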

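I would also suggest that, if you want to split out all the different components of each row, you treat the "em" and "small" elements as separate entities. Here is some demo code: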

    self.driver.get('file://path_to_html_from_above/text_attribute_missing_td_content.html')
    rows = WebDriverWait(self.driver, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//table/tbody/tr')))

    for count, e in enumerate(rows):
        line = e.text.splitlines()
        t = e.text
        # this demonstrates that they have the same content
        self.assertEqual(line, t.splitlines())

    # storing a list of lists
    # representing rows of text, but splitting the content of the td in two if it has em and small HTML elements
    table_content = list()
    for row in rows:
        # pull out each column
        cols = row.find_elements(By.XPATH, './td')
        r = list()
        for col in cols:
            # if the column has both em and small elements, grab those
            try:
                em = col.find_element(By.TAG_NAME, 'em')
                small = col.find_element(By.TAG_NAME, 'small')
                # append only after both lookups succeed, so a cell with an
                # em but no small is not recorded twice
                r.append(em.text)
                r.append(small.text)
            except NoSuchElementException:
                # otherwise, just get straight text
                r.append(col.text)

        table_content.append(r)


    print(table_content)

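Note that I changed your XPath to '//table/tbody/tr', so that it specifically finds only the tr elements directly under table/tbody.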

Answer 2

For some reason, using Firefox instead of the Chrome driver solved the problem.
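
One possible explanation for the browser-dependent result, offered as an assumption and not confirmed anywhere in this thread: WebElement.text returns only the text the browser treats as rendered, so an element that is hidden or not laid out at the moment it is read can come back empty. A minimal sketch of a workaround is to read the DOM textContent property instead, which ignores visibility. The helper name full_text is hypothetical; get_attribute is the regular WebElement method.

```python
def full_text(element):
    """Return an element's text from the DOM, whether or not it is rendered.

    `element` is expected to be a Selenium WebElement (anything exposing a
    get_attribute method works). This is a sketch, not a confirmed fix.
    """
    # textContent is read from the DOM and is not affected by CSS visibility,
    # unlike WebElement.text, which reflects rendered text only.
    raw = element.get_attribute("textContent") or ""
    # Collapse runs of whitespace and trim, so ' 28.57%' becomes '28.57%'.
    return " ".join(raw.split())
```

In the loop from Answer 1, this would replace em.text and small.text with full_text(em) and full_text(small). It is best applied to the em and small elements individually and not to the whole td, because textContent concatenates child text with no separator (the first stat cell of the Salzburg row would read '4-3-1-257%').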

Element.text is missing data
