Question

I have a table on a website that looks very much like this:

<table class="table-class">
  <thead>
    <tr>
      <th>Col 1</th>
      <th>Col 2</th>
      <th>Col 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
     <td>Hello</td>
     <td>A number</td>
     <td>Another number</td>
   </tr>
   <tr>
     <td>there</td>
     <td>A number</td>
     <td>Another number</td>
   </tr>
  </tbody>
</table>

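Ultimately, what I want to do is read the contents of the td cells in each row and produce, for each row, a string containing all three cells. I would also like this to scale to larger tables from many websites that use the same design, so speed is somewhat of a priority, though not a necessity.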

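I assume I have to use something like find_elements_by_xpath(...) or similar, but I have really hit a wall here. I have tried several approaches suggested on other sites and seem to be getting more wrong than right. Any suggestions or ideas would be greatly appreciated!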

What I have right now, which does not work and is based on another question here, is:

listoflist = [[td.text
                for td in tr.find_elements_by_xpath('td')]
                for tr in driver.find_elements_by_xpath("//table[@class='table-class')]//tr"]
listofdict = [dict(zip(list_of_lists[0],row)) for row in list_of_lists[1:]]

Thanks in advance!

vham

Answer 1

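If you are familiar with the DOM (Document Object Model), you can use the answer in this post and load the HTML into a DOM with the BeautifulSoup library. After that, you just look for each <tr> and, within each one, find all of its <td> tags. Think of the DOM as a tree structure in which every nested tag is a branch.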
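The idea above can be sketched roughly as follows. This is only a minimal illustration, assuming the table carries the class table-class from the question; the numeric cell values and the single-space separator are placeholders chosen for the example, not something taken from the real site.

```python
from bs4 import BeautifulSoup

# Placeholder markup modelled on the table in the question.
html = """
<table class="table-class">
  <tbody>
    <tr><td>Hello</td><td>1</td><td>2</td></tr>
    <tr><td>there</td><td>3</td><td>4</td></tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")

row_strings = []
# Restrict the search to the table of interest, then walk its rows.
table = soup.find("table", class_="table-class")
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # a header row holds only <th> tags, so it yields no <td> cells
        row_strings.append(" ".join(cells))

print(row_strings)  # ['Hello 1 2', 'there 3 4']
```

Each entry in row_strings is one string per table row, which is what the question asks for.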

Answer 2

Depending on the website you are trying to access, you may not need Selenium at all. You can simply fetch the HTML with requests.

For the HTML you provided, you can extract the table information with BeautifulSoup as follows:

from bs4 import BeautifulSoup

html = """
<table class="table-class">
  <thead>
    <tr>
      <th>Col 1</th>
      <th>Col 2</th>
      <th>Col 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
     <td>Hello</td>
     <td>A number</td>
     <td>Another number</td>
   </tr>
   <tr>
     <td>there</td>
     <td>A number</td>
     <td>Another number</td>
   </tr>
  </tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")
rows = []

for tr in soup.find_all('tr'):
    cols = []
    for td in tr.find_all(['td', 'th']):
        td_text = td.get_text(strip=True)
        if len(td_text):
            cols.append(td_text)
    rows.append(cols)

print(rows)

This leaves rows holding:

[['Col 1', 'Col 2', 'Col 3'], ['Hello', 'A number', 'Another number'], ['there', 'A number', 'Another number']]
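From a list like this, the per-row strings and the header-keyed dictionaries that the question was aiming for can be derived along these lines. This is a sketch only; the names header, body, records and row_strings, and the ", " separator, are illustrative choices rather than part of the original answer.

```python
# The rows list as produced by the parsing loop above.
rows = [
    ['Col 1', 'Col 2', 'Col 3'],
    ['Hello', 'A number', 'Another number'],
    ['there', 'A number', 'Another number'],
]

# Treat the first row as the header and the rest as data.
header, body = rows[0], rows[1:]

# One dict per data row, keyed by the column titles.
records = [dict(zip(header, row)) for row in body]

# One string per data row containing all of its cells.
row_strings = [", ".join(row) for row in body]

print(records)
print(row_strings)
```

Note that zip stops at the shorter sequence, so a row with missing cells would silently produce a dict with fewer keys.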

To use requests, you would start with something like:

import requests

response = requests.get(url)
html = response.text
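If the table only appears after JavaScript has run, requests alone may not be enough and Selenium may still be needed. One possible approach, sketched here under that assumption, is to let Selenium render the page and then hand driver.page_source to BeautifulSoup, which avoids one WebDriver call per cell. The helper name table_rows_from_driver and its table_class parameter are hypothetical, introduced only for this example. (In Selenium 4, find_elements_by_xpath has been replaced by driver.find_elements(By.XPATH, ...), should you prefer to stay with WebDriver lookups.)

```python
from bs4 import BeautifulSoup


def table_rows_from_driver(driver, table_class="table-class"):
    """Return the non-empty cell texts of each row of the first matching table.

    `driver` is any object exposing a `page_source` string, such as a
    Selenium WebDriver after `driver.get(url)` has been called.
    """
    soup = BeautifulSoup(driver.page_source, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return []  # no table with that class on the page

    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["td", "th"])]
        cells = [c for c in cells if c]  # drop empty cells
        if cells:
            rows.append(cells)
    return rows


# Hypothetical usage with a real browser session:
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get(url)
# rows = table_rows_from_driver(driver)
```

Because the helper only reads page_source, the same function can be reused across sites that share the table design.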

How do I read the text in HTML table cells using Python and Selenium?