Python: parsing tables from HTML with BeautifulSoup

Asked: 2019-07-18 21:00:35

Tags: python python-3.x beautifulsoup

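I am trying to extract tables from multiple HTML files. Ideally I would end up with the rows and columns in lists so that I can process them further. I am new to BeautifulSoup and cannot get this to work. I think the main problem arises when a function returns None, which makes further processing impossible. I tried an if statement, but that did not help. My current code: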

from bs4 import BeautifulSoup
table_dict = {}
for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    table_body = table.find('tbody')
    if table_body is not None:
        tables = table_body

    rows = tables.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])

    table_dict[filename] = cols
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-304-14ade2e7b2ac> in <module>()
      7         tables = table_body
      8 
----> 9     rows = tables.find_all('tr')
     10     for row in rows:
     11         cols = row.find_all('td')

AttributeError: 'str' object has no attribute 'find_all'


1 answer:

Answer 0 (score: 0)

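According to your error message, the problem is that the variable tables is a string. It is only assigned inside the if block, so whenever a table has no tbody it keeps whatever value it already had, which here was a str. Try it without the tbody lookup: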

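for filename, text in tqdm(lowercase_dict.items()):
    soup = BeautifulSoup(text, "lxml")
    table = soup.find('table')
    if table is None:  # soup.find returns None when the file has no table
        continue
    rows = table.find_all('tr')  # find_all is recursive, so rows inside a tbody are matched too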
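
To get from there to the lists of rows and columns you asked for, here is a minimal sketch of the whole loop. The function name extract_table_rows and the sample_files dict are made up for illustration (sample_files stands in for your lowercase_dict), and I use the built-in "html.parser" so the snippet has no extra dependency; "lxml" as in your code should work the same way here.

```python
from bs4 import BeautifulSoup


def extract_table_rows(html):
    """Return the first table in `html` as a list of rows (lists of cell strings)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        # soup.find returns None when nothing matches, so stop early
        return []
    rows = []
    for tr in table.find_all("tr"):
        # collect header cells (th) as well as data cells (td)
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["td", "th"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical stand-in for lowercase_dict: filename -> HTML text
sample_files = {
    "a.html": "<table><tbody><tr><td>1</td><td>2</td></tr></tbody></table>",
    "b.html": "<table><tr><th>x</th><th>y</th></tr><tr><td>3</td><td>4</td></tr></table>",
    "c.html": "<p>no table here</p>",
}

# Store every row of each file, not just the cells of the last row
table_dict = {name: extract_table_rows(text) for name, text in sample_files.items()}
```

Note that this stores all rows per file, whereas table_dict[filename] = cols in your loop keeps only the cells of the last row. Since find_all searches recursively, rows of tables nested inside the first table would be picked up as well, so you may need to adapt it if your files contain nested tables.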