Using Beautiful Soup to extract correctly formatted text (with spaces between words) from <td> tags

Date: 2019-07-02 09:05:13

Tags: python html python-3.x beautifulsoup

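I am trying to extract the column headers from one of the tables in ABBV's 10-K SEC filing (the "Issuer Purchases of Equity Securities" table on page 25, below the chart).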

Inside the <tr> of the column headers, within each <td>, the text sits in separate <div> tags, as in the example below:

<tr>
  <td>
    <div>string1</div>
    <div>string2</div>
    <div>string3</div>
  </td>
</tr>

When I extract all of the text through the <td> tag, there are no spaces between the pieces of text (for the HTML above, the output is string1string2string3, where string1 string2 string3 is expected).

I am using the code below to extract the column headers from the table:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data=[]
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.text.encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])

print(table_data)

Output:

[['(a) TotalNumberof Shares(or Units)Purchased', '', '(b) AveragePricePaid per Share(or Unit)']]

Expected output: the same headers, with each word separated by a single space.

1 Answer:

Answer 0 (score: 2)

Use the separator parameter with .get_text():

html = '''<tr>
  <td>
    <div>string1</div>
    <div>string2</div>
    <div>string3</div>
  </td>
</tr>'''

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

td = soup.find('td')
# Join the text of every string inside the <td> with a single space
print(td.get_text(separator=' '))
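One detail worth checking with this example: when the markup is indented, the whitespace between the <div> tags is itself text, so `separator=' '` alone leaves newlines and extra spaces in the result. The sketch below compares three calls on the same sample markup; the variable names `plain`, `spaced`, and `clean` are only illustrative, and `strip=True` is the standard `get_text()` option that trims each piece and drops the empty ones.

```python
from bs4 import BeautifulSoup

html = """<tr>
  <td>
    <div>string1</div>
    <div>string2</div>
    <div>string3</div>
  </td>
</tr>"""

td = BeautifulSoup(html, "html.parser").find("td")

# Default: strings are concatenated exactly as they appear in the markup
plain = td.get_text()

# separator=' ' puts a space between every string, including the
# whitespace-only strings that come from the indentation
spaced = td.get_text(separator=" ")

# strip=True trims each string and skips the empty ones before joining
clean = td.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
print(repr(clean))  # 'string1 string2 string3'
```

If the real page has no whitespace between the <div> tags, `separator=' '` on its own is enough; adding `strip=True` simply makes the result independent of how the HTML happens to be indented.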

Here is how your code would look:

from bs4 import BeautifulSoup
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1551152/000155115218000014/abbv-20171231x10k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
table = soup.find_all('table')[76]
rows = table.find_all('tr')
table_data = []
for tr in rows[2:3]:
    row_data=[]
    cells = tr.find_all(['td', 'th'], recursive=False)
    for cell in cells[1:4]:
        row_data.append(cell.get_text(separator=' ').encode('utf-8'))
    table_data.append([x.decode('utf-8').strip() for x in row_data])

print(table_data)

Output:

[['(a) Total Number of Shares (or Units) Purchased', '', '(b) Average Price Paid per Share (or Unit)']]
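
The same idea can be tried without downloading the filing. The sketch below is only an illustration under stated assumptions: the `sample` markup is hypothetical (it imitates a header row and is not copied from the SEC page), and `cell_text` is a helper name introduced here. It also collapses any run of whitespace, including non-breaking spaces, into a single space, which replaces the encode/decode/strip steps.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a header row; not copied from the filing.
sample = (
    "<table><tr>"
    "<td>Period</td>"
    "<td><div>(a) Total</div><div>Number</div><div>of Shares</div>"
    "<div>(or Units)</div><div>Purchased</div></td>"
    "<td></td>"
    "<td><div>(b) Average</div><div>Price</div><div>Paid per Share</div>"
    "<div>(or\xa0Unit)</div></td>"
    "</tr></table>"
)


def cell_text(cell):
    """Return the cell's text with exactly one space between words."""
    # get_text(separator=' ') keeps the <div> contents apart; split() with
    # no arguments then breaks on any whitespace run (newlines, tabs,
    # non-breaking spaces) so the join yields single spaces only.
    return " ".join(cell.get_text(separator=" ").split())


soup = BeautifulSoup(sample, "html.parser")
row = soup.find("tr")
cells = row.find_all(["td", "th"], recursive=False)

# Same slice as in the code above: skip the first column, take the next three
headers = [cell_text(cell) for cell in cells[1:4]]
print(headers)
```

An empty cell comes back as an empty string, so the column positions stay aligned with the table.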