How to get the <text> tag from an HTML document using Beautiful Soup

Date: 2019-06-26 05:42:27

Tags: python html python-3.x beautifulsoup

How to get the <text> tag from the HTML document of the Abbot lab 10k filing using Beautiful Soup

I want to extract the tag names of all child tags of the <text></text> tag, using the following code:

from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

However, the output I get from the code above is ['html'].

Expected output:
  ['p','p','p','p','p','p','div','div','font','font', etc......]
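
A minimal sketch of how the ['html'] result can arise. It assumes, without having checked the real filing, that the document nests an <html> element directly inside <text>; the `sample` string below is a made-up stand-in, not the actual filing. With "html.parser" the tree is kept as written, so `.contents` only sees that single direct child:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the filing (assumption): <html> nested inside <text>.
sample = (
    "<document><text><html><body>"
    "<p>one</p><div><font>two</font></div>"
    "</body></html></text></document>"
)

soup = BeautifulSoup(sample, "html.parser")
text_tag = soup.find("text")

# .contents lists only the direct children of <text>, not deeper descendants.
direct_names = [child.name for child in text_tag.contents if isinstance(child, Tag)]
print(direct_names)  # ['html']
```

If the real document has this shape, the <p>, <div> and <font> tags are descendants of <text> rather than direct children, which is why they do not show up in `.contents`.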

2 Answers:

Answer 0: (score: 1)

You can use a CSS selector (this prints all descendants of the text tag):

for child in all_text.select('text *'):
    print(child.name, end=' ')

Prints:

br p font font b p font b br p font b div div ...

EDIT: To print only the direct children of the text tag, you can use:

from bs4 import BeautifulSoup
import requests

url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'

htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")

for child in soup.select('text > *'):
    print(child.name, end=' ')
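
To see the difference between the two selectors without fetching the filing, here is a small offline sketch. The `sample` markup is invented for illustration, and "html.parser" is used only so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not taken from the filing.
sample = "<text><p>one <b>bold</b></p><div><font>two</font></div></text>"
soup = BeautifulSoup(sample, "html.parser")

# 'text *'   : descendant combinator, matches every tag nested anywhere inside <text>.
# 'text > *' : child combinator, matches only tags directly under <text>.
descendants = [tag.name for tag in soup.select("text *")]
children = [tag.name for tag in soup.select("text > *")]

print(descendants)  # ['p', 'b', 'div', 'font']
print(children)     # ['p', 'div']
```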

Answer 1: (score: 0)

Replace this part of your code:

all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)

with:

all_tags = [x.name for x in all_text.findChildren() if x.name is not None]
print(all_tags)

See the findChildren() documentation for more details.
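
Note that findChildren() is a legacy alias of find_all() in bs4, and it returns every descendant tag, not just direct children. A short sketch on invented markup (the `sample` string is an assumption, not the filing) showing both behaviours with find_all():

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not taken from the filing.
sample = "<text><p>one <b>bold</b></p><div><font>two</font></div></text>"
soup = BeautifulSoup(sample, "html.parser")
text_tag = soup.find("text")

# find_all(True) matches every tag at any depth below <text>.
all_names = [tag.name for tag in text_tag.find_all(True)]

# recursive=False restricts the search to direct children only.
direct_names = [tag.name for tag in text_tag.find_all(True, recursive=False)]

print(all_names)     # ['p', 'b', 'div', 'font']
print(direct_names)  # ['p', 'div']
```

Since find_all() only returns Tag objects, the `if x.name is not None` filter is not needed with it.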