How to get the <text> tag from an HTML document (Abbott Laboratories 10-K filing) using Beautiful Soup
I want to extract the tag names of all child tags of the <text></text> tag using the following code:
from bs4 import BeautifulSoup
import urllib.request
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = urllib.request.urlopen(url)
soup = BeautifulSoup(htmlpage, "html.parser")
all_text = soup.find('text')
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)
But the output I get from the code above is ['html'].
Expected output:
['p','p','p','p','p','p','div','div','font','font', etc......]
Answer 0 (score: 1)
You can use a CSS selector (this prints all descendants of the <text> tag):
for child in all_text.select('text *'):
    print(child.name, end=' ')
Prints:
br p font font b p font b br p font b div div ...
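To see why `.contents` returned only ['html'] while the selector returns many tags, here is a minimal sketch on an inline sample. The `SAMPLE` string is a hypothetical stand-in that assumes the filing wraps a complete nested <html> document inside <text>; it is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the filing: <text> wraps a nested <html> document.
SAMPLE = (
    "<document><text><html><body>"
    "<p><font><b>Item 1</b></font></p>"
    "<div><p>Business</p></div>"
    "</body></html></text></document>"
)

soup = BeautifulSoup(SAMPLE, "html.parser")
text_tag = soup.find("text")

# .contents lists only direct children; here that is the single nested <html> tag.
direct_names = [child.name for child in text_tag.contents if child.name is not None]

# 'text *' matches every tag at any depth below <text>, in document order.
descendant_names = [tag.name for tag in soup.select("text *")]

print(direct_names)
print(descendant_names)
```

The first list stops at depth one, which is what the question's code does; the second walks the whole subtree.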
Edit: To print only the direct children of the <text> tag, you can use:
from bs4 import BeautifulSoup
import requests
url ='https://www.sec.gov/Archives/edgar/data/1800/000104746919000624/a2237733z10-k.htm'
htmlpage = requests.get(url)
soup = BeautifulSoup(htmlpage.text, "lxml")
for child in soup.select('text > *'):
    print(child.name, end=' ')
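As a rough illustration of the child combinator, the sketch below compares `select('text > *')` with `find_all(True, recursive=False)`, which should yield the same direct children. The sample markup is hypothetical and uses "html.parser" so that it does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the tag names are illustrative, not taken from the filing.
SAMPLE = "<text><p>one <b>bold</b></p><div><font>two</font></div><p>three</p></text>"

soup = BeautifulSoup(SAMPLE, "html.parser")

# CSS child combinator: only tags whose parent is <text>.
via_select = [tag.name for tag in soup.select("text > *")]

# True matches any tag; recursive=False restricts the search to direct children.
via_find_all = [tag.name for tag in soup.find("text").find_all(True, recursive=False)]

print(via_select)
print(via_find_all)
```

The nested <b> and <font> tags are skipped in both lists because they sit one level deeper.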
Answer 1 (score: 0)
Replace this part of your code:
all_tags = all_text.contents
all_tags = [x.name for x in all_tags if x.name is not None]
print(all_tags)
with:
all_tags = [x.name for x in all_text.findChildren() if x.name is not None]
print(all_tags)
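A small sketch of the difference, on a hypothetical inline sample: `.contents` mixes tags with text nodes (whose `name` is None) and stays at depth one, while `find_all()` with no arguments walks all descendants and returns only tags. `findChildren` is an older alias of `find_all`, so the sketch uses the current name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with whitespace between tags, as in most real pages.
SAMPLE = "<text>\n<p>intro</p>\n<div><font>body</font></div>\n</text>"

soup = BeautifulSoup(SAMPLE, "html.parser")
text_tag = soup.find("text")

# Direct children only; text nodes show up with name None.
raw_names = [node.name for node in text_tag.contents]

# Every descendant tag; text nodes are never returned, so no None filter is needed.
tag_names = [tag.name for tag in text_tag.find_all()]

print(raw_names)
print(tag_names)
```

Under that reading, the `if x.name is not None` filter in the replacement line is harmless but redundant.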