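I'm working on a project using Python (3.7) and BeautifulSoup (4) in which I need to scrape some data without knowing the exact structure of the HTML, on the assumption that the user-relevant information will be inside heading, paragraph, `pre` and `code` tags. After calling `find_all` for these tags, I want to separate the heading and paragraph tags from the `code` and `pre` tags in the returned ResultSet object.

Here's what I have tried: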

    import requests
    from bs4 import BeautifulSoup

    required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
    text_outputs = []
    code_outputs = []
    pages = [
        "https://bugs.launchpad.net/bugs/1803780",
        "https://bugs.launchpad.net/bugs/1780224",
        "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
        "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
        "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
        "/html/keystone/_modules/keystone/assignment/core.html"
    ]
    page = requests.get(pages[0])
    html_text = BeautifulSoup(page.text, 'html.parser')
    text = html_text.find_all(required_tags)
    elements = []
    for e in html_text:
        elements.append(e.parent)
    for t in text:
        for e in elements:
            if e == 'code' or e == 'pre':
                print(e)
                code_outputs.append(t.get_text())
            else:
                text_outputs.append(t.get_text())

But it doesn't return anything in `text_outputs` and `code_outputs`.

Thanks!
Answer 0 (score: 0)

Simply get the parent's name from the element, like

    t.parent.name == 'code'

rather than creating a list of parent elements.
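To make the suggestion above concrete, here is a minimal sketch of what `t.parent.name` yields for each matched tag. The HTML string is a made-up sample, not taken from any of the pages in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<div><h1>Title</h1><p>Intro</p><pre><code>x = 1</code></pre></div>"
soup = BeautifulSoup(html, "html.parser")

# Pair each matched tag's own name with the name of its direct parent.
parent_names = [
    (t.name, t.parent.name)
    for t in soup.find_all(["h1", "p", "pre", "code"])
]
print(parent_names)
```

Note that only the `code` tag nested inside `pre` reports a code-like parent here, so this check is useful for spotting nested elements, while `t.name` tells you what the tag itself is.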
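Answer 1 (score: 0)

You aren't getting any data because you iterate over an extra inner for loop that isn't needed:

    for e in elements:
        if e == 'code' or e == 'pre':

Look at the condition above: inside the loop over the child tags you iterate over the parent tags again, and you compare a tag object with a string, which never matches. The `pre` and `code` tags are already in the `text` list, so you can check each tag's name directly.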

    for page in pages:
        res = requests.get(page)
        html_text = BeautifulSoup(res.text, 'html.parser')
        text = html_text.find_all(required_tags)
        for t in text:
            if t.name == 'code' or t.name == 'pre':
                print("===if===")
                code_outputs.append(t.get_text())
            else:
                print("===else===")
                text_outputs.append(t.get_text())
    print(code_outputs)
    print(text_outputs)
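The comparison problem described in this answer can be seen in isolation. The following is a small sketch on a made-up one-tag document; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for illustration.
soup = BeautifulSoup("<pre>print('hi')</pre>", "html.parser")
tag = soup.find("pre")

# Comparing the Tag object itself with a string is never true...
tag_equals_string = (tag == "pre")

# ...whereas comparing the tag's name with a string works as intended.
name_equals_string = (tag.name == "pre")

print(tag_equals_string, name_equals_string)
```

This is why the condition in the question never selects anything as code, while `t.name == 'code' or t.name == 'pre'` does.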
Update:
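
    json_data = []
    for page in pages:
        res = requests.get(page)
        html_text = BeautifulSoup(res.text, 'html.parser')
        text = html_text.find_all(required_tags)
        # Start fresh lists for every page; otherwise all entries would
        # share the same two lists and accumulate results across pages.
        code_outputs = []
        text_outputs = []
        for t in text:
            if t.name == 'code' or t.name == 'pre':
                code_outputs.append(t.get_text())
            else:
                text_outputs.append(t.get_text())
        # Note: `text` is a ResultSet of Tag objects; convert the tags with
        # str() first if this structure is to be serialized with json.dumps.
        data = {page: {"html": text, "code_outputs": code_outputs, "text_outputs": text_outputs}}
        json_data.append(data)
    print(json_data)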
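As a rough sketch of how the per-page structure above could be produced in a form that serializes to JSON, the helper below works on an HTML string rather than a live request. The function name `split_outputs`, the constants, and the sample markup are assumptions made for this illustration, not part of the original answer.

```python
import json

from bs4 import BeautifulSoup

REQUIRED_TAGS = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
CODE_TAGS = {"pre", "code"}


def split_outputs(html):
    """Return the text of code-like tags and of all other required tags."""
    soup = BeautifulSoup(html, "html.parser")
    # A new dict (and new lists) per call keeps each page's results separate.
    result = {"code_outputs": [], "text_outputs": []}
    for tag in soup.find_all(REQUIRED_TAGS):
        key = "code_outputs" if tag.name in CODE_TAGS else "text_outputs"
        result[key].append(tag.get_text())
    return result


# Hypothetical sample markup standing in for a downloaded page.
sample = "<h1>Bug</h1><p>Steps</p><pre>ls -l</pre>"
record = {"sample-page": split_outputs(sample)}

# Only plain strings are stored, so json.dumps accepts the structure.
serialized = json.dumps(record)
print(serialized)
```

In a real run you would pass `res.text` for each page instead of `sample` and use the page URL as the key.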
Answer 2 (score: 0)

You can try this:

    import requests
    from bs4 import BeautifulSoup

    required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
    text_outputs = []
    code_outputs = []
    pages = [
        "https://bugs.launchpad.net/bugs/1803780",
        "https://bugs.launchpad.net/bugs/1780224",
        "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
        "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
        "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
        "/html/keystone/_modules/keystone/assignment/core.html"
    ]
    # verify=False disables TLS certificate verification for this request.
    page = requests.get(pages[2], verify=False)
    html_text = BeautifulSoup(page.text, 'html.parser')
    # Search one tag name at a time and sort the text into the two lists.
    for tag in required_tags:
        data = [dat.text for dat in html_text.find_all(tag)]
        if tag == "code" or tag == "pre":
            code_outputs += data
        else:
            text_outputs += data
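One difference between this per-tag loop and a single `find_all` call with the whole list is the order of the results. The sketch below, on a made-up snippet, compares the two; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = "<p>first</p><h1>second</h1><p>third</p>"
soup = BeautifulSoup(html, "html.parser")
tags = ["h1", "p"]

# One call with a list of names returns matches in document order.
document_order = [t.get_text() for t in soup.find_all(tags)]

# One call per tag name groups the matches by tag type instead.
grouped_by_tag = []
for name in tags:
    grouped_by_tag += [t.get_text() for t in soup.find_all(name)]

print(document_order)
print(grouped_by_tag)
```

If the original reading order of headings and paragraphs matters, the single `find_all(required_tags)` call combined with a check on `t.name` is probably the better fit.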