Separating elements from a BeautifulSoup ResultSet

Time: 2019-06-18 05:31:55

Tags: python python-3.x web-scraping beautifulsoup

I'm working on a project using Python (3.7) and BeautifulSoup (4) in which I need to scrape some data without knowing the exact structure of the HTML, assuming only that the information relevant to the user will be in headings, paragraph, pre and code tags. After calling find_all for these tags, I want to separate the headings and paragraph tags from the code and pre tags in the resulting ResultSet object.

Here's what I have tried:

required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
text_outputs = []
code_outputs = []
pages = [
    "https://bugs.launchpad.net/bugs/1803780",
    "https://bugs.launchpad.net/bugs/1780224",
    "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
    "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
    "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
    "/html/keystone/_modules/keystone/assignment/core.html"
]
page = requests.get(pages[0])
html_text = BeautifulSoup(page.text, 'html.parser')
text = html_text.find_all(required_tags)
elements = []
for e in html_text:
    elements.append(e.parent)
for t in text:
    for e in elements:
        if e == 'code' or e == 'pre':
            print(e)
            code_outputs.append(t.get_text())
        else:
            text_outputs.append(t.get_text())

But it doesn't return anything in code_outputs.

Thanks!

3 Answers:

Answer 0: (score: 0)

Just get the parent's name from the element, like so:

t.parent.name =='code'

instead of building a list of parent elements.
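To see what the two attributes return, here is a minimal sketch on an inline HTML snippet (the snippet is made up for illustration and does not come from any of the pages in the question). `t.name` is the tag's own name, while `t.parent.name` is the name of the enclosing element, so which check fits depends on how the page nests its markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<div><p>intro</p><pre><code>x = 1</code></pre></div>"
soup = BeautifulSoup(html, "html.parser")

# Pair each matched tag's own name with its parent's name, in document order.
names = [(t.name, t.parent.name) for t in soup.find_all(["p", "pre", "code"])]
print(names)
```

In this snippet only the code tag has a parent named pre, and no tag has a parent named code, so a check on `t.parent.name` would select different elements than a check on `t.name`.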

Answer 1: (score: 0)

You aren't getting any data because you iterate over an extra inner for loop that isn't needed:

 for e in elements:
     if e == 'code' or e == 'pre': 

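Look at the condition above: you iterate over the parent tags inside the loop over the child tags, and you compare a tag object with a string. You already have the pre tag data in the text list object.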
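The difference between comparing the tag object and comparing its name can be checked with a short sketch like the following (the one-line HTML string is a made-up example):

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document, used only for illustration.
soup = BeautifulSoup("<code>print('hi')</code>", "html.parser")
tag = soup.find("code")

# Comparing the Tag object itself with a string does not match.
tag_equals_string = (tag == "code")
# Comparing the tag's name (a plain string) with a string does.
name_equals_string = (tag.name == "code")

print(tag_equals_string, name_equals_string)
```

This is why the condition in the question never takes the code branch, and why the loop below tests `t.name` instead.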

for page in pages:
    res = requests.get(page)
    html_text = BeautifulSoup(res.text, 'html.parser')
    text = html_text.find_all(required_tags)   
    for t in text:
        if t.name == 'code' or t.name == 'pre':
            print("===if===")
            code_outputs.append(t.get_text())
        else:
            print("===else===")
            text_outputs.append(t.get_text())

print(code_outputs)
print(text_outputs)

Update

json_data = []
for page in pages:
    res = requests.get(page)
    html_text = BeautifulSoup(res.text, 'html.parser')
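    text = html_text.find_all(required_tags)
    # Start with fresh lists for every page so each entry holds only that page's data.
    code_outputs = []
    text_outputs = []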
    for t in text:
        if t.name == 'code' or t.name == 'pre':
            code_outputs.append(t.get_text())
        else:
            text_outputs.append(t.get_text())

    data = {page:{"html":text,"code_outputs":code_outputs,"text_outputs":text_outputs}}
    json_data.append(data)

print(json_data)
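If the goal is to dump this structure with the json module, note that Tag objects are not JSON serializable, so the "html" entry would need to hold strings. A minimal sketch under that assumption follows; `build_record` is a hypothetical helper name, and the sample HTML is made up so the example runs without network access.

```python
import json
from bs4 import BeautifulSoup

REQUIRED_TAGS = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
CODE_TAGS = {"pre", "code"}


def build_record(html):
    """Hypothetical helper: split one page's HTML into JSON-friendly lists."""
    soup = BeautifulSoup(html, "html.parser")
    record = {"html": [], "code_outputs": [], "text_outputs": []}
    for t in soup.find_all(REQUIRED_TAGS):
        record["html"].append(str(t))  # keep the markup as a plain string
        key = "code_outputs" if t.name in CODE_TAGS else "text_outputs"
        record[key].append(t.get_text())
    return record


# Made-up sample page, standing in for res.text.
sample = "<h1>Title</h1><p>Some text</p><pre>x = 1</pre>"
record = build_record(sample)
print(json.dumps(record, indent=2))
```

In the loop above, the same helper could be called as `build_record(res.text)` for each page.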

Answer 2: (score: 0)

You can try the following:

import requests
from bs4 import BeautifulSoup

required_tags = ["h1", "h2", "h3", "h4", "h5", "pre", "code", "p"]
text_outputs = []
code_outputs = []
pages = [
        "https://bugs.launchpad.net/bugs/1803780",
        "https://bugs.launchpad.net/bugs/1780224",
        "https://docs.openstack.org/keystone/pike/_modules/keystone/assignment/core.html",
        "https://openstack-news.blogspot.com/2018/11/bug-1803780-confusing-circular.html",
        "https://www.suse.com/documentation/suse-openstack-cloud-9/doc-cloud-upstream-user/user"
        "/html/keystone/_modules/keystone/assignment/core.html"
    ]


page = requests.get(pages[2], verify=False)


html_text = BeautifulSoup(page.text, 'html.parser')
elements = {}


for tag in required_tags:
    data=list(html_text.find_all(tag))
    data = [dat.text for dat in data]
    if tag == "code" or tag=="pre":
        code_outputs+=data
    else:
        text_outputs+=data
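One design point worth checking with this approach: calling find_all once per tag name groups the results by tag, whereas a single find_all call with the whole list keeps them in document order. A small sketch on made-up HTML illustrates the difference; which ordering is preferable depends on how the outputs are used later.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<h1>A</h1><p>b</p><h2>C</h2><p>d</p>"
soup = BeautifulSoup(html, "html.parser")
tags = ["h1", "h2", "p"]

# One find_all call per tag name: results are grouped by tag.
grouped = [t.get_text() for name in tags for t in soup.find_all(name)]

# A single find_all call with the list: results follow document order.
in_order = [t.get_text() for t in soup.find_all(tags)]

print(grouped)
print(in_order)
```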