Python and BeautifulSoup: problem removing an empty tag from a soup object

Date: 2019-01-11 20:41:25

Tags: python web-scraping beautifulsoup

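Long-time SO user, though I only recently created an account. This is my second question here. I am fairly new to Python, but I have programming experience, and I am very new to web scraping.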

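The problem

I wrote a function to download a series of HTML files that all share a very similar format. I then parse the HTML files with BeautifulSoup and finally load the data into SQL tables. I am doing a gap analysis on the columns/tables to see how much they differ. I am trying to read a certain HTML tag, and in some cases there is an extra set of empty tags. All I really want to do is remove this extra entry and move on. I tried the decompose() function, and I also tried referencing the value by index and deleting it.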

  

<dt class="dlterm"></dt>

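This throws off my columns, because I later try to store the column name, data type, and description together as a record. I cannot figure out how to remove it and continue parsing the file.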
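To make the misalignment concrete, here is a minimal sketch using plain lists in place of the parsed tags. The list contents are copied from the HTML snippet in this post; the variable names are only illustrative.

```python
# Text values as they would come out of the three find_all() calls
# for the sample HTML: four <dt> entries (one empty), three of everything else.
names = ["Term1", "Term2", "", "Term3"]
types = ["nonNegativeInteger", "nonNegativeInteger", "nonNegativeInteger"]
descs = [
    "Blah blah blah about term1",
    "Blah blah blah about term2",
    "Blah blah about term3",
]

# zip() pairs items by position and stops at the shortest list,
# so the empty name takes Term3's slot and "Term3" itself is dropped.
records = list(zip(names, types, descs))

for record in records:
    print(record)
```

The third record ends up with an empty name next to the description of term3, which is the shifted column described above.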

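I can get Python to find <dt class="dlterm"></dt>, and I tried the decompose() and pop() methods. I was even thinking of introducing an offset, setting it to 1 when the empty tag is found, and then somehow shifting the rest of the code by 1 for that loop iteration.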

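One solution I already have working is to sidestep the problem entirely by opening the source file and replacing the <dt class="dlterm"></dt> tag before trying to read the file with BeautifulSoup. To borrow a term from an old colleague, this is the "troublesome way out". It works, but it seems like a lot of code for a simple problem.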
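A minimal sketch of what that pre-processing workaround might look like is below. It is an illustration, not the code actually used: the inline raw_html string stands in for the contents of the downloaded file, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Stand-in for the text read from the downloaded HTML file (assumption).
raw_html = (
    '<dl><dt class="dlterm">Term2</dt><dd class="ddexpand">about term2</dd>'
    '<dt class="dlterm"></dt><dt class="dlterm">Term3</dt>'
    '<dd class="ddexpand">about term3</dd></dl>'
)

# Remove the empty tag as plain text before the parser ever sees it.
# This only matches this exact spelling of the tag.
cleaned_html = raw_html.replace('<dt class="dlterm"></dt>', "")

soup = BeautifulSoup(cleaned_html, "html.parser")
names = [dt.text for dt in soup.find_all("dt", {"class": "dlterm"})]
print(names)
```

The string match is brittle: an empty tag written with different attribute order or with whitespace inside it would slip through.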

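Question

I thought the soup object was a list, but apparently it does not behave like one? What is the proper term for the soup object?

Python snippet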

from bs4 import BeautifulSoup

# Load the cursor/recordset
myrecordset = mycursor.fetchall()

# Outer loop
for y in myrecordset:

    # NOTE: as posted, this string has no % placeholder, so the % operator
    # would raise a TypeError when y[2] is a string
    myfilepath = "myexample.html" % y[2]
    soup = BeautifulSoup(open(myfilepath), "html.parser")

    PageName = soup.find("h1", {"class": "topictitle1"})

    # print ("PageName: " + PageName.text)
    FieldName = soup.find_all("dt", {"class": "dlterm"})
    FieldDataType = soup.find_all("samp", {"class": "codeph"})
    FieldDesc = soup.find_all("dd", {"class": "ddexpand"})
    # outercounter = -1
    #
    # #Fix the empty value issue early that is offsetting everything
    # for z in FieldName:
    #     outercounter+=1
    #     # FieldName[7].decompose()
    #     if z.text == '': # '<dt class="dlterm"></dt>':
    #         z.decompose()
    #
    #         # FieldName[outercounter-1].pop()

    # How to get the description cleaned up
    # FieldDesc[2].text.replace('\n','').replace('      ', ' ')
    # print(FieldName.text)
    # print(FieldDataType.text)
    # print(FieldDesc.text)

    # inner loop
    innercounter1 = 0
    # zip allows me to iterate through multiple lists at the same time
    for (fn, fdt, fd) in zip(FieldName, FieldDataType, FieldDesc):

        fntemp = ''
        fdttemp = ''
        fdtemp = ''

        fntemp = fn.text
        fdttemp = fdt.text

        # clean the string
        if 'One of:' in fd.text:
            # hold onto the double return while I replace the others.
            fdtemp = fd.text.replace('\n\n', '<<nn>>')
            fdtemp = fdtemp.replace('\n', ', ')
            fdtemp = fdtemp.replace('<<nn>>', '\n')
        else:
            fdtemp = fd.text.replace('\n', ' ')

        fdtemp = fdtemp.strip()

        # remove all redundant spaces from the string
        fdtemp = " ".join(fdtemp.split())
        # have to escape single quotes in text so it will insert correctly
        fdtemp = fdtemp.replace("'", "''")

        #Insert into SQL

        # ... code continued

Snippet of the HTML file showing the problem

<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>

If anyone could help me solve this, that would be great.

1 answer:

Answer 0 (score: 1)

decompose() is enough to solve your problem.

from bs4 import BeautifulSoup
html="""
<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
"""
soup=BeautifulSoup(html,'html.parser')
for tag in soup.find_all('dt',attrs={"class":"dlterm"}): #all dt tags with class dlterm
    if not tag.text: #if tag is empty
        tag.decompose()
print(soup)

Output

<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
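
On the side question of whether the soup object is a list: it is not. A BeautifulSoup object is a tree of Tag objects, while find_all() returns a ResultSet, which is a list subclass holding the matches found at the time of the call. That may be why calling decompose() after find_all() appeared to do nothing in the question's code: decompose() removes the tag from the tree, but the list collected earlier still holds the entry. The sketch below illustrates this on a shortened version of the sample HTML; the variable names are only for the example.

```python
from bs4 import BeautifulSoup, Tag

html = (
    '<dl><dt class="dlterm">Term1</dt>'
    '<dt class="dlterm"></dt>'
    '<dt class="dlterm">Term3</dt></dl>'
)
soup = BeautifulSoup(html, "html.parser")

# The soup itself is a tree node, not a list.
soup_is_tag = isinstance(soup, Tag)
soup_is_list = isinstance(soup, list)

# find_all() returns a list-like snapshot of the current matches.
stale = soup.find_all("dt", {"class": "dlterm"})
result_is_list = isinstance(stale, list)

# Remove empty <dt> tags from the tree (whitespace-only counts as empty here).
for tag in stale:
    if not tag.get_text(strip=True):
        tag.decompose()

# The old list still has three entries; only a fresh query reflects the change.
stale_count = len(stale)
fresh = soup.find_all("dt", {"class": "dlterm"})
fresh_names = [t.get_text() for t in fresh]

print(soup_is_tag, soup_is_list, result_is_list)
print(stale_count, fresh_names)
```

In the question's loop, this suggests running the decompose() pass first and only then calling find_all() for the names, data types, and descriptions that get zipped together.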