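Long-time SO user, but I only recently created an account. This is my second question here. I'm fairly new to Python, though I have programming experience, and I'm very new to web scraping.
The issue
I wrote a function that downloads a series of HTML files that all share a very similar format. I then parse the HTML files with BeautifulSoup and finally load the data into a SQL table. I'm doing a gap analysis on the columns/tables to see how much they differ. I'm trying to read a certain HTML tag, and in some cases there is an extra set of empty tags. All I really want to do is delete this extra entry and move on. I have tried the decompose() function, and I have also tried referencing the value by index and deleting it.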
<dt class="dlterm"></dt>
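This throws off my columns, because later on I store the column name, data type and description together as records. I can't figure out how to delete the empty tag and continue parsing the file.
I can get Python to find <dt class="dlterm"></dt> and have tried both the decompose() and pop() methods. I was even considering an offset: set it to 1 when the empty tag is found, then somehow shift the rest of the code by 1 for that loop iteration.
One solution I have gotten to work is to sidestep the problem entirely: open the source file and replace the <dt class="dlterm"></dt> tags before reading the file with BeautifulSoup. To borrow a term from an old coworker, this is the "troublesome way out". It works, but it seems like a lot of code for a simple problem.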
Question
I thought the soup object was a list, but apparently it isn't? What is the proper term for the soup object?
Python code snippet
# Load the cursor/recordset
myrecordset = mycursor.fetchall()
# Outer loop
for y in myrecordset:
    myfilepath = "myexample.html" % y[2]
    soup = BeautifulSoup(open(myfilepath),"html.parser")
    PageName = soup.find("h1",{"class":"topictitle1"})
    # print ("PageName: " + PageName.text)
    FieldName = soup.find_all("dt", {"class":"dlterm"})
    FieldDataType = soup.find_all("samp", {"class":"codeph"})
    FieldDesc = soup.find_all("dd", {"class":"ddexpand"})
    # outercounter = -1
    #
    # #Fix the empty value issue early that is offsetting everything
    # for z in FieldName:
    #     outercounter+=1
    #     # FieldName[7].decompose()
    #     if z.text == '': # '<dt class="dlterm"></dt>':
    #         z.decompose()
    #
    #     # FieldName[outercounter-1].pop()
    # How to get the description cleaned up
    # FieldDesc[2].text.replace('\n','').replace(' ', ' ')
    # print(FieldName.text)
    # print(FieldDataType.text)
    # print(FieldDesc.text)
    # inner loop
    innercounter1 = 0
    # zip allows me to iterate through multiple lists at the same time
    for (fn, fdt, fd) in zip(FieldName, FieldDataType, FieldDesc):
        fntemp= ''
        fdttemp= ''
        fdtemp= ''
        fntemp = fn.text
        fdttemp = fdt.text
        # clean the string
        if fd.text.__contains__('One of:'):
            # hold onto the double return while I replace the others.
            fdtemp = fd.text.replace('\n\n', '<<nn>>')
            fdtemp = fdtemp.replace('\n',', ')
            fdtemp = fdtemp.replace('<<nn>>', '\n')
        else:
            fdtemp = fd.text.replace('\n', ' ')
        fdtemp = fdtemp.strip()
        # remove all redundant spaces from the string
        fdtemp = " ".join(fdtemp.split())
        # have to escape single quotes in text so it will insert correctly
        fdtemp = fdtemp.replace("'", "''")
        #Insert into SQL
        # ... code continued
Snippet from the HTML file showing the issue
<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
If anyone could help me solve this, that would be great.
Answer 0 (score: 1)
decompose() is enough to solve your problem.
from bs4 import BeautifulSoup
html="""
<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
"""
soup = BeautifulSoup(html, 'html.parser')
for tag in soup.find_all('dt', attrs={"class": "dlterm"}):  # all dt tags with class dlterm
    if not tag.text:  # if the tag is empty
        tag.decompose()
print(soup)
Output
<div class="section">
<h2 class="sectiontitle">Title</h2>
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm">Term2</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term2</dd>
<dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl></div>
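On the "is the soup a list?" part of the question, and why decompose() may have looked like it did nothing in the original loop: the soup is a tree of tags, while find_all() returns a list-like ResultSet that is built once and is not updated when a tag is later removed from the tree. The following is only a minimal sketch on a shortened copy of the sample HTML. The names stale_names, field_names, field_types, field_descs and rows are made up for illustration, and it assumes the empty <dt> is the only thing misaligning the three lists.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the sample HTML from the question (illustrative only).
html = """
<dl>
<dt class="dlterm">Term1</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah blah about term1</dd>
<dt class="dlterm"></dt><dt class="dlterm">Term3</dt><dd><samp class="codeph">nonNegativeInteger</samp></dd><dd class="ddexpand">Blah blah about term3</dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# The soup itself is a tree node (BeautifulSoup subclasses Tag), not a list.
soup_is_tag = isinstance(soup, Tag)

# find_all() returns a ResultSet, a list subclass holding references to tags.
stale_names = soup.find_all("dt", {"class": "dlterm"})
results_are_list = isinstance(stale_names, list)

# Remove the empty <dt> tags from the tree.
for tag in stale_names:
    if not tag.get_text(strip=True):
        tag.decompose()

# decompose() does not shrink the earlier result list, so query again
# after the cleanup and only then pair the three lists up.
field_names = soup.find_all("dt", {"class": "dlterm"})
field_types = soup.find_all("samp", {"class": "codeph"})
field_descs = soup.find_all("dd", {"class": "ddexpand"})

rows = [
    (fn.get_text(strip=True), fdt.get_text(strip=True), " ".join(fd.get_text().split()))
    for fn, fdt, fd in zip(field_names, field_types, field_descs)
]
for row in rows:
    print(row)
```

In the question's code this would mean running the cleanup loop before the three find_all() calls that feed zip(), rather than decomposing tags out of a FieldName list that has already been collected.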