This is the structure of the HTML document I need to parse (see UPDATE 3):
<div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
<div class="Normal-P1">
<span class="Normal-C2">Main title 1<br></span></div>
<div class="Normal-P1">
<span class="Normal-C3">Optional Subtitle<br></span></div>
<div class="Normal-P1">
<span class="Normal-C3">Second Optional Subtitle</span>
<span class="Normal-C4">Text blurb 1.<br></span></div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 2.<br></span></div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 4.<br></span></div>
<span class="Normal-C4"><br></span></div>
<div class="Normal-P1">
<span class="Normal-C3"><br></span></div>
<div class="Normal-P1">
<span class="Normal-C2">Main title 2<br></span></div>
<div class="Normal-P1">
<span class="Normal-C3">Subtitle 1</span>
<span class="Normal-C4"> Other text blurb 1.<br></span></div>
<div class="Normal-P1">
<span class="Normal-C4"> Other text blurb 2.<br></span></div>
I want to generate a CSV file that looks like this:
Main Title     Optional Subtitle 1   Optional Subtitle 2        Text Blurb
------------   -------------------   ------------------------   -----------------------------------------
Main title 1   Optional Subtitle     Second Optional Subtitle   Text blurb 1. Text blurb 2. Text blurb 4.
Main title 2   Subtitle 1                                       Other text blurb 2.
I tried:
soup = BeautifulSoup(page,'xml')
divText = soup.find_all('div', {'class':'Normal-P1'})
for item in divText:
    spanTitle = soup.find_all('span',{'class':'Normal-C2'})
    spanOptional = soup.find_all('span',{'class':'Normal-C3'})
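However, this approach does not let me separate the Normal-P1 divs into groups so that I go from C2 to C4 and then start over. The C3 between a C2 and the following C4 is not always present. In either case, C4 is the final tag before the next C2.
I considered putting all the divs in a list and then splitting them into sub-lists at each C2 to process them. I am trying to find out whether there is a more elegant solution using bs4.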
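The list-splitting idea described above can be sketched without bs4 at all. This is only a minimal illustration, assuming the spans have already been flattened into (class, text) pairs in document order; the function name split_on_titles and the pairs list are hypothetical and do not come from the original code.

```python
def split_on_titles(pairs):
    """Group (css_class, text) pairs into one record per Normal-C2 title."""
    records = []
    current = None
    for css_class, text in pairs:
        text = text.strip()
        if css_class == "Normal-C2":
            # A main title always starts a new record.
            current = {"title": text, "subtitles": [], "blurbs": []}
            records.append(current)
        elif current is None or not text:
            # Skip anything before the first title and empty filler spans.
            continue
        elif css_class == "Normal-C3":
            current["subtitles"].append(text)
        elif css_class == "Normal-C4":
            current["blurbs"].append(text)
    return records


# Hypothetical input mirroring the sample document, in document order.
pairs = [
    ("Normal-C2", "Main title 1"),
    ("Normal-C3", "Optional Subtitle"),
    ("Normal-C3", "Second Optional Subtitle"),
    ("Normal-C4", "Text blurb 1."),
    ("Normal-C4", "Text blurb 2."),
    ("Normal-C4", "Text blurb 4."),
    ("Normal-C4", ""),
    ("Normal-C3", ""),
    ("Normal-C2", "Main title 2"),
    ("Normal-C3", "Subtitle 1"),
    ("Normal-C4", " Other text blurb 1."),
    ("Normal-C4", " Other text blurb 2."),
]

records = split_on_titles(pairs)
for record in records:
    print(record["title"], record["subtitles"], " ".join(record["blurbs"]))
```

Because a record only ends when the next C2 appears, a missing C3 simply leaves the subtitles list empty, and the empty filler spans are dropped.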
Update 1
I spoke a little too soon. I just looked at my output using the answer below and noticed a problem.
Looking at
titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    print "entry:",entry
    parent = entry.parent
    print "parent: ",parent
    subtitles = [
        subtitle.text for subtitle in
        parent.select(' ~ .Normal-P1 .Normal-C3')
    ]
    print "subtitles:",subtitles
I see that subtitles contains results from outside the parent (that is, from all the titles). The output looks like this:
entry: <span class="Normal-C2">Main title 1<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span></div>
subtitles: [Optional Subtitle,Second Optional Subtitle,Subtitle 1]
entry: <span class="Normal-C2">Main title 2<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span></div>
subtitles: [Subtitle 1]
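Update 2
parent.select(" ~ .Normal-P1 .Normal-C3") seems to be the cause of this problem.
The problem is in the HTML provided in the solution: <span class="Normal-C4"><br></span> </div>. It is missing the opening <div class="Normal-P1"> and the closing </div> at the end. After making these changes, I see the same problem with that sample HTML as well (all the subtitles in the document show up for an entry).
I double-checked the indentation and it looks fine to me. What am I doing wrong?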
Update 3
Here is the complete HTML:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Hello World</title>
</head>
<body>
<div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
<div class="Normal-P1">
<span class="Normal-C2">Main title 1<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">Optional Subtitle<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">Second Optional Subtitle</span>
<span class="Normal-C4">Text blurb 1.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 2.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 4.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4"><br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3"><br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C2">Main title 2<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">New Subtitle 1</span>
<span class="Normal-C4"> Other text blurb 1.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4"> Other text blurb 2.<br></span>
</div>
</div>
</body>
</html>
This is the output I see:
entry: <span class="Normal-C2">Main title 1<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span>
</div>
subtitle: <span class="Normal-C3">Optional Subtitle<br/></span>
subtitle: <span class="Normal-C3">Second Optional Subtitle</span>
subtitle: <span class="Normal-C3"><br/></span>
subtitle: <span class="Normal-C3">New Subtitle 1</span>
entry: <span class="Normal-C2">Main title 2<br/></span>
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span>
</div>
subtitle: <span class="Normal-C3">New Subtitle 1</span>
Here is my current code:
import codecs
import HTMLParser  # Python 2 module

import bs4

file = filepath + "test-page.html"
parser = HTMLParser.HTMLParser()
pageFile = codecs.open(file, 'r', encoding='utf-8')
pageRaw = pageFile.read()
page = parser.unescape(pageRaw)
soup = bs4.BeautifulSoup(page,'lxml')
titles = soup.select(".Normal-P1 .Normal-C2")
for entry in titles:
    print "entry:",entry
    parent = entry.parent
    print "parent: ",parent
    for subtitle in parent.select(" ~ .Normal-P1 .Normal-C3"):
        print "subtitle:", subtitle
Answer 0 (score: 1)
With CSS selectors, you'll want to target classes (via .Class-Name) and siblings (via ParentTag ~ .Child-Class). Mozilla's MDN Web Docs has some good primers on this.
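To make the two combinators concrete, here is a small sketch of how they behave in bs4's select. The markup, the ids a through d, and the variable names are made up for illustration, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: two titles, each followed by one subtitle.
html = """
<div id="wrap">
  <div class="Normal-P1" id="a"><span class="Normal-C2">Title 1</span></div>
  <div class="Normal-P1" id="b"><span class="Normal-C3">Sub 1</span></div>
  <div class="Normal-P1" id="c"><span class="Normal-C2">Title 2</span></div>
  <div class="Normal-P1" id="d"><span class="Normal-C3">Sub 2</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (a space): Normal-C2 elements inside a Normal-P1.
titles = [span.get_text() for span in soup.select(".Normal-P1 .Normal-C2")]

# General sibling combinator (~): every later sibling of #a, not only
# the ones that come before the next title.
after_a = [div["id"] for div in soup.select("#a ~ .Normal-P1")]

# Normal-C3 spans inside any later sibling of #a: both subtitles match.
subs_after_a = [s.get_text() for s in soup.select("#a ~ .Normal-P1 .Normal-C3")]

print(titles)
print(after_a)
print(subs_after_a)
```

The sibling combinator has no notion of "until the next title", so in a well-formed document the second title's subtitle is also matched for the first title.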
The Python file:
import bs4
import csv

entries = []
with open("example.html", "r") as page:
    soup = bs4.BeautifulSoup(page, 'lxml')
    # Select Normal-C2 elements nested inside a Normal-P1 element
    titles = soup.select(".Normal-P1 .Normal-C2")
    for entry in titles:
        entry_dict = {
            'Main Title': '',
            'Optional Subtitle 1': '',
            'Optional Subtitle 2': '',
            'Text Blurb': ''
        }
        parent = entry.parent
        entry_dict['Main Title'] = entry.text
        subtitles = [
            subtitle.text for subtitle in
            parent.select(' ~ .Normal-P1 .Normal-C3')
            # Normal-C3 elements inside any later Normal-P1 sibling
            # of the title's parent div
        ]
        try:
            entry_dict['Optional Subtitle 1'] = subtitles[0]
            entry_dict['Optional Subtitle 2'] = subtitles[1]
        except IndexError:
            pass
        entry_dict['Text Blurb'] = ' '.join(
            blurb.text for blurb in
            parent.select(' ~ .Normal-P1 .Normal-C4')
            # Normal-C4 elements inside any later Normal-P1 sibling
            # of the title's parent div
        )
        entries.append(entry_dict)

# newline='' is what the csv module recommends for files it writes
with open('out.csv', 'w', newline='') as csv_file:
    fieldnames = [
        'Main Title',
        'Optional Subtitle 1',
        'Optional Subtitle 2',
        'Text Blurb'
    ]
    writer = csv.DictWriter(
        csv_file,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_ALL,
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(entry)
The HTML file used:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Hello World</title>
</head>
<body>
<div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
<div class="Normal-P1">
<span class="Normal-C2">Main title 1<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">Optional Subtitle<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">Second Optional Subtitle</span>
<span class="Normal-C4">Text blurb 1.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 2.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4">Text blurb 4.<br></span>
</div>
<span class="Normal-C4"><br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3"><br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C2">Main title 2<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C3">Subtitle 1</span>
<span class="Normal-C4"> Other text blurb 1.<br></span>
</div>
<div class="Normal-P1">
<span class="Normal-C4"> Other text blurb 2.<br></span>
</div>
</body>
</html>
CSV output:
"Main Title","Optional Subtitle 1","Optional Subtitle 2","Text Blurb"
"Main title 1","Optional Subtitle","Second Optional Subtitle","Text blurb 1. Text blurb 2. Text blurb 4."
"Main title 2","Subtitle 1",""," Other text blurb 1. Other text blurb 2."
Edit: I'm not sure what you are doing with HTMLParser, but it isn't necessary. BeautifulSoup can read the file just fine.
import bs4
import codecs

with codecs.open("example.html", "r", encoding='utf-8') as page:
    soup = bs4.BeautifulSoup(page, 'lxml')
    # Select Normal-C2 elements nested inside a Normal-P1 element
    titles = soup.select(".Normal-P1 .Normal-C2")
    for entry in titles:
        print("entry: ", entry.text)
        parent = entry.parent
        print("parent:", parent)
        subtitles = [
            subtitle.text for subtitle in
            parent.select(' ~ .Normal-P1 .Normal-C3')
            # Normal-C3 elements inside any later Normal-P1 sibling
            # of the title's parent div
        ]
        try:
            print("subtitle: ", subtitles[0])
            print('subtitle: ', subtitles[1])
        except IndexError:
            pass
Output:
entry: Main title 1
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span>
</div>
subtitle: Optional Subtitle
subtitle: Second Optional Subtitle
entry: Main title 2
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span>
</div>
subtitle: Subtitle 1
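Since the ~ selector keeps matching past the next title, a well-formed document such as the one in Update 3 would attach later subtitles and blurbs to earlier titles. One possible alternative, offered as a sketch rather than as the method used above, is to walk the spans in document order and start a new row at each Normal-C2. The names collect_entries, SAMPLE and FIELDS are assumptions for illustration, SAMPLE is a trimmed copy of the Update 3 markup, and html.parser is chosen only to avoid the lxml dependency.

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed copy of the well-formed HTML from Update 3 (assumed input).
SAMPLE = """
<div id="txt_123_C01">
<div class="Normal-P1"><span class="Normal-C2">Main title 1<br></span></div>
<div class="Normal-P1"><span class="Normal-C3">Optional Subtitle<br></span></div>
<div class="Normal-P1">
<span class="Normal-C3">Second Optional Subtitle</span>
<span class="Normal-C4">Text blurb 1.<br></span></div>
<div class="Normal-P1"><span class="Normal-C4">Text blurb 2.<br></span></div>
<div class="Normal-P1"><span class="Normal-C4"><br></span></div>
<div class="Normal-P1"><span class="Normal-C3"><br></span></div>
<div class="Normal-P1"><span class="Normal-C2">Main title 2<br></span></div>
<div class="Normal-P1">
<span class="Normal-C3">New Subtitle 1</span>
<span class="Normal-C4"> Other text blurb 1.<br></span></div>
</div>
"""

FIELDS = ["Main Title", "Optional Subtitle 1", "Optional Subtitle 2", "Text Blurb"]


def collect_entries(html):
    """Return one CSV row dict per Normal-C2 title, bounded by the next title."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for span in soup.find_all("span"):  # find_all yields document order
        classes = span.get("class") or []
        text = span.get_text(strip=True)
        if "Normal-C2" in classes:
            groups.append({"title": text, "subs": [], "blurbs": []})
        elif not groups or not text:
            continue  # nothing before the first title; skip empty spans
        elif "Normal-C3" in classes:
            groups[-1]["subs"].append(text)
        elif "Normal-C4" in classes:
            groups[-1]["blurbs"].append(text)
    rows = []
    for group in groups:
        subs = group["subs"] + ["", ""]  # pad so both columns always exist
        rows.append({
            "Main Title": group["title"],
            "Optional Subtitle 1": subs[0],
            "Optional Subtitle 2": subs[1],
            "Text Blurb": " ".join(group["blurbs"]),
        })
    return rows


entries = collect_entries(SAMPLE)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(entries)
print(buffer.getvalue())
```

Writing to an io.StringIO buffer keeps the sketch self-contained; swapping it for a file opened with newline='' would produce out.csv as in the answer's script.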