Using BeautifulSoup

Time: 2018-12-02 21:34:22

Tags: python beautifulsoup

This is the structure of the HTML document I need to parse (see UPDATE 3):

    <div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
        <div class="Normal-P1">
        <span class="Normal-C2">Main title 1<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C3">Optional Subtitle<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C3">Second Optional Subtitle</span>
        <span class="Normal-C4">Text blurb 1.<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C4">Text blurb 2.<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C4">Text blurb 4.<br></span></div>
        <span class="Normal-C4"><br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C3"><br></span></div>

    <div class="Normal-P1">
        <span class="Normal-C2">Main title 2<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C3">Subtitle 1</span>
        <span class="Normal-C4"> Other text blurb 1.<br></span></div>
        <div class="Normal-P1">
        <span class="Normal-C4"> Other text blurb 2.<br></span></div>

I want to generate a CSV file that looks like this:

    Main Title     Optional Subtitle 1     Optional Subtitle 2        Text Blurb
    ----------     -------------------     -------------------        ----------
    Main title 1   Optional Subtitle       Second Optional Subtitle   Text blurb 1. Text blurb 2. Text blurb 4.
    Main title 2   Subtitle 1                                         Other text blurb 2.

I tried:

soup = BeautifulSoup(page,'xml')
divText = soup.find_all('div', {'class':'Normal-P1'})
for item in divText:
    spanTitle = soup.find_all('span',{'class':'Normal-C2'})
    spanOptional = soup.find_all('span',{'class':'Normal-C3'})

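However, this approach does not let me separate the Normal-P1 blocks so that I go from a C2 through to the C4s and then start over. The C3 between a C2 and the next C4 is not always present. In any case, the C4 is the final tag before the next C2.

I considered putting all the divs in a list and then splitting them into sublists at each C2 to process them. I am trying to find out whether there is a more elegant solution using bs4.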
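The sublist idea can be sketched in plain Python. This is only a minimal sketch under assumptions: it takes the spans already flattened into (class, text) pairs in document order (for example, pairs built from the spans that bs4 returns), and the function name group_by_title and the record keys are made up for illustration.

```python
def group_by_title(spans):
    """Split (css_class, text) pairs into one record per Normal-C2 title."""
    records = []
    current = None
    for css_class, text in spans:
        text = text.strip()
        if css_class == "Normal-C2":
            # A main title starts a new record.
            current = {"title": text, "subtitles": [], "blurbs": []}
            records.append(current)
        elif current is None or not text:
            # Skip spans before the first title and empty <br>-only spans.
            continue
        elif css_class == "Normal-C3":
            current["subtitles"].append(text)
        elif css_class == "Normal-C4":
            current["blurbs"].append(text)
    return records


# Hypothetical input: the spans from the question, flattened in document order.
spans = [
    ("Normal-C2", "Main title 1"),
    ("Normal-C3", "Optional Subtitle"),
    ("Normal-C3", "Second Optional Subtitle"),
    ("Normal-C4", "Text blurb 1."),
    ("Normal-C4", "Text blurb 2."),
    ("Normal-C4", "Text blurb 4."),
    ("Normal-C4", ""),
    ("Normal-C3", ""),
    ("Normal-C2", "Main title 2"),
    ("Normal-C3", "Subtitle 1"),
    ("Normal-C4", " Other text blurb 1."),
    ("Normal-C4", " Other text blurb 2."),
]

records = group_by_title(spans)
for record in records:
    print(record)
```

Because a new record starts only at a Normal-C2, a missing subtitle simply leaves the subtitles list shorter, and nothing leaks from one title into the next.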

UPDATE 1

Coming back to this after a while. I just looked at my output using the answer below and saw a problem.

Looking at

   titles = soup.select(".Normal-P1 .Normal-C2")
   for entry in titles:
            print "entry:",entry
            parent = entry.parent
            print "parent: ",parent
            subtitles = [
                subtitle.text for subtitle in
                parent.select(' ~ .Normal-P1 .Normal-C3')
            ]
            print "subtitles:",subtitles

I found that subtitles contains results from outside the current title's block (i.e. from all titles). The output looks like this:

entry: <span class="Normal-C2">Main title 1<br/></span>
parent:  <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span></div>
subtitles: [Optional Subtitle,Second Optional Subtitle,Subtitle 1]


entry: <span class="Normal-C2">Main title 2<br/></span>
parent:  <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span></div>
subtitles: [Subtitle 1]

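UPDATE 2

parent.select(" ~ .Normal-P1 .Normal-C3") seems to be the cause of this problem.

The problem is in the HTML provided in the solution: <span class="Normal-C4"><br></span> </div>. It is missing the opening <div class="Normal-P1"> and the closing </div> at the end. After making these changes, I see the same problem in the sample HTML as well (all subtitles in the document show up for an entry).

I double-checked the indentation and it looks fine to me. What am I doing wrong?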

UPDATE 3

Here is the full HTML

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Hello World</title>
</head>
<body>
  <div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
    <div class="Normal-P1">
      <span class="Normal-C2">Main title 1<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C3">Optional Subtitle<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C3">Second Optional Subtitle</span>
      <span class="Normal-C4">Text blurb 1.<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C4">Text blurb 2.<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C4">Text blurb 4.<br></span>
    </div>
    <div class="Normal-P1">
    <span class="Normal-C4"><br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C3"><br></span>
  </div>

  <div class="Normal-P1">
    <span class="Normal-C2">Main title 2<br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C3">New Subtitle 1</span>
    <span class="Normal-C4"> Other text blurb 1.<br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C4"> Other text blurb 2.<br></span>
  </div>
</div>
</body>
</html>

This is the output I see

    entry: <span class="Normal-C2">Main title 1<br/></span>
    parent:  <div class="Normal-P1">

<span class="Normal-C2">Main title 1<br/></span>

    </div>
    subtitle: <span class="Normal-C3">Optional Subtitle<br/></span>
    subtitle: <span class="Normal-C3">Second Optional Subtitle</span>
    subtitle: <span class="Normal-C3"><br/></span>
    subtitle: <span class="Normal-C3">New Subtitle 1</span>
    entry: <span class="Normal-C2">Main title 2<br/></span>
    parent:  <div class="Normal-P1">
    <span class="Normal-C2">Main title 2<br/></span>
    </div>
    subtitle: <span class="Normal-C3">New Subtitle 1</span>

This is my current code:

import codecs
import HTMLParser

import bs4

file = filepath + "test-page.html"
parser = HTMLParser.HTMLParser()
pageFile = codecs.open(file, 'r', encoding='utf-8')
pageRaw = pageFile.read()
page = parser.unescape(pageRaw)

soup = bs4.BeautifulSoup(page,'lxml')
titles = soup.select(".Normal-P1 .Normal-C2")

for entry in titles:
    print "entry:",entry
    parent = entry.parent
    print "parent: ",parent

    for subtitle in parent.select(" ~ .Normal-P1 .Normal-C3"):
        print "subtitle:", subtitle

1 Answer:

Answer 0: (score: 1)

Using CSS selectors, you will want to target class names (done via .Class-Name) and siblings (done via ParentTag ~ .Child-Class).

Mozilla's MDN Web Docs has some good primers on this.
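To make the two combinators concrete, here is a minimal sketch. It assumes bs4 with the built-in html.parser, and the HTML snippet and the ids t1 and t2 are made up for illustration. Note that ~ matches every later sibling, not only the ones up to the next title.

```python
import bs4

# Hypothetical snippet: two title blocks, each followed by one subtitle block.
html = """
<div id="root">
  <div class="Normal-P1" id="t1"><span class="Normal-C2">Title 1</span></div>
  <div class="Normal-P1"><span class="Normal-C3">Sub A</span></div>
  <div class="Normal-P1" id="t2"><span class="Normal-C2">Title 2</span></div>
  <div class="Normal-P1"><span class="Normal-C3">Sub B</span></div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Descendant combinator (space): Normal-C2 anywhere inside a Normal-P1.
titles = [span.text for span in soup.select(".Normal-P1 .Normal-C2")]

# General sibling combinator (~): every later sibling of #t1 that has class
# Normal-P1, then the Normal-C3 elements inside those siblings.
after_t1 = [span.text for span in soup.select("#t1 ~ .Normal-P1 .Normal-C3")]
after_t2 = [span.text for span in soup.select("#t2 ~ .Normal-P1 .Normal-C3")]

print(titles)
print(after_t1)
print(after_t2)
```

Here after_t1 contains both subtitles because the second subtitle block is still a later sibling of the first title block.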

Python file:

import bs4
import csv

entries = []

with open("example.html", "r") as page:
    soup = bs4.BeautifulSoup(page, 'lxml')

    # CSS selector: Normal-C2 elements that are descendants of a
    # Normal-P1 element
    titles = soup.select(".Normal-P1 .Normal-C2")

    for entry in titles:
        entry_dict = {
            'Main Title': '',
            'Optional Subtitle 1': '',
            'Optional Subtitle 2': '',
            'Text Blurb': ''
        }
        parent = entry.parent

        entry_dict['Main Title'] = entry.text

        subtitles = [
            subtitle.text for subtitle in
            parent.select(' ~ .Normal-P1 .Normal-C3')
            # CSS selector: Normal-C3 descendants of the Normal-P1 siblings
            # that follow the title's parent div
        ]
        try:
            entry_dict['Optional Subtitle 1'] = subtitles[0]
            entry_dict['Optional Subtitle 2'] = subtitles[1]
        except IndexError:
            pass

        entry_dict['Text Blurb'] = ' '.join(
            blurb.text for blurb in
            parent.select(' ~ .Normal-P1 .Normal-C4')
            # CSS selector: Normal-C4 descendants of the Normal-P1 siblings
            # that follow the title's parent div
        )

        entries.append(entry_dict)

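    # newline='' lets the csv module control line endings (Python 3),
    # which avoids blank rows between records on Windows
    with open('out.csv', 'w', newline='') as csv_file: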
        fieldnames = [
            'Main Title',
            'Optional Subtitle 1',
            'Optional Subtitle 2',
            'Text Blurb'
        ]
        writer = csv.DictWriter(
            csv_file,
            fieldnames=fieldnames,
            quoting=csv.QUOTE_ALL,
        )
        writer.writeheader()
        for entry in entries:
            writer.writerow(entry)

HTML file used:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Hello World</title>
</head>
<body>
  <div id="txt_123_C01" style="position:absolute; left:5px; top:206px; width:532px; height:8912px;">
    <div class="Normal-P1">
      <span class="Normal-C2">Main title 1<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C3">Optional Subtitle<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C3">Second Optional Subtitle</span>
      <span class="Normal-C4">Text blurb 1.<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C4">Text blurb 2.<br></span>
    </div>
    <div class="Normal-P1">
      <span class="Normal-C4">Text blurb 4.<br></span>
    </div>
    <span class="Normal-C4"><br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C3"><br></span>
  </div>

  <div class="Normal-P1">
    <span class="Normal-C2">Main title 2<br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C3">Subtitle 1</span>
    <span class="Normal-C4"> Other text blurb 1.<br></span>
  </div>
  <div class="Normal-P1">
    <span class="Normal-C4"> Other text blurb 2.<br></span>
  </div>
</body>
</html>

CSV output:

"Main Title","Optional Subtitle 1","Optional Subtitle 2","Text Blurb"

"Main title 1","Optional Subtitle","Second Optional Subtitle","Text blurb 1. Text blurb 2. Text blurb 4."

"Main title 2","Subtitle 1",""," Other text blurb 1.  Other text blurb 2."

Edit: I'm not sure what you are using HTMLParser for, but it isn't necessary. BeautifulSoup can read the file just fine.

import bs4
import codecs


with codecs.open("example.html", "r", encoding='utf-8') as page:
    soup = bs4.BeautifulSoup(page, 'lxml')

    # CSS selector: Normal-C2 elements that are descendants of a
    # Normal-P1 element
    titles = soup.select(".Normal-P1 .Normal-C2")

    for entry in titles:
        print("entry: ", entry.text)
        parent = entry.parent
        print("parent:", parent)

        subtitles = [
            subtitle.text for subtitle in
            parent.select(' ~ .Normal-P1 .Normal-C3')
            # CSS selector: Normal-C3 descendants of the Normal-P1 siblings
            # that follow the title's parent div
        ]
        try:
            print("subtitle: ", subtitles[0])
            print('subtitle: ', subtitles[1])
        except IndexError:
            pass

Output

entry:  Main title 1
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 1<br/></span>
</div>
subtitle:  Optional Subtitle
subtitle:  Second Optional Subtitle
entry:  Main title 2
parent: <div class="Normal-P1">
<span class="Normal-C2">Main title 2<br/></span>
</div>
subtitle:  Subtitle 1
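With well-formed HTML, where every block is a sibling under the same container, the sibling selector reaches past the next title. One possible way around that, offered as a sketch and not as the definitive fix, is to walk the following siblings and stop at the next title block. It assumes bs4 with the built-in html.parser and a shortened version of the markup from UPDATE 3. The helper texts and the "Subtitles" key are made up for illustration. It also avoids selectors that start with a combinator, which newer Beautiful Soup releases may not accept.

```python
import bs4

# Shortened, well-formed version of the UPDATE 3 markup (assumed for illustration).
html = """
<div id="txt_123_C01">
  <div class="Normal-P1"><span class="Normal-C2">Main title 1<br></span></div>
  <div class="Normal-P1"><span class="Normal-C3">Optional Subtitle<br></span></div>
  <div class="Normal-P1">
    <span class="Normal-C3">Second Optional Subtitle</span>
    <span class="Normal-C4">Text blurb 1.<br></span>
  </div>
  <div class="Normal-P1"><span class="Normal-C4">Text blurb 2.<br></span></div>
  <div class="Normal-P1"><span class="Normal-C3"><br></span></div>
  <div class="Normal-P1"><span class="Normal-C2">Main title 2<br></span></div>
  <div class="Normal-P1">
    <span class="Normal-C3">New Subtitle 1</span>
    <span class="Normal-C4"> Other text blurb 1.<br></span>
  </div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def texts(blocks, css_class):
    """Non-empty, stripped texts of the spans with css_class inside blocks."""
    found = []
    for block in blocks:
        for span in block.find_all("span", class_=css_class):
            text = span.get_text(strip=True)
            if text:
                found.append(text)
    return found


entries = []
for title in soup.select(".Normal-P1 .Normal-C2"):
    blocks = []
    # Walk the following sibling divs, stopping at the next title block.
    for sibling in title.parent.find_next_siblings("div", class_="Normal-P1"):
        if sibling.find("span", class_="Normal-C2") is not None:
            break
        blocks.append(sibling)
    entries.append({
        "Main Title": title.get_text(strip=True),
        "Subtitles": texts(blocks, "Normal-C3"),
        "Text Blurb": " ".join(texts(blocks, "Normal-C4")),
    })

for entry in entries:
    print(entry)
```

Each entry now only sees the blocks between its own title and the next one, so the subtitles of "Main title 2" no longer appear under "Main title 1".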