BeautifulSoup parses HTML into a single-line string

Time: 2015-06-23 18:52:25

Tags: python html web-scraping beautifulsoup

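For some reason, when I use BeautifulSoup to parse an HTML page and write the page to a txt file, it strips the HTML formatting and puts everything on one line. When I then search the file with a regex, it finds a match and prints that line, but this prints the entire page because it is all one line... How can I stop it from doing this?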

Here is my code:

#!/usr/bin/python3

from bs4 import BeautifulSoup
import re
import urllib.request


def main():
    #Open the PID file and read the PID's
    URLList = []
    PID = [open("PID.txt").read().split()]
    for list in PID:
        for code in list:
            URLList.append("http://www.abb.com/productdetails/" + code)
    pageNo = 1
    for URL in URLList:
        fh = open("html.txt", "a")
        fh.write("\n\n\n\n\n")
        webPage = urllib.request.urlopen(URL)
        soup = BeautifulSoup(webPage.read())
        print("Page", pageNo, "retrieved")
        fh.write(str(soup.prettify().encode("utf-8")))
        pageNo += 1
    fh.close()
    output = open('html.txt', 'r')
    for line in output:
        line = line.rstrip()
        if re.search('NetDepth', line):
            print(line)


if __name__ == "__main__": main()

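Basically, what I need it to do is open the UPC & PID / PID file, then go to the website those products are on and open their pages... that part works fine. Then I want to put all of the HTML into a txt file. From there, I want to search that file for certain elements, such as div tags or the ProductNetDepth id. The problem is that when it finds one of them, it prints the entire document because it treats it as a single line. I only want the line of HTML that contains the match.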

Here is some of the site's source code:

        <div class="Dimensions pisEvenRow">

                                                                        <div id="ProductNetLength" class="detailPageLeftColumn">
                        Product Net Length:
                                  </div>

                    <div class="detailPageRightColumn">

                                    <div>68 mm</div>
                                                                                                  </div>
            </div>
        <div class="Dimensions pisOddRow">

                                                                        <div id="ProductNetDepth" title="Depth of a single unpacked product" class="detailPageLeftColumn">Product Net Depth:</div>

                    <div class="detailPageRightColumn">

                                    <div>67.5 mm</div>
                                                                                                  </div>
            </div>
        <div class="Dimensions pisEvenRowLast">

                                                                        <div id="ProductNetWeight" title="Weight of a single unpacked product" class="detailPageLeftColumn">Product Net Weight:</div>

                    <div class="detailPageRightColumn">

                                    <div>0.041 kg</div>
                                                                                                  </div>

Here is what it looks like after BeautifulSoup writes it to the file:

ijQoI5DAFDwZHYnHo-npjlC99WMTQ6qWYJ8fkDP8ddGyBe9DZa4IVC3j3aFtg7m85t7V9lKauOCgTq5CZ7cJneFTTH12Nx8mLxeKkAmLee2awza0rGQucVII-WdAyptFtKvKDBSLWhBUFTU7WLzD7DN4tAZzUEbQDGL2VHY5A0&amp;t=635706797508895128"/>\xc2\xa0Loading Images..\r\n                </div>\n</div>\n</div>\n<div class="pisDetailPageTitle">General Information</div>\n<div class="pisOddRow">\n<div class="detailPageLeftColumn">\n<span>Extended Product Type:\r\n      </span>\n</div>\n<div class="detailPageRightColumn">\r\n                                  E213-25-001\r\n                  </div>\n</div>\n<div class="pisEvenRow">\n<div class="detailPageLeftColumn">\n<span>Product ID:\r\n      </span>\n</div>\n<div class="detailPageRightColumn">\r\n                                  2CCA703041R0001\r\n                  </div>\n</div>\n<div class="pisOddRow">\n<div class="detailPageLeftColumn">\n<span>EAN:\r\n      </span>\n</div>\n<div class="detailPageRightColumn">\r\n                                  7612270938711\r\n                  </div>\n</div>\n<div class="pisEvenRow">\n<div class="detailPageLeftColumn">\n<span>Catalog Description:\r\n      </span>\n</div>\n<div class="detailPageRightColumn">\r\n                                  E213-25-10 Change over switch 25A 1CO 250VAC\r\n                  </div>\n</div>\n<div class="pisOddRowLast">\n<div class="detailPageLeftColumn">\n<span>Long Description:\r\n      </span>\n</div>\n<div class="detailPageRightColumn">\r\n                                  Change over switches according DIN EN 60669-1, VDE 0632 Part 1, Rated currents: 16/25 A, 250 VACPDC, Contacts: 1 CO/2 CO, Module width: 0,5/1\r\n                  </div>\n</div>\n<div class="pisDetailPageTitle">\r\n      Categories\r\n      </div>\n<div class="pisEvenRowLast" id="pisEvenRowLast">\n<ul class="pisCategoryList">\n<span>Products</span><span class="CategorySeperator">\xc2\xbb</span>\n<li>                      Low Voltage Products and Systems\r\n                      </li>\n<span 
class="CategorySeperator">\xc2\xbb</span>\n<li>                      Modular DIN Rail Products\r\n                      </li>\n<span class="CategorySeperator">\xc2\xbb</span>\n<li>                      Modular DIN Rail Components MDRCs\r\n                      </li>\n<span class="CategorySeperator">\xc2\xbb</span>\n<li>                      Command Devices\r\n                      </li>\n</ul>\n</div>\n<div class="displayNone" id="PisDiv_PlaceHolder1">\xc2\xa0</div>\n<div class="pisDetailPageTitle" id="Ordering">Ordering</div>\n<div class="Ordering pisOddRow">\n<div class="detailPageLeftColumn" id="Ean">\r\n                            EAN:\r\n                                      </div>\n<div class="detailPageRightColumn">\n<div>7612270938711</div>\n</div>\n</div>\n<div class="Ordering pisEvenRow">\n<div class="detailPageLeftColumn" id="MinimumOrderQuantity">\r\n                            Minimum Order Quantity:\r\n                                      </div>\n<div class="detailPageRightColumn">\n<div>10 piece</div>\n</div>\n</div>\n<div class="Ordering pisOddRowLast">\n<div class="detailPageLeftColumn" id="CustomsTariffNumber">\r\n                            Customs Tariff Number:\r\n    

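If you can help, that would be great... I have tried everything from prettify to splitting the text into lines, but nothing seems to work properly. I want it formatted like the source code so that I can easily search it and pull out the items I need! Thanks for your help, and if you can, please don't just give me an answer; could you explain what you did?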

2 Answers:

Answer 0 (score: 0)

I tried this simple script to extract the NetDepth value, and it works fine.

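from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # Python 3 location of urlopen, matching the question's interpreter

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(urlopen('<insert url here>').read(), 'html.parser')
# The first next_sibling is a whitespace text node; the second is the value column
print(soup.find(id="ProductNetDepth").next_sibling.next_sibling.div.text)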

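If you look at the structure of the HTML, the div containing the mm measurement sits inside the div that is the sibling of the one with the id ProductNetDepth, so that is what I based this on. next_sibling is called twice because the first sibling is the whitespace text between the two tags.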
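To make the sibling navigation easier to try without hitting the site, here is a minimal offline sketch. It assumes a trimmed copy of the markup quoted in the question (the `SAMPLE_HTML` string is illustrative, not the full page) and uses `find_next_sibling`, which skips the whitespace text node so `next_sibling` does not have to be called twice.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the markup from the question
SAMPLE_HTML = """
<div class="Dimensions pisOddRow">
  <div id="ProductNetDepth" class="detailPageLeftColumn">Product Net Depth:</div>
  <div class="detailPageRightColumn">
    <div>67.5 mm</div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The label div carries the id we can search for
label = soup.find(id="ProductNetDepth")

# The immediate next_sibling is only the whitespace between the tags
whitespace_node = label.next_sibling

# find_next_sibling("div") jumps straight to the next <div> at the same level
value_column = label.find_next_sibling("div")

# The measurement lives in the first <div> inside that column
net_depth = value_column.div.get_text(strip=True)
print(net_depth)
```

Because this works on the parsed tree rather than on lines of text, it does not matter how the page's whitespace is laid out.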

If you are not familiar with Beautiful Soup's search functions, you should take a look at its very well-written documentation.
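As one illustration of those search functions, the sketch below collects every "ProductNet..." row in a single pass by giving `find_all` a compiled regular expression for the `id` attribute. The `net_dimensions` helper and the `SAMPLE_HTML` string are hypothetical names introduced for this example, and the markup is a shortened version of what the question shows.

```python
import re
from bs4 import BeautifulSoup

# Shortened, illustrative version of the markup from the question
SAMPLE_HTML = """
<div class="Dimensions pisEvenRow">
  <div id="ProductNetLength" class="detailPageLeftColumn">
    Product Net Length:
  </div>
  <div class="detailPageRightColumn"><div>68 mm</div></div>
</div>
<div class="Dimensions pisOddRow">
  <div id="ProductNetDepth" class="detailPageLeftColumn">Product Net Depth:</div>
  <div class="detailPageRightColumn"><div>67.5 mm</div></div>
</div>
"""


def net_dimensions(html):
    """Map each id starting with 'ProductNet' to the text of its value div."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # An attribute filter may be a compiled regex; it is matched against the id value
    for label in soup.find_all("div", id=re.compile(r"^ProductNet")):
        value_column = label.find_next_sibling("div", class_="detailPageRightColumn")
        if value_column is not None and value_column.div is not None:
            result[label["id"]] = value_column.div.get_text(strip=True)
    return result


dimensions = net_dimensions(SAMPLE_HTML)
print(dimensions)
```

The same function could be fed the body of each downloaded page, assuming the live pages keep the structure shown in the question.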

Answer 1 (score: 0)

There are a few possible solutions here, but I will demonstrate the simplest one.

First, I will review the problem statement and your solution.

Problem statement: print every line of the requested HTML pages that contains a particular phrase (in this case, "NetDepth").

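Attempted solution: you request the HTML with urllib, try to prettify it with BeautifulSoup, write it to a text file, and finally reopen the text file and use a regular expression to pull out the lines that match.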
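A likely cause of the single-line file, which the recap above does not spell out: in Python 3, calling str() on a bytes object returns its printable representation (b'...') with every newline shown as the two characters backslash and n. That matches the literal \n and \xc2\xa0 sequences in the file excerpt from the question. The sketch below uses only the standard library and a made-up three-line snippet to show the difference between str() and decode().

```python
# Made-up three-line snippet standing in for the prettified page
html = "<div>\n  <div>67.5 mm</div>\n</div>\n"

# What the question's code effectively does: encode, then call str() on the bytes
encoded = html.encode("utf-8")
as_repr = str(encoded)  # the bytes repr: newlines become a literal backslash + n

# Decoding the bytes gives back real text with real newlines
as_text = encoded.decode("utf-8")

repr_line_count = len(as_repr.splitlines())
text_line_count = len(as_text.splitlines())

print(repr_line_count)  # 1
print(text_line_count)  # 3
```

Under that reading, writing soup.prettify() to a file opened with encoding="utf-8", without the encode() and str() calls, would keep the line breaks.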

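In my opinion, this solution is a bit heavy-handed for what we actually need. There is no real reason to write the HTML to a file and then read it back from that file. We can process the HTML as we loop over the PIDs and make the HTTP requests. Also, apart from "prettifying", we are not really using BeautifulSoup's core capability, which is parsing HTML for specific tags (which it is remarkably good at, by the way). With these two points in mind, we arrive at the proposed solution...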

Proposed solution: use requests to fetch each HTML page, go through the page's content line by line, and run a regex on each line to find the lines that match.

Code:

#!/usr/bin/python3
import re
import requests

def main():
    # Open the PID file and read the whitespace-separated PIDs
    with open("PID.txt") as pid_file:
        codes = pid_file.read().split()
    URLList = ["http://www.abb.com/productdetails/" + code for code in codes]

    for URL in URLList:
        response = requests.get(url=URL)

        # response.text is already decoded to str, so each line can be searched directly
        for line in response.text.splitlines():
            line = line.strip()
            if re.search('NetDepth', line):
                print(line)


if __name__ == "__main__": main()

Remember PEP 20: "Simple is better than complex."