Question

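Hey, I've just started learning Python and want to write a web scraper. On the site I'm working with I only care about the name and the price, but all the information sits in one div that contains three child divs. The HTML looks like this: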

<div class="Product">
    <div class="product-image-and-name-container"></div>
    <div class="prices"></div>
    <div class="buy-now-button"></div>
</div>

I use this line to get everything inside "Product":

 root_pattern = '<div class="Product">([\s\S]*?)</div>'

But it only returns the first div, "product-image-and-name-container", and then stops. I get nothing from the other divs.

Here is all my code:

from urllib.request import Request, urlopen
import re

class Shopping_Spider():
    url = 'http://www....com/Shop-Online/587'
    root_pattern = '<div class="Product">([\s\S]*?)</div>'
    name_pattern = '<div class="product-name">([\s\S]*?)</div>'
    price_pattern = '<span class="Price">([\s\S]*?)</span>'

    def __fetch_content(self):
        # page = urllib.urlopen(Shopping_Spider.url)
        r = Request(Shopping_Spider.url, headers={'User-Agent': 'Mozilla/5.0'})
        html_s = urlopen(r).read()
        html_s = str(html_s, encoding='utf-8')
        return html_s

    def __analysis(self, html_s):
        root_html = re.findall(Shopping_Spider.root_pattern, html_s)

        anchors = []

        for html in root_html:
            name = re.findall(Shopping_Spider.name_pattern, html)
            price = re.findall(Shopping_Spider.price_pattern, html)
            anchor = {'name': name, 'price': price}
            anchors.append(anchor)

        return anchors

    def go(self):
        html_s = self.__fetch_content()
        self.__analysis(html_s)


shopping_spider = Shopping_Spider()
shopping_spider.go()

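Thanks in advance. I think my regular expression is wrong, but I don't know how to rewrite it. I know it would probably be easier to handle this with BeautifulSoup, but I'd like to know whether it is possible to get what I want using only regular expressions. Many thanks.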

Answer 1

You can extract the inner content of the outer div with a regular expression:

root_pattern = r'(?:<div class="Product">)(.*)(?:</div>)'

The pattern above defines three groups, but the ones that start with ?: are non-capturing, so only the middle group is returned.

However, you have to set the DOTALL flag so that the dot (.) also matches the \n character, i.e. every character. So later in your code, use:

root_html = re.findall(Shopping_Spider.root_pattern, html_s, re.DOTALL)

You can then adjust the remaining patterns along the same lines.
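To make the effect of the flag concrete, here is a minimal sketch. The sample markup and the "Widget" / "9.99" values are made up for illustration; only the pattern and the re.DOTALL call come from the answer.

```python
import re

# Hypothetical sample markup, modelled on the snippet in the question.
sample = """<div class="Product">
    <div class="product-image-and-name-container">Widget</div>
    <div class="prices">9.99</div>
    <div class="buy-now-button"></div>
</div>"""

root_pattern = r'(?:<div class="Product">)(.*)(?:</div>)'

# Without re.DOTALL, "." does not match "\n", so the pattern cannot
# span the line breaks between the opening and closing tags.
without_flag = re.findall(root_pattern, sample)

# With re.DOTALL, "." also matches newlines. findall returns only the
# single capturing group; the two (?:...) groups are not reported.
with_flag = re.findall(root_pattern, sample, re.DOTALL)

print(without_flag)
print(with_flag)
```

On this single-product sample the flagged version returns one string holding all three child divs. That only holds because the sample contains exactly one product and nothing after it.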

Edit (important):

Jimmy, unless a regex expert shows up with a solution, please use BeautifulSoup.

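Unless your target HTML pages are as simple as the sample (they never are), this will not work in practice, even though it works on the sample. Alex's comment is spot on. Feel free to unaccept my answer and give him the credit, because I tend to believe the better advice will always be to go with BS (even though you asked for a regex alternative). You can still upvote this answer if you found it useful in some way.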

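The problem is that div tags can be nested arbitrarily in a document. The regex I proposed captures everything from the opening Product div up to the last closing div in the document, which is useless in practice. That happens because * is "greedy". You can avoid it by putting ? after the *, but that solves nothing, because the match then stops at the first closing div. Moreover, I don't think there is any way to pair a closing div with its opening one: all closing divs look the same, and the nesting is arbitrary and can change, for example when more divs are added inside the Product div.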
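Both failure modes can be seen side by side in a small sketch. The two-product page below is a hypothetical stand-in for the real site, which is not shown in the question.

```python
import re

# Hypothetical page with two products, each nested like the question's markup.
page = (
    '<div class="Product">'
    '<div class="product-name">A</div><div class="prices">1</div>'
    '</div>'
    '<div class="Product">'
    '<div class="product-name">B</div><div class="prices">2</div>'
    '</div>'
)

# Greedy: one match that runs from the first Product div to the LAST </div>,
# swallowing the second product along the way.
greedy = re.findall(r'<div class="Product">(.*)</div>', page, re.DOTALL)

# Lazy: one match per product, but each stops at the FIRST </div>,
# so the prices are never reached.
lazy = re.findall(r'<div class="Product">(.*?)</div>', page, re.DOTALL)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The lazy result is the behaviour reported in the question: only the first child div comes back.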

Not without starting to write code that parses the HTML in some way, which is exactly what BS is for.
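As a rough illustration of what "code that parses the HTML" involves, the sketch below counts div depth with the standard library's html.parser instead of BeautifulSoup, so that it runs without extra installs. The ProductParser class, the sample markup and its values are all hypothetical, and the class check assumes the attribute is exactly "Product" with no additional class names.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text inside each <div class="Product">, tracking div depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # 0 means we are outside any Product div
        self.products = []  # one list of text fragments per product

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside the current product
        elif dict(attrs).get("class") == "Product":
            self.depth = 1   # entering a new product
            self.products.append([])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1  # closes the innermost open div

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.products[-1].append(text)


# Hypothetical sample markup; the real page is not shown in full.
sample = """
<div class="Product">
    <div class="product-image-and-name-container"><div class="product-name">Widget</div></div>
    <div class="prices"><span class="Price">9.99</span></div>
    <div class="buy-now-button"></div>
</div>
<div class="Product">
    <div class="product-image-and-name-container"><div class="product-name">Gadget</div></div>
    <div class="prices"><span class="Price">19.50</span></div>
    <div class="buy-now-button"></div>
</div>
"""

parser = ProductParser()
parser.feed(sample)
parser.close()
print(parser.products)
```

The depth counter is what a regular expression cannot provide: it pairs each closing div with its opening one, so the product only ends when the outer div closes. A library such as BeautifulSoup does this bookkeeping for you and copes with messier markup.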

Regular expression to get all divs (including their content) from a div?
