Question

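I'm learning Python and trying to build a web scraper without any third-party libraries, so that the process isn't simplified for me and I know what I'm doing. I've looked through several online resources, but all of them have left me confused about certain things.

The HTML looks like this: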

<html>
<head>...</head>
<body>
    *lots of other <div> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal"">
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div> tags*
</body>
</html>

I want the scraper to extract <div class = "want"...>*content*</div> and save it to an HTML file.

I have a very basic idea of how to approach this.

import urllib
from urllib import request
#import re
#from html.parser import HTMLParser

response = urllib.request.urlopen("http://website.com")
html = response.read()

# Somehow extract the wanted data

f = open('page.html', 'w')
f.write(data)
f.close()

Answer 1

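The standard library comes with a variety of Structured Markup Processing Tools, which you can use to parse the HTML and then search it to extract your div.

There are a lot of choices there. Which one do you use?

html.parser looks like the obvious choice, but I'd actually start with ElementTree instead. It's a very nice and very powerful API, there's plenty of documentation and sample code on the web to get you started, and a lot of experts use it daily and can help you with problems. If it turns out that etree can't parse your HTML, you'll have to use something else... but try it first.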
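If ElementTree does reject the real page, html.parser is the fallback mentioned above. The following is only a minimal sketch of that route: the class name `DivExtractor` and the helper `extract_div` are made up for illustration, it keeps only the first matching div, and it silently drops HTML comments.

```python
from html.parser import HTMLParser


class DivExtractor(HTMLParser):
    """Collect the raw markup of the first <div> whose class equals `wanted`."""

    def __init__(self, wanted):
        # convert_charrefs=False so entities such as &amp; reach the
        # handle_entityref/handle_charref hooks and can be copied through.
        super().__init__(convert_charrefs=False)
        self.wanted = wanted
        self.depth = 0      # <div> nesting depth inside the target; 0 = outside
        self.done = False   # set once the target div has been closed
        self.parts = []     # pieces of markup belonging to the target div

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == "div":
                self.depth += 1          # nested div inside the target
        elif tag == "div" and dict(attrs).get("class") == self.wanted:
            self.depth = 1               # entering the target div
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            self.parts.append("</%s>" % tag)
            if tag == "div":
                self.depth -= 1
                if self.depth == 0:
                    self.done = True     # target div fully captured

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.depth:
            self.parts.append("&#%s;" % name)


def extract_div(html_text, wanted="want"):
    """Return the markup of the first matching div, or '' if there is none."""
    parser = DivExtractor(wanted)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Because html.parser is event-based, the sketch has to track nesting depth by hand, which is a large part of why a tree API is the more comfortable starting point.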

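For example, here is your HTML snippet with a few small fixes so that it is actually valid and ElementTree can parse it, and with some text in your div that is worth extracting: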

<html>
<head>...</head>
<body>
    *lots of other <div /> tags*
<div class = "want" style="font-family:verdana;font-size:12px;letter-spacing:normal">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
<div class = "subdiv2" >...</div>
    *lots of other <div /> tags*
</div>
</body>
</html>

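You can use code like this (I'm assuming you know, or are willing to learn, XPath):

from xml.etree import ElementTree

tree = ElementTree.fromstring(page)  # page: the HTML text shown above
mydiv = tree.find('.//div[@class="want"]')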
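To make the lookup concrete, here is a self-contained sketch run against a trimmed copy of the fixed HTML. The `page` string is a hypothetical stand-in for the downloaded document, not the real site.

```python
from xml.etree import ElementTree

# Hypothetical stand-in for the downloaded page: a trimmed, well-formed
# version of the HTML shown above.
page = """<html>
<head>...</head>
<body>
<div class="other">not this one</div>
<div class = "want" style="font-size:12px">spam spam spam
<form class ="subform">...</form>
<div class = "subdiv1" >...</div>
</div>
</body>
</html>"""

tree = ElementTree.fromstring(page)   # returns the root <html> element

# './/div' searches all descendants of the root;
# [@class="want"] keeps only divs whose class attribute is exactly "want".
mydiv = tree.find('.//div[@class="want"]')

print(mydiv.tag, mydiv.get("class"))      # div want
print([child.tag for child in mydiv])     # ['form', 'div']
```

Note that `find` returns `None` when nothing matches, so real code should check for that before using `mydiv`.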

Now you've got a reference to the div with class "want". You can get its direct text like this:

print(mydiv.text)
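"Direct text" here means only the text that appears before the div's first child element. A small sketch of the difference, using a made-up one-line sample rather than the real page:

```python
from xml.etree import ElementTree

# Hypothetical, trimmed sample; not the asker's real page.
page = '<body><div class="want">spam <b>eggs</b> ham</div></body>'
mydiv = ElementTree.fromstring(page).find('.//div[@class="want"]')

direct_text = mydiv.text                 # only the text before the first child
all_text = "".join(mydiv.itertext())     # text of the div and all descendants

print(repr(direct_text))   # 'spam '
print(repr(all_text))      # 'spam eggs ham'
```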

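But if you want to extract the whole subtree, that's even easier:

# encoding='unicode' returns str rather than bytes, so it can be written to a text-mode file
data = ElementTree.tostring(mydiv, encoding='unicode')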
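Putting the pieces together with the download and save steps from the question, a rough end-to-end sketch might look like this. The function names `extract_want_div` and `scrape` are invented for illustration, and the code assumes the page is well-formed enough for ElementTree to parse.

```python
from urllib import request
from xml.etree import ElementTree


def extract_want_div(page):
    """Return the <div class="want"> subtree of `page` as a str.

    `page` may be str or bytes; it must be well-formed enough for
    ElementTree to parse (an assumption about the real site).
    """
    tree = ElementTree.fromstring(page)
    mydiv = tree.find('.//div[@class="want"]')
    if mydiv is None:
        raise ValueError('no <div class="want"> found')
    # encoding="unicode" makes tostring() return str instead of bytes.
    return ElementTree.tostring(mydiv, encoding="unicode")


def scrape(url, path="page.html"):
    """Download `url`, extract the div and write it to `path`."""
    with request.urlopen(url) as response:
        page = response.read()            # raw bytes, as in the question
    with open(path, "w", encoding="utf-8") as f:
        f.write(extract_want_div(page))
```

With the question's placeholder URL, the call would be `scrape("http://website.com")`. Keeping the extraction in its own function means it can be tried on a local string before any network access is involved.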

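If you want to wrap that up in a valid <html> and <body>, and/or remove the <div> itself, you'll have to do that part manually. The documentation explains how to build elements with the simple tree API: create a head and a body to put in an html element, stick the div in the body, then call tostring on the html, and that's about it.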
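The manual wrapping step could be sketched roughly as follows. The helper name `wrap_in_document` is hypothetical, and `method="html"` is one possible choice that writes an empty head as `<head></head>` instead of the XML-style `<head />`.

```python
from xml.etree import ElementTree


def wrap_in_document(div):
    """Build an <html> element with an empty <head> and a <body> holding `div`."""
    html = ElementTree.Element("html")
    ElementTree.SubElement(html, "head")          # empty head
    body = ElementTree.SubElement(html, "body")
    body.append(div)     # the div keeps its own attributes and children
    # method="html" serializes using HTML rules rather than XML rules.
    return ElementTree.tostring(html, encoding="unicode", method="html")


# Usage with a made-up div; in the scraper this would be `mydiv`.
div = ElementTree.fromstring('<div class="want">spam</div>')
document = wrap_in_document(div)
```

The resulting string can then be written to page.html in place of the bare div.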
