Question

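Sorry, I'm new to HTML, so please bear with me, but my question is simple.

I want to build a simple search engine in Python.

First, I need to build a crawler that collects the URLs of the links on a page.

I'd like to use a regular expression to extract those link URLs.

I've done some reading, but I don't know the exact pattern that links follow in HTML.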

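from urllib.request import urlopen  # Python 3 location of urlopen
import re

# Decode the downloaded bytes so a str pattern can be applied to them.
webPage = urlopen('http://web.cs.dartmouth.edu/').read().decode('utf-8', errors='replace')
linkedPage = re.findall(r'what should be filled in here?', webPage)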

Answer 1

There are specialized tools for parsing HTML. They are called HTML parsers.

For example, using BeautifulSoup:

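from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser.
soup = BeautifulSoup(urlopen('http://web.cs.dartmouth.edu/'), 'html.parser')
# Select every <article> inside the "view-content" div and print its text.
for article in soup.select('div.view-content article'):
    print(article.text)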

This prints all the articles on the page:

Prof Sean Smith receives best paper of 2014 award
...
Lorenzo Torresani wins the Google Faculty Research Award
...
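
Since the question asks for link URLs rather than article text, the same approach can be pointed at `<a>` tags. What follows is a minimal sketch, not part of the original answer, that uses only the standard-library `html.parser` module so it runs without third-party packages. The class name `LinkCollector`, the helper `extract_links`, and the sample markup are all illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute link URLs found in an HTML string (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Sample markup made up for illustration; a crawler would pass the page it downloaded.
sample = """
<p><a href="/people">People</a>
<A HREF='http://example.com/a?x=1'>quoted differently</A>
<!-- <a href="/hidden">commented out</a> -->
<a name="top">no href</a></p>
"""
links = extract_links(sample, "http://web.cs.dartmouth.edu/")
print(links)
```

With BeautifulSoup, roughly the same result should be obtainable by looping over `soup.find_all('a', href=True)` and reading `a['href']` from each tag, then resolving it with `urljoin`.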

See also why you should not parse HTML with regular expressions:

RegEx match open tags except XHTML self-contained tags
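
To make that warning concrete, here is a small made-up illustration (not taken from the linked discussion) of how a naive pattern behaves on ordinary markup. The pattern `href="([^"]*)"` and the sample HTML are assumptions chosen for the demonstration.

```python
import re

# A naive pattern that only understands double-quoted, lowercase href attributes.
NAIVE_HREF = re.compile(r'href="([^"]*)"')

# Hypothetical markup: one plain link, one written in uppercase with single
# quotes, and one that is commented out and therefore not a live link.
sample = """
<a href="/people">People</a>
<A HREF='/research'>Research</A>
<!-- <a href="/hidden">commented out</a> -->
"""

found = NAIVE_HREF.findall(sample)
print(found)
```

On this sample the pattern misses `/research` and reports the commented-out `/hidden`. Each such case can be patched individually, but the pattern grows with every exception, which is the usual argument for handing the markup to a parser instead.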
