Question

I'm currently learning Python and trying to build a small scraper, but I'm running into problems with Beautiful Soup and regex.

I am trying to match all the links on a site that has the following HTML:

<td>
    <a href="/l1234/Place+Number+1">Place Number 1 </a>
</td>
<td width="100">
    California  </td>
<td>
    <a href="/l2342/Place+Number+2">Place Number 2 </a>
</td>
<td width="100">
    Florida </td>

I want to get all links of the form "/lxxxx/Place+Number+x".

I am using Python and BeautifulSoup:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

address = 'http://www.example.com'

html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)

for tag in soup.findAll('a', id = re.compile('l[0-9]*')):
    print tag['href']

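I found the regex part of soup.findAll in some sample code, because I can't seem to get the examples from the BeautifulSoup documentation to work. Without the regex part I get every link on the page, but I only want the "lxxxx" ones.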

What is wrong with my regex? Maybe there is a way to do this without regex, but I can't seem to find one.

Answer 1

Shouldn't you be matching the regex against href rather than id?

for tag in soup.findAll('a', href = re.compile('l[0-9]*')):
    print tag['href']
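To see why filtering on id finds nothing, here is a minimal sketch that lists the attributes the anchors actually carry. It uses the Python 3 standard library html.parser as a stand-in for BeautifulSoup (an assumption, chosen so the snippet runs without extra packages); the AnchorAttrs class name is made up for illustration.

```python
from html.parser import HTMLParser

# The HTML fragment from the question.
HTML = """
<td>
    <a href="/l1234/Place+Number+1">Place Number 1 </a>
</td>
<td width="100">
    California  </td>
<td>
    <a href="/l2342/Place+Number+2">Place Number 2 </a>
</td>
<td width="100">
    Florida </td>
"""


class AnchorAttrs(HTMLParser):
    """Hypothetical helper: records the attributes of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.attrs_seen = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.attrs_seen.append(dict(attrs))


parser = AnchorAttrs()
parser.feed(HTML)

# None of the anchors has an id attribute, so an id filter matches nothing.
with_id = [a for a in parser.attrs_seen if "id" in a]
# The link targets live in href instead.
hrefs = [a["href"] for a in parser.attrs_seen if "href" in a]

print(with_id)
print(hrefs)
```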

Answer 2

I would suggest

for tag in soup.findAll('a', href = re.compile('^/l[0-9]+/.*$')):
    print tag['href']

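to avoid picking up tags that only look somewhat like what you are after. The ^ anchor and [0-9]+ require the href to start with /l followed by at least one digit.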
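A quick sketch of the difference, using only the re module on a list of href strings. The last three hrefs are invented for illustration, and the comparison assumes the pattern is allowed to match anywhere in the attribute value (search semantics), which is worth verifying against your BeautifulSoup version.

```python
import re

# Only the first two hrefs come from the question; the rest are hypothetical.
hrefs = [
    "/l1234/Place+Number+1",
    "/l2342/Place+Number+2",
    "/login",
    "/help",
    "/blog/l1234/Place+Number+1",
]

loose = re.compile('l[0-9]*')          # "l" followed by zero or more digits
strict = re.compile('^/l[0-9]+/.*$')   # must start with "/l" plus one or more digits

# [0-9]* may match zero digits, so any href containing an "l" passes.
loose_hits = [h for h in hrefs if loose.search(h)]
# The anchored pattern keeps only hrefs that begin with /l<digits>/.
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)
print(strict_hits)
```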

Answer 3

In addition to checking href rather than id, I would use:

re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')

match seems to assume that your regex starts with "^", so it only succeeds at the beginning of the string:

>>> m = re.compile(r"abc")
>>> m.match("eabc")
>>> m.match("abcd")
<_sre.SRE_Match object at 0x7f23192318b8>

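So adding \/ lets the leading slash match. I am also using {4} to match exactly four digits, rather than * which matches zero or more digits and so accepts far too much: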

>>> m = re.compile(r'\/l[0-9]*')
>>> m.match("/longurl/somewhere")
<_sre.SRE_Match object at 0x7f2319231850>
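The same checks can be written as a small script, which may be easier to rerun than the interactive session. This is only a sketch with the standard re module; the variable names are made up, and the sample URLs are the ones used above plus one link from the question.

```python
import re

# match() only succeeds at position 0; search() scans the whole string.
abc = re.compile(r"abc")
match_result = abc.match("eabc")     # no match: "abc" is not at the start
search_result = abc.search("eabc")   # match found starting at index 1

loose = re.compile(r'\/l[0-9]*')
strict = re.compile(r'^\/l[0-9]{4}/Place\+Number\+[0-9]+')

# "*" allows zero digits, so the loose pattern accepts an unrelated URL.
loose_accepts_long = loose.match("/longurl/somewhere") is not None
# "{4}" demands exactly four digits after "/l", so the strict pattern rejects it.
strict_accepts_long = strict.match("/longurl/somewhere") is not None
# A link from the question still matches the strict pattern.
strict_accepts_place = strict.match("/l1234/Place+Number+1") is not None

print(match_result, search_result.start())
print(loose_accepts_long, strict_accepts_long, strict_accepts_place)
```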

Problem with regex matching
