Question

Using BeautifulSoup to scrape a page; trying to filter out links that end in "...html#comments"

The code is as follows:

import urllib.request
import re
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    if i.has_key('href') and \
    re.search(base_url, i['href']) and \
    len(i['href']) > len(base_url) and \
    re.search(r'[^(comments)]', i['href']):
        print(i['href'])

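Python 3.2, Windows 7 64-bit.

The script above keeps the links that end in "#comments".

I tried re.search([^comments], i['href']), re.search([^(comments)], i['href']) and re.search([^'comments'], i['href']) - all of them threw syntax errors.

New to Python, so apologies for the mediocrity.

I'm guessing that either (a) I don't understand the 'r' prefix well enough to use it correctly, (b) given [^(foo)], re.search doesn't return the set of lines that exclude 'foo', but rather the set of lines that contain something more than just 'foo' (for example, I'm retaining my ...#comments links because ...texttexttext.html precedes #comments), or (c) Python is interpreting "#" as a comment that ends the line re.search should be matching.

I think I'm wrong about (b).

Sorry, I know this is simple. Thanks,

Zack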

Answer 1

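[^(comments)]

means "one character that is neither a (, nor a c, o, m, e, n, t, s or )". Probably not what you wanted.

If your aim is to have the regex match only if the provided string doesn't end in #comments, then I would use

... and not re.search("#comments$", i['href'])

or even better (why use a regex if it's that simple?):

... and not i['href'].endswith("#comments")

As for your other questions:

The r'...' notation allows you to write "raw strings", meaning that backslashes don't need to be escaped:

r'\b' means "backslash + b" (which the regex engine interprets as "word boundary")
'\b' means "backspace character"
etc.

# has no special meaning in a regex unless you use the (?x) or re.VERBOSE option. In that case, it does start a comment in a multiline regex.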
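
To make these points concrete, here is a small sketch that can be run without any network access. The URL in href is a made-up example (only the base part comes from the question), and the variable names are purely illustrative. Note also that the pattern passed to re.search has to be a string; an unquoted [^comments] is not valid Python, which would explain the syntax errors mentioned in the question.

```python
import re

# Hypothetical link, built from the base URL in the question.
href = "http://voices.washingtonpost.com/thefix/morning-fix/example.html#comments"

# [^(comments)] is a negated character class: it matches any ONE character
# that is not one of ( c o m e n t s ). The leading "h" already qualifies,
# so this search succeeds for practically every URL.
negated_class_hit = re.search(r'[^(comments)]', href) is not None

# Anchoring the literal text with $ checks the end of the string instead.
ends_with_regex = re.search(r'#comments$', href) is not None

# The plain string method gives the same answer without a regex.
ends_with_method = href.endswith('#comments')

# Raw string vs. normal string: r'\b' is two characters (backslash and b),
# while '\b' is the single backspace character.
raw_length = len(r'\b')
plain_length = len('\b')

# By default "#" is an ordinary character in a pattern.
literal_hash = re.search('#comments', 'no hash here') is not None

# Under re.VERBOSE, "#" starts a comment, so the whole pattern below is
# empty and therefore matches any string.
verbose_hash = re.search('#comments', 'no hash here', re.VERBOSE) is not None

print(negated_class_hit, ends_with_regex, ends_with_method)
print(raw_length, plain_length)
print(literal_hash, verbose_hash)
```

The first search succeeding is the reason the original script kept every link: the character class never looks at the word "comments" as a whole.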

Answer 2

A regular expression may not be the best solution here:

import urllib.request
from bs4 import BeautifulSoup

base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"
soup = BeautifulSoup(urllib.request.urlopen(base_url)).findAll('a')
links_to_follow = []
for i in soup:
    href = i.get('href')
    if href is None:
        continue
    if not href.startswith(base_url):
        continue
    if not href.endswith('#comments'):
        print(href)
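
The same filtering logic can be tried without fetching the page by moving it into a small function that works on plain strings. This is only a sketch: the function name links_to_follow_from and the sample hrefs are invented for illustration, and the length check is carried over from the question's own condition.

```python
def links_to_follow_from(hrefs, base_url):
    """Keep hrefs under base_url, dropping base_url itself and '#comments' links."""
    kept = []
    for href in hrefs:
        # Anchors without an href attribute show up as None.
        if href is None:
            continue
        # Only follow links that live under the base URL and go beyond it.
        if not href.startswith(base_url) or len(href) <= len(base_url):
            continue
        # Skip the comment-section links.
        if href.endswith('#comments'):
            continue
        kept.append(href)
    return kept


base_url = "http://voices.washingtonpost.com/thefix/morning-fix/"

# Hypothetical hrefs standing in for the values of i.get('href').
sample_hrefs = [
    None,
    base_url,
    base_url + "post.html",
    base_url + "post.html#comments",
    "http://example.com/other.html",
]

result = links_to_follow_from(sample_hrefs, base_url)
print(result)
```

With real data, the list passed in would be built from the anchors returned by findAll('a').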
