Question

I'm trying to learn how to use BeautifulSoup for screen scraping.

from urllib import urlopen
from BeautifulSoup import BeautifulSoup
import re

webpage = urlopen('http://feeds.feedburner.com/zenhabits').read()

patFinderTitle = re.compile('<h4 class="itemtitle"><a href=(.*)</a></h4>')

findPatTitle = re.findall(patFinderTitle,webpage)
listIterator = []
listIterator[:] = range(1, 5)

for i in listIterator:
    print findPatTitle[i]
    print("\n")

Error

Traceback (most recent call last):
File "//da-srv1/users/xxxxx/Desktop/fetcher", line 14, in <module>
print findPatTitle[i]
**IndexError: list index out of range**

Answer 1

Use the following expression:

patFinderTitle.findall(webpage)

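Since re.findall(re.compile(<expression>), <string>) only accepts the regular expression as a string, you can't do the equivalent with re.findall - re.compile(<expression>) returns a compiled regular expression object. So you need to take the compiled regular expression object patFinderTitle and call its findall() method (see above).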

Edit: Oh. It turns out you can do re.findall(re.compile(<expression>), <string>) after all. The more you know.
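A minimal sketch of the point in the edit, assuming Python 3 and a hypothetical `sample` string in place of the downloaded feed: the module-level `re.findall` and the compiled pattern's own `findall()` method return the same matches.

```python
import re

# Hypothetical sample text standing in for the downloaded page (not the real feed).
sample = (
    '<h4 class="itemtitle"><a href="/a">First</a></h4>\n'
    '<h4 class="itemtitle"><a href="/b">Second</a></h4>'
)

# Same pattern as in the question, written as a raw string.
pat = re.compile(r'<h4 class="itemtitle"><a href=(.*)</a></h4>')

# Module-level function given a compiled pattern object.
via_module = re.findall(pat, sample)

# Method called on the compiled pattern object.
via_method = pat.findall(sample)

# Loop over the matches themselves instead of fixed index positions.
for match in via_method:
    print(match)
```

Looping over the result directly, rather than over fixed indices such as range(1, 5), also avoids an IndexError when the page yields fewer matches than expected.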

Answer 2

You left the parentheses off the read() function call, so webpage is a function, not a string.

webpage = urlopen('http://feeds.feedburner.com/zenhabits').read()
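To illustrate the difference, here is a small sketch assuming Python 3, with a hypothetical in-memory `response` object standing in for the result of urlopen() so that no network access is needed. Passing the uncalled `read` method to findall() raises a TypeError, while calling it first gives text that the pattern can search.

```python
import io
import re

# Hypothetical stand-in for the object returned by urlopen(); not a real HTTP response.
response = io.StringIO('<h4 class="itemtitle"><a href="/a">First</a></h4>')

pat = re.compile(r'<h4 class="itemtitle"><a href=(.*)</a></h4>')

# Without parentheses this is a bound method, not the page text.
not_called = response.read

try:
    pat.findall(not_called)
    error_name = None
except TypeError as exc:
    # The regex engine rejects anything that is not a string or bytes-like object.
    error_name = type(exc).__name__

# With parentheses the method runs and returns the text.
text = response.read()
titles = pat.findall(text)
```

With a real urllib.request.urlopen response in Python 3, read() returns bytes, so the data would need to be decoded first or the pattern written as a bytes pattern.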

TypeError: expected string when using re.findall on webpage text - why?
