Question

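(Environment: Python 2.7 + BeautifulSoup 4.3.2)

I am using Python and BeautifulSoup to fetch the news headlines on this web page and on the pages that follow it. I don't know how to make it follow the next pages automatically, so I put all the URLs in a text file, web list.txt.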

http://www.legaldaily.com.cn/locality/node_32245.htm
http://www.legaldaily.com.cn/locality/node_32245_2.htm
http://www.legaldaily.com.cn/locality/node_32245_3.htm

...

This is what I have so far:

from bs4 import BeautifulSoup
import re
import urllib2
import urllib


list_open = open("web list.txt")
read_list = list_open.read()
line_in_list = read_list.split("\n")


i = 0
while i < len(line_in_list):
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
i + = 1

It fails with an error message saying the syntax is invalid.

What went wrong?

Answer 1

i + = 1

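This is invalid syntax.

If you want to use the augmented assignment operator +=, there must be no space between the plus sign and the equals sign.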

i += 1

The next error you will get is:

NameError: name 'url' is not defined

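because url is never defined before it is used on the soup = line. You can fix this by iterating directly over the list of URLs instead of incrementing i at all.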

for url in line_in_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read(), 'html')
    news_list = soup.find_all(attrs={'class': "f14 blue001"})
    for news in news_list:
        print news.getText()
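
One further pitfall, not raised in the question: splitting the file contents on "\n" leaves an empty string in the list whenever the file ends with a newline or contains a blank line, and urllib2.urlopen raises an error for an empty URL. Below is a minimal sketch of a more defensive way to load the list. The helper name read_urls is hypothetical, and the UTF-8 encoding is an assumption about how web list.txt was saved.

```python
import io  # io.open behaves the same on Python 2.7 and Python 3


def read_urls(path):
    """Return the non-blank lines of a URL list file, stripped of whitespace.

    Hypothetical helper; assumes one URL per line and a UTF-8 encoded file.
    """
    with io.open(path, encoding="utf-8") as handle:
        # Skip blank lines and trim stray spaces or "\r" around each URL.
        return [line.strip() for line in handle if line.strip()]
```

The returned list can be passed straight to the loop above, for example: for url in read_urls("web list.txt"). Using a with block also closes the file, which the original open() call never does.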

Extracting text from multiple web pages (URLs in a text file)
