Hello... I am using Python and BeautifulSoup to go through my company's web pages and check whether items are valid.
The script is as follows:
from bs4 import BeautifulSoup
import urllib2
import xlwt

pages = [36523, 25658, 85263, 55215]
for page in pages:
    # page is an int, so convert it before joining it to the URL
    url = "http://company.com/" + str(page)
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read())
    page_title = soup.find_all("title")
    print page_title
The output is:
[<title>Nil</title>]
[<title>Item details</title>]
[<title>Nil</title>]
[<title>Item details</title>]
Some items do not exist, and for those the page title is Nil. I want to exclude these Nil entries from the output, so I tried:
if len(page_title) == 20:
    pass

if len(page_title) == 20:
    continue

if page_title == '[<title>Nil</title>]':
    continue  # or pass
None of these worked, and I don't think I am heading in the right direction. So how can I keep the Nil entries out of the results?
Thanks.
page_title = soup.find_all("title")
for each_page in page_title:
    err_msg = soup.find_all(text="Nil")
    if len(err_msg) == 0:
        print each_page
Answer 0 (score: 1)
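You are checking the length of page_title, but you should be looking at its first element, page_title[0].
If page_title = ['<title>Nil</title>'], then len(page_title) = 1, because the list contains a single element. The 20 you were testing for is the length of the printed list, square brackets included; the string '<title>Nil</title>' on its own is 18 characters long.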
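To make the different lengths concrete, here is a small sketch. It is written for Python 3 (the code in this thread is Python 2), and it uses a plain string as a stand-in for the title, which is an assumption for illustration only: find_all() actually returns bs4 Tag objects, and as far as I know len() on a Tag counts its child nodes rather than characters, so a character count is not a reliable test either way.

```python
# A plain string stands in for what the asker saw printed (an assumption for
# illustration; find_all() really returns bs4 Tag objects, not strings).
page_title = ["<title>Nil</title>"]

list_length = len(page_title)                  # number of elements in the list
element_length = len(page_title[0])            # characters in the first element
printed_length = len("[<title>Nil</title>]")   # the printed list, brackets included

print(list_length, element_length, printed_length)
```

Comparing the title text itself is usually less fragile than comparing lengths.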
So what you should basically do is:
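for page in pages:
    url = "http://company.com/" + str(page)  # page is an int, so convert it first
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read())
    page_title = soup.find_all("title")  # This will return a list of titles
    for title in page_title:
        # Compare the tag's markup as a string; a bs4 Tag does not compare equal to a plain str
        if str(title) != "<title>Nil</title>":
            print title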
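For anyone who wants to try the filtering without hitting a server, here is a minimal sketch in Python 3 that compares the title text instead of the markup. SAMPLE_PAGES, valid_titles and missing_marker are hypothetical names made up for this example, and the HTML strings are invented stand-ins for the real pages.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the downloaded pages (not real company data).
SAMPLE_PAGES = {
    36523: "<html><head><title>Nil</title></head><body></body></html>",
    25658: "<html><head><title>Item details</title></head><body></body></html>",
}


def valid_titles(pages_html, missing_marker="Nil"):
    """Return {page_id: title_text} for pages whose title is not the marker."""
    results = {}
    for page_id, html in pages_html.items():
        soup = BeautifulSoup(html, "html.parser")
        for title in soup.find_all("title"):
            # get_text() gives the text inside the tag, without the markup
            text = title.get_text(strip=True)
            if text != missing_marker:
                results[page_id] = text
    return results


print(valid_titles(SAMPLE_PAGES))
```

Comparing the text means the check keeps working even if the tag gains attributes or surrounding whitespace.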
Answer 1 (score: 0)
Change this line:
page_title = soup.find_all("title")
to:
page_title = (title for title in soup.find_all("title") if "Nil" not in title)
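One thing to keep in mind with this approach: page_title is now a generator rather than a list, so printing it directly shows a generator object instead of the titles, and it can only be iterated once. The sketch below (Python 3) illustrates that behaviour. It uses plain strings as hypothetical stand-ins for the title texts, and an equality test in place of `"Nil" not in title`, because `in` on a plain string is a substring check.

```python
# Plain strings stand in for title texts (hypothetical data, not bs4 Tags).
titles = ["Nil", "Item details", "Nil", "Item details"]

# Same shape as the generator expression in the answer above.
page_title = (title for title in titles if title != "Nil")

first_pass = list(page_title)   # consumes the generator
second_pass = list(page_title)  # nothing left to yield

print(first_pass)
print(second_pass)
```

If the titles are needed more than once, wrapping the expression in list(...) or using a list comprehension avoids the surprise.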