I've been struggling with this code:
import requests
from bs4 import BeautifulSoup as bs

def MainPageSpider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'url' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = bs(plain_text, 'html.parser')
        # Each listing entry on the index page has class "col4"
        for link in soup.findAll(attrs={'class':'col4'}):
            href = 'url' + link.a['href']
            title = link.span.text
            PostPageItems(href)
        page += 1

def PostPageItems(post_url):
    source_code = requests.get(post_url)
    plain_text = source_code.text
    soup = bs(plain_text, 'html.parser')
    for items in soup.findAll(attrs={'class':'container'}):
        title2 = items.find('h1', {'class':'title'}).get_text()
        print(title2)

MainPageSpider(1)
Every time I try to get the text from the 'h1', I get this error:
Traceback (most recent call last):
  File "Xfeed.py", line 33, in <module>
    MainPageSpider(1)
  File "Xfeed.py", line 17, in MainPageSpider
    PostPageItems(href)
  File "Xfeed.py", line 27, in PostPageItems
    test = title2.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
But when I run it without 'get_text()', I get the 'h1' HTML:
<h1 class="title">Title 1</h1>
None
None
None
None
<h1 class="title">Title 2</h1>
None
None
None
None
<h1 class="title">Title 3</h1>
None
None
None
None
I really don't understand why I get this error, because with title = link.span.text I have no problem getting the text.
I only want the text.
Answer 0 (score: 0)
Not every container has an h1, so find returns None for the ones that don't. Check the result and print only when an element was actually found.
for items in soup.findAll(attrs={'class':'container'}):
    title2 = items.find('h1', {'class':'title'})
    if title2:
        print(title2.text)
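To try this check without hitting a real site, here is a minimal sketch that runs it against a small inline HTML string. The markup in SAMPLE_HTML and the helper name extract_titles are made up for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a post page: only some
# "container" elements hold an <h1 class="title">.
SAMPLE_HTML = """
<div class="container"><h1 class="title">Title 1</h1></div>
<div class="container"><p>no heading here</p></div>
<div class="container"><h1 class="title">Title 2</h1></div>
"""

def extract_titles(html):
    """Return the text of every h1.title found inside a container."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for item in soup.find_all(attrs={"class": "container"}):
        heading = item.find("h1", {"class": "title"})
        # find() returns None when the container has no matching h1
        if heading is not None:
            titles.append(heading.get_text())
    return titles

titles = extract_titles(SAMPLE_HTML)
print(titles)
```

Comparing against None explicitly is a small design choice: it states the intent directly instead of relying on the truthiness of the found tag.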
Answer 1 (score: 0)
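Judging by the output without get_text(), title2 is often None. Since None has no get_text() attribute, the call fails with the error you posted. You can split it into two statements and add a check like this: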
title2_item = items.find('h1', {'class':'title'})
if title2_item:  # Check for None
    title2 = title2_item.get_text()
    print(title2)
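If it helps to see the failure in isolation, the following sketch reproduces it on a one-line HTML snippet. The snippet itself is invented for illustration; the point is only that find returns None when nothing matches, and calling get_text() on None raises the AttributeError from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical container with no <h1 class="title"> inside it
soup = BeautifulSoup('<div class="container"><p>text</p></div>', "html.parser")

missing = soup.find("h1", {"class": "title"})
print(missing is None)  # find() found nothing, so it returned None

error_message = ""
try:
    missing.get_text()  # same call that fails in the question
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```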
Answer 2 (score: 0)
Rewrite it with a CSS selector that selects only the matching elements:
for item in soup.select('.container h1.title'):
    title2 = item.text
    print(title2)
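As a rough, self-contained illustration of the selector approach, the sketch below applies the same selector to an inline HTML string. SAMPLE_HTML and select_titles are hypothetical names, and the markup is only a guess at what the page might look like. Because the selector matches only h1.title elements inside a container, no None check is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two containers with a title, one without
SAMPLE_HTML = """
<div class="container"><h1 class="title">Title 1</h1></div>
<div class="container"><span>nothing to see</span></div>
<div class="container"><h1 class="title">Title 2</h1></div>
"""

def select_titles(html):
    """Return the text of each h1.title nested in a .container element."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns only elements that match, so the list never holds None
    return [heading.get_text() for heading in soup.select(".container h1.title")]

selected = select_titles(SAMPLE_HTML)
print(selected)
```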