I am trying to run the following code for web crawling.
import requests
from bs4 import BeautifulSoup

def function1():
    url = "http://www.iitg.ac.in/"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for link in soup.findALL('a', {'target' : '_blank'}):
        href = link.get(href)
        print(href)

function1()
But it raises the following error:

  File "C:/Users/HP/.spyder-py3/temp.py", line 9, in function1
    for link in soup.findALL('a', {'target' : '_blank'}):
TypeError: 'NoneType' object is not callable

I have already searched this site for a solution, but I cannot see how findALL could be a non-callable object. Please help.
Answer 0 (score: 0)

The page may not contain any a elements with target="_blank". You can add an if check so that you only iterate when findAll actually returns results.
import requests
from bs4 import BeautifulSoup

def function1():
    url = "http://www.iitg.ac.in/"
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    links = soup.findAll('a', {'target' : '_blank'})
    if links:
        for link in links:
            href = link.get('href')  # the attribute name must be a string
            print(href)
    else:
        print('No links found in page!')
Answer 1 (score: 0)

The method name is case-sensitive: use soup.find_all() or soup.findAll(). Otherwise, soup.findALL returns None, and you then try to call it.
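A minimal sketch of that mechanism, using an inline HTML snippet rather than a live URL (the snippet and variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = '<a href="/x" target="_blank">x</a><a href="/y">y</a>'
soup = BeautifulSoup(html, 'html.parser')

# BeautifulSoup treats an unknown attribute such as findALL as a tag-name
# lookup; no <findALL> tag exists, so the lookup yields None, and calling
# it raises "TypeError: 'NoneType' object is not callable".
print(soup.findALL)

# The correctly spelled method works as expected:
for link in soup.find_all('a', {'target' : '_blank'}):
    print(link.get('href'))  # /x
```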
Answer 2 (score: 0)

Your URL does not work for me. Are you trying to print all href attributes from a URL? Try the sample script below.
import lxml.html

doc = lxml.html.parse('http://www.gpsbasecamp.com/national-parks')
links = doc.xpath('//a[@href]')
for link in links:
    print(link.attrib['href'])
Result:
/contact-gpsbasecamp.php
/privacy-policy.php
/terms-of-service.php
/
National-Parks/map
/National-Historic-Parks
/National-Historic-Sites
/National-Monuments
/Other-NPS-Facilities
national-parks/Acadia_National_Park
national-parks/Arches_National_Park
national-parks/Badlands_National_Park
national-parks/Big_Bend_National_Park
national-parks/Biscayne_National_Park
national-parks/Black_Canyon_Of_The_Gunnison_National_Park
etc., etc., etc.
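The same XPath approach can also be tried offline, which is handy when a URL is unreachable; a minimal sketch using lxml.html.fromstring on an inline snippet (the HTML content is illustrative):

```python
import lxml.html

# Illustrative HTML standing in for a fetched page.
html = '''
<html><body>
  <a href="/contact.php">Contact</a>
  <a href="/privacy-policy.php">Privacy</a>
  <a name="no-href-anchor">Skip me</a>
</body></html>
'''

doc = lxml.html.fromstring(html)
# //a[@href] matches only anchors that actually carry an href attribute,
# so the named anchor above is skipped.
for link in doc.xpath('//a[@href]'):
    print(link.attrib['href'])
```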