I am trying to do this:
```python
req = urllib.request.Request("http://en.wikipedia.org/wiki/Philosophy")
content = urllib.request.urlopen(req).read()
soup = bs4.BeautifulSoup(content, "html.parser")
content = strip_brackets(soup.find('div', id="bodyContent").p)
for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):
    print(link.get("href"))
```
If I instead loop like this:
```python
for link in soup.findAll("a"):
    print(link.get("href"))
```
I no longer get the error, but I want to first strip the bracketed text from the content and then get all of its links.
The error (line 36 is the line with the for loop):
```
Traceback (most recent call last):
  File "....py", line 36, in <module>
    for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):
  File "C:\Users\...\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
    markup = markup.read()
TypeError: 'NoneType' object is not callable
```
What am I doing wrong?
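For reference, `strip_brackets` is not shown in the question. A hypothetical stand-in (the name is the asker's; the regex body is an assumption, not their code) would remove parenthesized runs from the paragraph's HTML and return a plain string:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the asker's strip_brackets (body not shown
# in the question): drop parenthesized runs and return the HTML as a
# string, so the result can safely be re-parsed by BeautifulSoup.
def strip_brackets(tag):
    return re.sub(r"\([^)]*\)", "", str(tag))

html = '<p>Philosophy (from Greek) is the <a href="/wiki/Study">study</a> of things.</p>'
p = BeautifulSoup(html, "html.parser").p
content = strip_brackets(p)
print(content)
```

Returning a string (via `str(tag)`) rather than a `Tag` is the important part for avoiding the traceback above.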
Answer 0 (score: 4)
Your end goal is to get a list of the links, right?
This will give you the links:
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

content = urlopen('http://en.wikipedia.org/wiki/Philosophy')
soup = BeautifulSoup(content, "html.parser")
base = soup.find('div', id="bodyContent")
for link in BeautifulSoup(str(base), "html.parser").findAll("a"):
    if 'href' in link.attrs:
        print(link['href'])
```
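The `str(base)` conversion is most likely what avoids the question's TypeError: a bs4 `Tag` answers unknown attribute lookups by searching for a child tag of that name, so `tag.read` is `None` rather than a file-like `read()` method; `BeautifulSoup.__init__` sees a `read` attribute, calls it, and raises `'NoneType' object is not callable`. A minimal sketch with local HTML (no network):

```python
from bs4 import BeautifulSoup

html = '<div id="bodyContent"><p>Text (aside) with <a href="/wiki/X">a link</a>.</p></div>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", id="bodyContent").p

# Tag.__getattr__ falls back to find(), so an unknown attribute
# lookup returns None instead of raising AttributeError:
print(tag.read)  # None

# Passing the Tag straight to BeautifulSoup would end up calling
# tag.read(). Converting to a string first avoids that:
inner = BeautifulSoup(str(tag), "html.parser")
print([a.get("href") for a in inner.find_all("a")])  # ['/wiki/X']
```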
Answer 1 (score: 1)
What exactly do you want to strip out? You can do it like this:
```python
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

url = "http://en.wikipedia.org/wiki/Philosophy"
soup = bs(urlopen(url), "html.parser")
links = soup.find('div', id="bodyContent").p.findAll("a")
for link in links:
    print(link.get("href"))
```
Answer 2 (score: 1)
I don't understand what you really want. With your code:
```python
import urllib.request
import bs4

req = urllib.request.Request("http://en.wikipedia.org/wiki/Philosophy")
content = urllib.request.urlopen(req).read()
soup = bs4.BeautifulSoup(content, "html.parser")
for link in soup.findAll("a"):
    print(link.get("href"))
```
```
https://zh.wikipedia.org/wiki/%E5%93%B2%E5%AD%A6
https://www.wikidata.org/wiki/Q5891#sitelinks-wikipedia
//en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
//creativecommons.org/licenses/by-sa/3.0/
//wikimediafoundation.org/wiki/Terms_of_Use
//wikimediafoundation.org/wiki/Privacy_policy
//www.wikimediafoundation.org/
https://wikimediafoundation.org/wiki/Privacy_policy
/wiki/Wikipedia:About
/wiki/Wikipedia:General_disclaimer
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://www.mediawiki.org/wiki/Special:MyLanguage/How_to_contribute
https://wikimediafoundation.org/wiki/Cookie_statement
//en.m.wikipedia.org/w/index.php?title=Philosophy&mobileaction=toggle_view_mobile
https://wikimediafoundation.org/
//www.mediawiki.org/
1847
```
With Dmitry's code:

```
/wiki/Help:Category
/wiki/Category:Philosophy
/wiki/Category:CS1_maint:_Uses_editors_parameter
/wiki/Category:Pages_using_ISBN_magic_links
/wiki/Category:Wikipedia_indefinitely_semi-protected_pages
/wiki/Category:Use_dmy_dates_from_April_2016
/wiki/Category:Articles_containing_Ancient_Greek-language_text
/wiki/Category:Articles_containing_Sanskrit-language_text
/wiki/Category:All_articles_with_unsourced_statements
/wiki/Category:Articles_with_unsourced_statements_from_May_2016
/wiki/Category:Articles_containing_potentially_dated_statements_from_2016
/wiki/Category:All_articles_containing_potentially_dated_statements
/wiki/Category:Articles_with_DMOZ_links
/wiki/Category:Wikipedia_articles_with_LCCN_identifiers
/wiki/Category:Wikipedia_articles_with_GND_identifiers
1592
```
I ran both programs with this command:

```shell
python s2.py | tee >(wc -l)
```

The second part counts the number of lines printed to the screen.
Answer 3 (score: 0)
Instead of `for link in bs4.BeautifulSoup(content, "html.parser").findAll("a"):`,
try using `for link in content.findAll('a'):`.
There is no need to re-parse `content`.
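This works because a `Tag` supports the same search API (`findAll`/`find_all`) as the soup itself, so no second parsing pass is needed. Note it only applies while `content` is still a `Tag`; if `strip_brackets` returns a plain string, that string does have to be parsed first. A small sketch:

```python
from bs4 import BeautifulSoup

html = '<div id="bodyContent"><p>See <a href="/wiki/A">A</a> and <a href="/wiki/B">B</a>.</p></div>'
soup = BeautifulSoup(html, "html.parser")
para = soup.find("div", id="bodyContent").p

# A Tag can be searched directly, so there is no need to build a
# second BeautifulSoup object from it:
links = [a.get("href") for a in para.find_all("a")]
print(links)  # ['/wiki/A', '/wiki/B']
```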