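I can't get this script to pull information from a series of Wikipedia articles.
What I'm trying to do is iterate over a series of wiki URLs and extract the page links from each Wikipedia portal category (e.g. https://en.wikipedia.org/wiki/Category:Electronic_design).
I know that all the wiki pages I'm going through have a page links section.
But when I try to iterate over them, I get this error message: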
Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'
Why am I getting this error?
The file I read in the first part looks like this:
1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines
4 Category:Activism
5 Category:Activists
6 Category:Actors
7 Category:Aerobics
8 Category:Aerospace_engineering
9 Category:Aesthetics
and it is stored in the port_ID dict like this:
{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture', ...}
The desired output is:
parent_num, page_ID, page_num
I realize the code is a bit hackish, but I'm just trying to get it working:
#!/usr/bin/env python
import os,re,nltk
from bs4 import BeautifulSoup
from urllib import urlopen
url = "https://en.wikipedia.org/wiki/"+'Category:Furniture'
rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'
reg = re.compile('[\w]+:[\w]+')
number=1
port_ID = {}
for root,dirs,files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number]=file
            number+=1
test_file = open('test_file.csv', 'w')
for key, value in port_ID.iteritems():
    url = "https://en.wikipedia.org/wiki/"+str(value)
    raw = urlopen(url).read()
    soup=BeautifulSoup(raw)
    pages = soup.find("div" , { "id" : "mw-pages" })
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = test = port_ID.items()[0]
    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0],item,page_ID))
        page_ID+=1
    page_ID = 1
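Answer 0 (score: 2)
You are scraping several pages in a loop, but some of those pages may not contain a <div id="mw-pages"> tag. For such a page soup.find returns None, so you get the AttributeError at the line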
cleaned = pages.get_text()
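To see the failure in isolation, here is a minimal sketch. It assumes Python 3 with bs4 installed, and the two HTML snippets are made up for illustration rather than taken from real Wikipedia pages. It shows that find returns a tag when the div is present and None when it is not:

```python
from bs4 import BeautifulSoup

# Two made-up snippets (assumptions for illustration):
# one contains the "mw-pages" div, the other does not.
with_pages = '<div id="mw-pages"><a href="/wiki/Chair">Chair</a></div>'
without_pages = '<div id="mw-subcategories"><a href="/wiki/Category:Beds">Beds</a></div>'

found = BeautifulSoup(with_pages, "html.parser").find("div", {"id": "mw-pages"})
missing = BeautifulSoup(without_pages, "html.parser").find("div", {"id": "mw-pages"})

print(found is None)    # False
print(missing is None)  # True, so missing.get_text() would raise AttributeError
```

Any category page that lists only subcategories, or that fails to load as expected, would fall into the second case.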
You can guard against this with an if check, like:
if pages:
    # do stuff
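As a rough sketch of how that guard might look once wrapped in a helper (assuming Python 3 and bs4; the function name page_lines is hypothetical, and unlike the question's code it does not apply the [4:-2] slice that trims header and footer lines):

```python
from bs4 import BeautifulSoup

def page_lines(html):
    """Return the non-empty text lines inside div#mw-pages, or [] if the div is absent.

    Hypothetical helper for illustration; not part of the original script.
    """
    soup = BeautifulSoup(html, "html.parser")
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        # No "Pages in category" section on this page: nothing to extract.
        return []
    return [line.strip() for line in pages.get_text().split("\n") if line.strip()]
```

Returning an empty list lets the calling loop treat a page without the section the same way as a page with zero entries.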
Or you can use a try-except block to handle it:
try:
    cleaned = pages.get_text()
    # do stuff
except AttributeError as e:
    # do something
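One possible way to apply the try-except variant inside the loop is sketched below, assuming Python 3 and bs4. The name collect_rows and the html_by_category argument are hypothetical: the sketch takes already-downloaded HTML keyed by category name instead of calling urlopen, so that it runs without network access. Categories lacking the div are recorded and skipped rather than aborting the run.

```python
from bs4 import BeautifulSoup

def collect_rows(html_by_category):
    """Build (parent_num, page_title, page_num) rows and list the skipped categories.

    Hypothetical helper: html_by_category maps a category name to its page HTML.
    """
    rows, skipped = [], []
    items = sorted(html_by_category.items())
    for parent_num, (category, html) in enumerate(items, start=1):
        pages = BeautifulSoup(html, "html.parser").find("div", {"id": "mw-pages"})
        try:
            text = pages.get_text()
        except AttributeError:
            # pages is None: this category page has no mw-pages div.
            skipped.append(category)
            continue
        titles = [t.strip() for t in text.split("\n") if t.strip()]
        for page_num, title in enumerate(titles, start=1):
            rows.append((parent_num, title, page_num))
    return rows, skipped
```

Keeping a list of skipped categories makes it easy to check afterwards which pages lacked the section, instead of silently dropping them.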