Beautiful Soup and Wikipedia

Asked: 2015-06-21 19:06:03

Tags: python beautifulsoup wiki wikipedia information-retrieval

I can't get this script to retrieve information from a series of Wikipedia articles.

What I am trying to do is iterate over a series of wiki URLs and extract the page links from Wikipedia portal category pages (for example https://en.wikipedia.org/wiki/Category:Electronic_design).

I know that all of the wiki pages I am going through have a page-links section. But when I try to iterate through them, I get this error message:

Traceback (most recent call last):
  File "./wiki_parent.py", line 37, in <module>
    cleaned = pages.get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

Why am I getting this error?

The file I read in the first part looks like this:

1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines
4 Category:Activism
5 Category:Activists
6 Category:Actors
7 Category:Aerobics
8 Category:Aerospace_engineering
9 Category:Aesthetics

and stores it in the port_ID dict like so:

{1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines', 4: 'Category:Activism', 5: 'Category:Activists', 6: 'Category:Actors', 7: 'Category:Aerobics', 8: 'Category:Aerospace_engineering', 9: 'Category:Aesthetics', 10: 'Category:Agnosticism', 11: 'Category:Agriculture' ...}
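A numbered list like the one above can be parsed into such a dict in a couple of lines; this is a sketch only (the in-memory string stands in for the real file, whose exact format is an assumption from the excerpt):

```python
# Hypothetical sketch: build a port_ID-style dict from "<number> <category>" lines.
raw = """1 Category:Abrahamic_mythology
2 Category:Abstraction
3 Category:Academic_disciplines"""

port_ID = {}
for line in raw.splitlines():
    num, name = line.split(None, 1)  # split on the first run of whitespace
    port_ID[int(num)] = name

print(port_ID)
# {1: 'Category:Abrahamic_mythology', 2: 'Category:Abstraction', 3: 'Category:Academic_disciplines'}
```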

The desired output is:

parent_num, page_ID, page_num

I realize the code is a bit hackish, but I am just trying to get it to work:

#!/usr/bin/env python
import os,re,nltk
from bs4 import BeautifulSoup
from urllib import urlopen
url = "https://en.wikipedia.org/wiki/"+'Category:Furniture'

rootdir = '/Users/joshuavaldez/Desktop/L1/en.wikipedia.org/wiki'

reg = re.compile('[\w]+:[\w]+')
number=1
port_ID = {}
for root,dirs,files in os.walk(rootdir):
    for file in files:
        if reg.match(file):
            port_ID[number]=file
            number+=1


test_file = open('test_file.csv', 'w')

for key, value in port_ID.iteritems():

    url = "https://en.wikipedia.org/wiki/"+str(value)
    raw = urlopen(url).read()
    soup=BeautifulSoup(raw)
    pages = soup.find("div" , { "id" : "mw-pages" })
    cleaned = pages.get_text()
    cleaned = cleaned.encode('utf-8')
    pages = cleaned.split('\n')
    pages = pages[4:-2]
    test = port_ID.items()[0]

    page_ID = 1
    for item in pages:
        test_file.write('%s %s %s\n' % (test[0],item,page_ID))
        page_ID+=1
    page_ID = 1
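For reference, the extraction step in that loop can be written as a small guarded helper. This is only a sketch, not the asker's code: it uses `find_all('a')` on the div instead of splitting `get_text()` output on newlines, and it returns an empty list when the section is missing (the function name and sample HTML are illustrative):

```python
from bs4 import BeautifulSoup

def page_links(html):
    """Return the link titles inside <div id="mw-pages">, or [] if the div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": "mw-pages"})
    if div is None:           # category page has no page-links section
        return []
    return [a.get_text() for a in div.find_all("a")]

sample = '<div id="mw-pages"><a href="/wiki/Chair">Chair</a><a href="/wiki/Table">Table</a></div>'
print(page_links(sample))           # ['Chair', 'Table']
print(page_links("<p>empty</p>"))   # []
```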

1 Answer:

Answer 0 (score: 2):

You are scraping several pages in a loop, but some of those pages may not have a <div id="mw-pages"> tag at all. In that case soup.find returns None, so you get the AttributeError at:

cleaned = pages.get_text()
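A minimal reproduction of the failure, with inline HTML standing in for a fetched category page:

```python
from bs4 import BeautifulSoup

# A page with no <div id="mw-pages"> section, standing in for a fetched article.
soup = BeautifulSoup("<html><body><p>empty category</p></body></html>", "html.parser")

pages = soup.find("div", {"id": "mw-pages"})
print(pages)  # None -- find() returns None when nothing matches

try:
    pages.get_text()  # the same call that fails in the question
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'get_text'
```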

You can guard against it with an if check, like:

if pages:
    # do stuff

Or you can avoid it with a try-except block:

try:
    cleaned = pages.get_text()
    # do stuff
except AttributeError as e:
    # do something
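Either way, applied to the loop in the question it might look like this (a runnable sketch: the inline HTML strings stand in for pages fetched with urlopen):

```python
from bs4 import BeautifulSoup

# Two fake category pages: one with the section, one without.
raw_pages = [
    '<div id="mw-pages"><a href="/wiki/Chair">Chair</a></div>',
    "<p>category with no page links</p>",
]

for raw in raw_pages:
    soup = BeautifulSoup(raw, "html.parser")
    pages = soup.find("div", {"id": "mw-pages"})
    if pages is None:
        continue  # skip categories without a page-links section
    print(pages.get_text())  # only the first fake page prints: Chair
```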