Parsing Wikipedia with Python - saving the nodes to an .edges file

Time: 2015-12-23 20:59:06

Tags: python parsing python-3.x html-parsing wikipedia

I have this Python code that collects 1000 nodes from Wikipedia pages: three levels deep, 10 nodes per page.

import urllib.request as urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('https://en.wikipedia.org/wiki/Computer_science').read()
soup = BeautifulSoup(html, "lxml")

# first depth = list1
for link in soup.find_all('a', href=True, title=True)[:10]:
    print(link['href'])

    # second depth = list2
    sub_html = urllib2.urlopen('https://en.wikipedia.org' + link['href'])
    sub_soup = BeautifulSoup(sub_html, "lxml")
    for sub_link in sub_soup.find_all('a', href=True, title=True)[:10]:
        print(sub_link['href'])

        # third depth = list3 (fetch the page behind sub_link, not link)
        sub_sub_html = urllib2.urlopen('https://en.wikipedia.org' + sub_link['href'])
        sub_sub_soup = BeautifulSoup(sub_sub_html, "lxml")
        for sub2_link in sub_sub_soup.find_all('a', href=True, title=True)[:10]:
            print(sub2_link['href'])

Next, I need to save all the nodes to an edges file. The format I want is:

“edge_from_list1”, “edge_from_list2”;

“edge_from_list1”, “edge_from_list2”;

.......

“edge_from_list2”, “edge_from_list3”

“edge_from_list2”, “edge_from_list3”

...

Can anyone give me a hint on how to do this?
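
One possible direction, as a minimal sketch rather than a definitive answer: record a (parent, child) pair every time a link is followed, then write one line per pair. The function names `wikipedia_links`, `collect_edges` and `write_edges` are hypothetical, as is the file name `wiki.edges`. The sketch also assumes straight double quotes and a trailing semicolon on every line, and it assumes every href is a site-relative path such as `/wiki/...`, as the code above does.

```python
import urllib.request


def wikipedia_links(href, limit=10):
    """Return up to `limit` hrefs from one page, like the loops in the question."""
    # Imported here so the edge logic below can be tried without bs4 installed.
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen('https://en.wikipedia.org' + href).read()
    soup = BeautifulSoup(html, "lxml")
    return [a['href'] for a in soup.find_all('a', href=True, title=True)[:limit]]


def collect_edges(start, get_links, depth=3):
    """Collect (parent, child) pairs between consecutive levels.

    The links found on `start` form list1; no edge is recorded from `start`
    itself, matching the list1 -> list2 -> list3 format in the question.
    """
    edges = []
    frontier = get_links(start)  # list1
    for _ in range(depth - 1):
        next_frontier = []
        for parent in frontier:
            for child in get_links(parent):
                edges.append((parent, child))
                next_frontier.append(child)
        frontier = next_frontier
    return edges


def write_edges(edges, path):
    """Write one '"source", "target";' line per edge."""
    with open(path, 'w', encoding='utf-8') as f:
        for source, target in edges:
            f.write('"{}", "{}";\n'.format(source, target))
```

It could then be used as `write_edges(collect_edges('/wiki/Computer_science', wikipedia_links), 'wiki.edges')`. With three levels and 10 links per page this fetches 111 pages, so it will take a while. Passing the link fetcher in as an argument makes it possible to try the edge logic on a small in-memory dictionary before touching the network.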

0 answers:

No answers