Question

I have this Python code that collects 1000 nodes from Wikipedia pages: three levels deep, with 10 nodes per page.

import urllib.request as urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('https://en.wikipedia.org/wiki/Computer_science').read()
soup = BeautifulSoup(html, "lxml")

# first depth = list1
for link in soup.find_all('a', href=True, title=True)[:10]:
    print(link['href'])

    # second depth = list2
    sub_html = urllib2.urlopen('https://en.wikipedia.org' + link['href'])
    sub_soup = BeautifulSoup(sub_html, "lxml")
    for sub_link in sub_soup.find_all('a', href=True, title=True)[:10]:
        print(sub_link['href'])

        # third depth = list3 (fetch the second-level page, not the first-level one again)
        sub_sub_html = urllib2.urlopen('https://en.wikipedia.org' + sub_link['href'])
        sub_sub_soup = BeautifulSoup(sub_sub_html, "lxml")
        for sub2_link in sub_sub_soup.find_all('a', href=True, title=True)[:10]:
            print(sub2_link['href'])

Next, I need to save all the nodes in an edge file. The format I have in mind is:

"edge_from_list1", "edge_from_list2";

.......

"edge_from_list2", "edge_from_list3"

...

Can anyone give me a hint on how to do this?

Answer 1

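I think you are reinventing the ~~wheel~~ web crawler here. Tools like Scrapy or PySpider would make this simpler and faster. Data export is also built into these tools; see, for example, Item Exporters in Scrapy.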

If you still want to continue with BeautifulSoup and urllib, you should use csv.writer with quoting set to csv.QUOTE_ALL.
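
A minimal sketch of that csv.writer approach might look like the following. The names collect_edges, write_edges and get_links are made up for illustration and are not part of any library; get_links is assumed to be a function you supply that takes a page href and returns the list of hrefs found on that page. Note that with depth=3 this also records the edges from the start page to list1, which the question does not explicitly ask for.

```python
import csv


def collect_edges(start, get_links, depth=3, per_page=10):
    """Return (source, target) pairs by following links level by level.

    get_links is a caller-supplied function (hypothetical name):
    it maps a page href to the list of hrefs found on that page.
    """
    edges = []
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for source in frontier:
            # Keep only the first per_page links, as in the question.
            for target in get_links(source)[:per_page]:
                edges.append((source, target))
                next_frontier.append(target)
        frontier = next_frontier
    return edges


def write_edges(edges, path):
    """Write one row per edge, with every field quoted."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerows(edges)
```

For the code in the question, get_links would open 'https://en.wikipedia.org' + href with urllib and return the href of each tag from soup.find_all('a', href=True, title=True). Each output row then looks like "source","target". The csv module does not add the space after the comma or the trailing semicolon shown in the question; if you really need the semicolon, passing lineterminator=';\n' to csv.writer is one option.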
