How can I scrape multiple pages at the same time with bs4?

Asked: 2018-12-06 10:01:50

Tags: html python-3.x beautifulsoup praw

I want to collect comments from reddit, and I use praw to fetch submission IDs such as a2rp5i. For example, I have already collected a set of IDs:

# imports assumed from the usual bs4 scraping setup (uReq aliases urlopen, soup aliases BeautifulSoup)
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

docArr = ['a14bfr', '9zlro3', 'a2pz6f', 'a2n60r', 'a0dlj3']

my_url = "https://old.reddit.com/r/Games/comments/a0dlj3/"
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content_containers = page_soup.findAll("div", {"class": "md"})        # comment bodies
timestamp_containers = page_soup.findAll("p", {"class": "tagline"})   # taglines holding the timestamps
time = timestamp_containers[0].time.get('datetime')                   # ISO-format timestamp string
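For context, the IDs in docArr can be collected with praw along these lines. This is only a rough sketch: the client credentials, the subreddit, and the limit are placeholder assumptions, not values from the original post.

import praw

# hypothetical credentials -- replace with your own reddit app settings
reddit = praw.Reddit(client_id="YOUR_ID",
                     client_secret="YOUR_SECRET",
                     user_agent="comment-scraper by u/yourname")

# collect a few recent submission IDs from r/Games (assumed to be how docArr was built)
docArr = [submission.id for submission in reddit.subreddit("Games").new(limit=5)]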

I want to use the timestamp as the file name and save the comment text to a txt file:

outfile = open('%s.txt' % time, "w")
for content_container in content_containers:
    # compare the tag's text, not the Tag object itself, to skip the "(self.games)" body
    if content_container.text == "(self.games)":
        continue
    data = content_container.text.encode('utf8').decode('cp950', 'ignore')
    outfile.write(data)
outfile.close()
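One caveat with this step: the datetime attribute normally contains colons, which are not allowed in Windows file names, and the encode/decode round trip can drop characters that cp950 cannot represent. A minimal alternative, assuming plain UTF-8 output is acceptable, would be:

safe_name = time.replace(':', '-')     # colons are invalid in Windows file names
with open('%s.txt' % safe_name, 'w', encoding='utf8') as outfile:
    for content_container in content_containers:
        if content_container.text == "(self.games)":
            continue
        outfile.write(content_container.text)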

This approach works fine when I save a single URL, but I want to do the same for every ID in docArr at once:

url_test = "https://old.reddit.com/r/Games/comments/{}/"
for i in set(docArr):
    url = url_test.format(i)

It gives me the correct URLs, but how can I save the time and content_container of every URL in docArr in one run?

1 Answer:

Answer 0 (score: 0)

You only need to indent your current code so that it runs inside the loop:

for i in docArr:
    url = url_test.format(i)
    uClient = uReq(url)
    ....
    ....
    outfile = open('%s.txt' % time , "w") 
    for content_container in content_containers:
        ....
        ....
    outfile.close()
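Putting it together, the full loop might look like the sketch below. It is simply the code from the question indented into the loop over docArr, with nothing new added except comments; each submission gets its own output file named after its timestamp.

url_test = "https://old.reddit.com/r/Games/comments/{}/"

for i in docArr:
    url = url_test.format(i)

    # fetch and parse this submission's page
    uClient = uReq(url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

    content_containers = page_soup.findAll("div", {"class": "md"})
    timestamp_containers = page_soup.findAll("p", {"class": "tagline"})
    time = timestamp_containers[0].time.get('datetime')

    # one txt file per submission, named after its timestamp
    outfile = open('%s.txt' % time, "w")
    for content_container in content_containers:
        if content_container.text == "(self.games)":
            continue
        data = content_container.text.encode('utf8').decode('cp950', 'ignore')
        outfile.write(data)
    outfile.close()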