Question

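I am trying to read each line of an existing .txt file of partial URLs (one per line), strip the %0A from the end of each line, add a prefix to each URL to complete it, and then download the HTML file for each completed URL to my hard drive for later processing with BeautifulSoup.

The code below works well, except for two problems:

1) Each downloaded HTML file contains all of the HTML data when opened offline (when viewing the file's source), but shows no visible data other than the header/banner when opened in Firefox, and

2) The script throws "oidstripped[j] = str(offenderid[j]) IndexError: list assignment index out of range" every time it runs, at j = 51. It correctly downloads the files for j = 1 to 50, but then crashes and does not continue.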

#snip#
j = 0
with open('offenderurls.txt') as r:
    offenderid = r.readlines()
    while j < len(offenderid):
        oidstripped = []
        for l in offenderid[j]:
            oidstripped.append(l)
        oidstripped[j] = str(offenderid[j])
        oidstripped[j] = oidstripped[j][:-1]
        res = requests.get('http://www.icrimewatch.net/' + str(oidstripped[j]), stream=True)
        type(res)
        res.raise_for_status()
        with open('Offenderpage' + str(j) + '.html', 'wb') as playFile:
            for chunk in res.iter_content(1024):
                playFile.write(chunk)
            playFile.close()
        j = j + 1

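Please help! I am new to Python. No need to be gentle; I have thick skin. All suggestions will be considered and appreciated.

A sample offenderurls.txt containing 55 entries is at: https://pastebin.ca/3886683

Thanks!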

Answer 1

If I understand your goal, this should help.

You don't need the intermediate list at all. Try something along the lines of

for one_id in offenderid:
    res = requests.get('http://www.icrimewatch.net/' + one_id.rstrip('\n'), stream=True)

instead of the while loop.
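To make the suggestion concrete, here is a minimal sketch of how the whole script might look once the while loop is replaced. The helper names (build_urls, fetch_with_requests, download_pages) and the 30-second timeout are my own assumptions for illustration, not part of the original script, and I have only exercised the file-writing part with a stand-in fetch function, not against the real site.

```python
from pathlib import Path

BASE_URL = 'http://www.icrimewatch.net/'


def build_urls(lines, base=BASE_URL):
    """Strip surrounding whitespace from each partial URL and add the prefix.

    Blank lines are skipped so a trailing empty line does not produce a bogus URL.
    """
    return [base + line.strip() for line in lines if line.strip()]


def fetch_with_requests(url):
    """Download one page with requests and return its raw bytes."""
    import requests  # imported lazily so the other helpers work without it
    res = requests.get(url, timeout=30)  # timeout value is an assumption
    res.raise_for_status()
    return res.content


def download_pages(lines, out_dir='.', fetch=fetch_with_requests):
    """Save each completed URL as Offenderpage<N>.html and return the paths."""
    out_dir = Path(out_dir)
    saved = []
    # enumerate supplies the counter, so no manual j = j + 1 is needed
    for j, url in enumerate(build_urls(lines)):
        path = out_dir / f'Offenderpage{j}.html'
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

With the file from the question, this would be called as download_pages(open('offenderurls.txt').readlines()). Passing fetch as a parameter is just a convenience so the loop can be tried without touching the network.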

I haven't tested the rest.
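As for why the crash shows up at j = 51: the inner for loop in the question iterates over the characters of one line, so oidstripped ends up with one element per character of that line, not one per URL. Assigning to oidstripped[j] therefore only works while j is smaller than the line length. The sketch below reproduces this with a made-up 51-character line (the length is an assumption chosen to match the reported behaviour; I have not checked the real file).

```python
def chars_of(line):
    """Rebuild the list the same way the question's inner for loop does."""
    oidstripped = []
    for ch in line:
        oidstripped.append(ch)  # one element per character, not per URL
    return oidstripped


line = 'x' * 50 + '\n'  # hypothetical line: 50 characters plus the newline
chars = chars_of(line)

chars[50] = line  # index 50 is the last valid slot, so this still works

try:
    chars[51] = line  # one past the end of the list
except IndexError as exc:
    message = str(exc)  # 'list assignment index out of range'
```

So the loop never had a list of stripped URLs to index into; it was reusing the line counter as an index into the characters of the current line.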

