Question

This post contains several questions. Thanks for taking a look.

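I have a dataframe with the columns 'URL', 'CIK', and 'date'. The URLs point to 10-K filings on the EDGAR website, and the CIKs are the unique ID of each filing entity, in case you were wondering. A portion of this dataframe is available as a CSV here.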

I want to loop through each URL, apply BeautifulSoup, and save the result for each URL to its own text file named with the CIK and date.

My code so far:

import urllib
from bs4 import BeautifulSoup
import pandas as pd
import numpy
import os

#x is a dataframe including columns 'url', 'cik' and 'date'
#convert x to tuple

subset = x[['url', 'cik', 'date']]
tuples = [tuple(x) for x in subset.values]

os.chdir("C:/10k/Python")

#goal: loop through each URL, run BS,
#write to .txt named with matching CIK and date element

for index, url in enumerate(tuples):

    fp = urllib.request.urlopen(tuples)
    test = fp.read()
    soup = BeautifulSoup(test,"lxml")
    output=soup.get_text()
    file=open("url%s.txt","w",encoding='utf-8')
    file.close()
    file.write(output)

I have a few problems:

When I try to write the loop directly over the dataframe, I get the following error:

'Series' objects are mutable, thus they cannot be hashed

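I believe the answer here is to convert to tuples, which I did. That makes the rows immutable. But now I am not sure how to refer to the different elements of each tuple when writing the loop.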
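One way to refer to the individual elements is to unpack each tuple in the `for` statement itself. A minimal sketch, assuming `tuples` has the same (url, cik, date) shape as in the code above; the rows below are made-up placeholder values, not real filings:

```python
# Placeholder rows shaped like the `tuples` list above: (url, cik, date).
# These values are invented for illustration only.
tuples = [
    ("https://example.com/filing-a.txt", 1000045, "2017-03-01"),
    ("https://example.com/filing-b.txt", 1000180, "2017-02-15"),
]

labels = []
for url, cik, date in tuples:
    # Each name is bound to one element of the current tuple.
    labels.append(f"{cik} filed on {date}: {url}")

print(labels[0])
```

Indexing also works (`row[0]`, `row[1]`, `row[2]`), but unpacking keeps the names readable inside the loop body.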

The next thing I tried was looping over the tuples with enumerate. I get the following error:

AttributeError: 'list' object has no attribute 'timeout'.

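I believe this means the loop is trying to read the tuples as a whole rather than each element, but I am not sure, and I could not find a good answer on the forums.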
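The error is consistent with that reading: the loop body passes the whole list `tuples` to `urlopen` instead of the URL from the current row. When `urlopen` receives something that is not a string, it treats it as a request object and tries to set a `timeout` attribute on it, which a list does not allow. A rough sketch of passing one URL per iteration; `fetch_bytes` is a hypothetical helper name, and a `data:` URL stands in for a real filing so the example runs offline:

```python
import urllib.request  # import the submodule explicitly; plain `import urllib` may not load it


def fetch_bytes(url):
    """Hypothetical helper: download one URL and return the raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


# Placeholder row; the data: URL is a stand-in for a real filing URL.
tuples = [("data:text/plain,hello", 1000045, "2017-03-01")]

pages = []
for url, cik, date in tuples:
    # Pass the single URL string for this row, not the whole list.
    pages.append(fetch_bytes(url))

print(pages)
```

The bytes returned for each row could then be handed to `BeautifulSoup(..., "lxml")` exactly as in the original loop.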

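Finally, I am not sure how to refer to the tuple elements when writing each file to .txt. Right now I have url%s, but that would only produce URL1, URL2, and so on.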
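Two details in the posted code stand out here: no value is ever substituted into `"url%s.txt"`, and `file.close()` is called before `file.write(output)`, so the write would fail on a closed file. A sketch of building the name from the CIK and date instead; the helper `save_text` and the `<cik>_<date>.txt` naming pattern are assumptions for illustration, not part of the original code:

```python
import os
import tempfile


def save_text(text, cik, date, out_dir):
    """Hypothetical helper: write text to <out_dir>/<cik>_<date>.txt and return the path."""
    # Replace slashes so a date such as 3/1/2017 cannot be read as a directory path.
    safe_date = str(date).replace("/", "-")
    path = os.path.join(out_dir, f"{cik}_{safe_date}.txt")
    # The with-block closes the file only after the write has finished.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


# Demonstration in a temporary directory with made-up values.
demo_dir = tempfile.mkdtemp()
saved = save_text("example filing text", 1000045, "2017-03-01", demo_dir)
print(os.path.basename(saved))
```

Inside the loop, the string from `soup.get_text()` would be the `text` argument, with `cik` and `date` taken from the current tuple.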

Loop through URLs, apply BeautifulSoup, save files named with tuple elements

0 answers: