Question

While looping over many web pages, I scrape some information from each one.

I am thinking of building a CSV, something like this:

import csv

fieldnames = ['id', 'variable1', 'variable2']

f = open('file.csv', 'w', newline='')
my_writer = csv.DictWriter(f, fieldnames)
my_writer.writeheader()

for webpage in webpages:

    # something where I get the information and put it in a dictionary mydict.
    # Example: mydict = {'id': 1, 'variable1': 200, 'variable2': 300}

    my_writer.writerow(mydict)

f.close()

The problem is that each web page may have a different number of variables, so I would need to adapt this approach.
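To make the issue concrete, here is a minimal sketch of how csv.DictWriter behaves when rows have different keys. It assumes the full set of columns is known before the loop starts, which may not hold for real scraped pages; the column 'variable3' and the sample rows are hypothetical. The restval argument fills in missing keys, and extrasaction='ignore' silently drops keys that are not in fieldnames.

```python
import csv
import io

# Assumption: the complete set of columns is known up front.
# 'variable3' is a hypothetical extra column used for illustration.
fieldnames = ['id', 'variable1', 'variable2', 'variable3']

# Hypothetical rows standing in for the scraped dictionaries.
rows = [
    {'id': 1, 'variable1': 200, 'variable2': 300},
    {'id': 2, 'variable1': 150, 'variable3': 75},
]

# An in-memory buffer stands in for open('file.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames, restval='', extrasaction='ignore')
writer.writeheader()
for row in rows:
    # Keys missing from a row are written as restval (an empty field here).
    writer.writerow(row)

csv_text = buffer.getvalue()
print(csv_text)
```

If the columns are not known in advance, the header cannot be written before the loop, so this sketch alone does not solve the problem.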

Another option I thought of is to build a list of dictionaries and, at the end, convert it to a DataFrame and then to a CSV:

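import pandas as pd

finalist = []
for webpage in webpages:

    # something where I get the information and put it in a dictionary mydict.
    # Example: mydict = {'id': 1, 'variable1': 200, 'variable2': 300}

    finalist.append(mydict)

# Build the frame from the accumulated list, not from the last single-item list.
df = pd.DataFrame(finalist)
df.to_csv('file.csv')  # same output path as in the first snippet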

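This is a very long loop, so there will be many rows. Which of the two approaches is more efficient? Or is there another approach that is more efficient than both? Also, should I store the data as a JSON file, a CSV, or some other format so that I can use it later in R or another program?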
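One possible third option, sketched here under assumptions rather than offered as a measured answer: append each page's dictionary to a JSON Lines file as it is scraped, then build the CSV in a second step once every column name has been seen. The helper names append_record and jsonl_to_csv, the file names, and the sample records are all hypothetical. This keeps memory use flat during the loop and leaves partial results on disk if the scrape is interrupted, but it has not been benchmarked against the two approaches above.

```python
import csv
import json
import os
import tempfile


def append_record(path, record):
    """Append one scraped dictionary as a single JSON line."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record) + '\n')


def jsonl_to_csv(jsonl_path, csv_path):
    """Convert a JSON Lines file to CSV, using the union of all keys as columns."""
    # First pass: collect column names in first-seen order.
    seen = {}
    with open(jsonl_path, encoding='utf-8') as f:
        for line in f:
            for key in json.loads(line):
                seen[key] = None
    fieldnames = list(seen)

    # Second pass: write the rows; missing keys become empty fields.
    with open(jsonl_path, encoding='utf-8') as src, \
            open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.DictWriter(dst, fieldnames, restval='')
        writer.writeheader()
        for line in src:
            writer.writerow(json.loads(line))
    return fieldnames


# Hypothetical records standing in for what the scraping loop would produce.
workdir = tempfile.mkdtemp()
jsonl_path = os.path.join(workdir, 'pages.jsonl')
csv_path = os.path.join(workdir, 'pages.csv')

for record in [
    {'id': 1, 'variable1': 200, 'variable2': 300},
    {'id': 2, 'variable1': 150, 'variable3': 75},
]:
    append_record(jsonl_path, record)

columns = jsonl_to_csv(jsonl_path, csv_path)
print(columns)
```

Opening the file once per record keeps the sketch short; in a long loop it would be reasonable to keep the file handle open instead.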

Most efficient way to store and build data in Python

0 answers: