How do I reshape a list created by web scraping?

Time: 2021-05-29 15:46:20

Tags: python python-3.x dataframe reshape

I cobbled together the code below.

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
import itertools
import numpy as np

url_base = "https://finviz.com/quote.ashx?t="

tckr = ['MSFT','AAPL','AMZN']
        
i = 1

url_list = [(s, url_base + s) for s in tckr]
data_list = []

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

for t, url in url_list:
    print(i)
    i = i + 1
    print(t, url)
    print('Scraping ticker {}...'.format(t))
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    #writer.writerow([t])
    for row in soup.select('.snapshot-table2 tr'):
        data_list.append([td.text for td in row.select('td')])

df = DataFrame(data_list)

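As a result, the final DataFrame has shape (36, 12): it contains 3 blocks, one per ticker, each of shape (12, 12). I suspect the line below is responsible, but I don't fully understand what it does.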

data_list.append([td.text for td in row.select('td')])
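To see why the shape comes out as (36, 12), here is a minimal sketch that uses made-up placeholder cells instead of live pages. It assumes, as the observed shape suggests, that each ticker's table has 12 rows of 12 cells; `fake_table` is a hypothetical stand-in for the scraped rows, not part of the original code.

```python
# Hypothetical stand-in for one scraped table: 12 rows of 12 cells each,
# alternating label, value, label, value, ...
def fake_table(ticker, n_rows=12, n_pairs=6):
    return [
        [cell
         for p in range(n_pairs)
         for cell in (f"label{r * n_pairs + p}", f"{ticker}-value{r * n_pairs + p}")]
        for r in range(n_rows)
    ]

data_list = []
for ticker in ["MSFT", "AAPL", "AMZN"]:
    for row in fake_table(ticker):
        # Each table row becomes one 12-item list appended to the SAME
        # shared list, so the rows of all tickers are stacked vertically.
        data_list.append(row)

shape = (len(data_list), len(data_list[0]))
print(shape)  # (36, 12)
```

Because every row of every ticker lands in one shared list, the DataFrame gets one row per table row (3 x 12 = 36) rather than one row per ticker.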

Right now, my data looks like this.

[image: current DataFrame]

What I want instead is 72 columns of data per ticker (one row for each ticker), so that the final result looks like this.

[image: desired DataFrame]

1 Answer:

Answer 0: (score: 1)

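You need to store the row sublists for each ticker in a list of their own instead of mixing them all together. You can then use itertools' chain.from_iterable to flatten them into one big list per ticker, take every even-indexed item as a key and every odd-indexed item as the matching value in a dictionary, and append each ticker's finished dictionary to a larger list. That list of dictionaries can be turned into a DataFrame.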
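The flatten-and-pair step can be tried on its own, without any network access. The sketch below uses made-up rows and a placeholder ticker `XYZ`; the labels and values are illustrative only and are not taken from the real page.

```python
from itertools import chain

# Made-up rows for a single ticker; each row alternates label, value.
rows = [
    ["P/E", "30.1", "EPS (ttm)", "5.2"],
    ["Market Cap", "1.9T", "Volume", "1,000"],
]

# Flatten the list of rows into one flat list of cells.
flat = list(chain.from_iterable(rows))

# Even positions (0, 2, 4, ...) are labels, odd positions are values.
record = dict(zip(flat[::2], flat[1::2]))
record["Ticker"] = "XYZ"  # placeholder ticker, stored under its own key

records = [record]  # one dictionary per ticker
print(record)
```

Passing a list of dictionaries like `records` to `pandas.DataFrame` yields one row per dictionary, with the dictionary keys as column names.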

import requests
from bs4 import BeautifulSoup
from pandas import DataFrame
from itertools import chain

url_base = "https://finviz.com/quote.ashx?t="

tckr = ['MSFT', 'AAPL', 'AMZN']

url_list = [(s, url_base + s) for s in tckr]
data_list = []  # will hold one dictionary per ticker

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

for i, (t, url) in enumerate(url_list, start=1):
    print(i)
    print(t, url)
    print('Scraping ticker {}...'.format(t))
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

    # Collect this ticker's table rows in a list of their own.
    rows = []
    for row in soup.select('.snapshot-table2 tr'):
        rows.append([td.text for td in row.select('td')])

    # Flatten the rows once all of them are collected, then pair
    # even-indexed cells (labels) with odd-indexed cells (values).
    x = list(chain.from_iterable(rows))
    d = dict(zip(x[::2], x[1::2]))
    d['Index'] = t  # record which ticker this row belongs to

    data_list.append(d)

# One row per ticker, one column per label.
df = DataFrame(data_list)
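One caveat with building a dictionary this way: if a label occurs more than once in a table, the later value silently replaces the earlier one, and storing the ticker under the 'Index' key would likewise replace any scraped cell whose label is 'Index'. Whether either case occurs depends on the page layout, which is not shown here. The sketch below keeps every pair by suffixing repeated labels and stores the ticker under a separate key. `rows_to_record` and the `Ticker` key are hypothetical names, and the sample rows are made up.

```python
from itertools import chain

def rows_to_record(rows, ticker):
    """Flatten label/value rows into one dict, keeping repeated labels."""
    flat = list(chain.from_iterable(rows))
    record = {"Ticker": ticker}
    for label, value in zip(flat[::2], flat[1::2]):
        key, n = label, 1
        # If the key is already taken, append a counter: "X", "X (2)", ...
        while key in record:
            n += 1
            key = f"{label} ({n})"
        record[key] = value
    return record

# Made-up rows containing a repeated label and an "Index" label.
sample_rows = [
    ["Index", "S&P 500", "EPS next Y", "10.00"],
    ["EPS next Y", "12.5%", "Volume", "1,000"],
]
record = rows_to_record(sample_rows, "XYZ")
print(record)
```

With this helper, the number of columns equals the number of label/value pairs plus one for the ticker, so a quick check of `len(record)` against the expected count can reveal collisions.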