Appending to the next row of a dataframe from inside a for loop

Date: 2019-07-23 09:19:59

Tags: python python-3.x pandas

I've created a web scraper, purely for Python programming education purposes, that scrapes a stock's Yahoo Finance summary and statistics pages. It reads from a "1stocklist.csv" file in the program directory, which looks like this:

Symbols
SNAP
KO

From there, it adds the new information as new columns in the dataframe. There are a lot of 'for' loops in there that I'm still tweaking, because it doesn't grab some of the data correctly, but that's fine for now.

My problem is with trying to save the dataframe to a new .csv file. As you can see, the way it outputs right now is this:

[image: wrong output]

The SNAP row should start at 14.02 and then everything is correct; the next row should be KO, starting at 51.39.

Any ideas? Just create a 1stocklist.csv file like the one above and give it a try. Thanks!

# Import dependencies
from bs4 import BeautifulSoup
import re, random, time, requests, datetime, csv
import pandas as pd
import numpy as np


# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
maindf = pd.read_csv('1stocklist.csv', skiprows=1, names=[
# The .csv header names
    "Symbols"
    ]) #, delimiter = ',')

# Setting a time delay will help keep scraping suspicion down and server load down when scraping the Zacks site
timeDelay = random.randrange(2, 8)


# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(maindf['Symbols']) # for progress bar


# Create a progress counter to display how far along in the zacks rank scraping it is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in maindf['Symbols']:

# Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker) # for seeing which stock it's currently on

# The list of URL's for the stock's different pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'

# Define the headers to use in Beautiful Soup 4
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'

# Employ random time delay now before starting with the (next) ticker
    time.sleep(timeDelay)





# Use Beautiful Soup 4 to get the info from the first Summary URL page
    page = requests.get(summaryurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    counter = 0 # used to tell which 'td' it's currently looking at
    table = soup.find('div', {'id' :'quote-summary'})
    for i in table.find_all('span'):
        counter += 1
        if counter % 2 == 0: # All Even td's are the metrics/numbers we want
            data_point = i.text
            #print(data_point)
            maindf[column_name] = data_point # Add the data point to the right column
        else:                # All odd td's are the header names
            column_name = i.text
            #print(column_name)





# Use Beautiful Soup 4 to get the info from the second stats URL page
    page = requests.get(statsurl, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')
    time.sleep(timeDelay)
# Get all the data in the tables
    counter = 0 # used to tell which 'td' it's currently looking at
    table = soup.find('section', {'data-test' :'qsp-statistics'})
    for i in table.find_all('td'):
        counter += 1
        if counter % 2 == 0: # All Even td's are the metrics/numbers we want
            data_point = i.text
            #print(data_point)
            maindf[column_name] = data_point # Add the data point to the right column
        else:                # All odd td's are the header names
            column_name = i.text
            #print(column_name)





    file_name = 'data_raw.csv'
    if zackscounter == 1:
        maindf.to_csv(file_name, index=False)
    else:
        maindf.to_csv(file_name, index=False, header=False, mode='a')

    zackscounter += 1
    continue
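Part of the duplicated output can be reproduced with `to_csv` alone: because `maindf` holds every ticker and is appended in full on each pass through the loop, the rows repeat. A minimal sketch of that behaviour, using a temporary file and made-up numbers:

```python
import os
import tempfile

import pandas as pd

# Two-ticker frame standing in for maindf after one pass of the loop;
# the numbers are invented for illustration.
df = pd.DataFrame({"Symbols": ["SNAP", "KO"], "Open": ["14.02", "51.39"]})

path = os.path.join(tempfile.mkdtemp(), "data_raw.csv")
df.to_csv(path, index=False)                          # first iteration: header + 2 rows
df.to_csv(path, index=False, header=False, mode="a")  # next iteration: same 2 rows again

with open(path) as fh:
    print(fh.read())  # every data row now appears twice
```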

Update:

I know this has to do with how I try to append the dataframe to the .csv file at the end. My starting dataframe is just one column with all the ticker symbols in it, and then I try to add each new column to the dataframe as the program goes along, filling down to the bottom of the stock list. All I want it to do is add the column_name headers where they belong, append the data specific to one ticker, and then do the same for each ticker in my dataframe's "Symbols" column. Hopefully that makes the question clearer?

I've tried using .loc in various ways without success. Thanks!
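For readers hitting the same problem: the root cause is that assigning a scalar to a dataframe column broadcasts it to every row. A tiny sketch of that behaviour (the column name and values are made up):

```python
import pandas as pd

# Miniature of the scraping loop: maindf holds ALL tickers, so a scalar
# column assignment overwrites the value for every ticker at once.
maindf = pd.DataFrame({"Symbols": ["SNAP", "KO"]})

maindf["Open"] = "14.02"  # value scraped for SNAP
maindf["Open"] = "51.39"  # value scraped for KO -- SNAP's 14.02 is lost

print(maindf)
```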

1 Answer:

Answer 0 (score: 0)

Updated with the answer

I was able to figure it out!

Basically, I made the first dataframe read from 1stocklist.csv its own dataframe, and then created a new blank one to use inside the first for loop. Here is the updated opening section I created:

# Use Pandas to read the "1stocklist.csv" file. We'll use Pandas so that we can append a 'dataframe' with new
# information we get from the Zacks site to work with in the program and output to the 'data(date).csv' file later
opening_dataframe = pd.read_csv('1stocklist.csv', skiprows=1, names=[
# The .csv header names
    "Symbols"
    ]) #, delimiter = ',')

# Setting a time delay will help keep scraping suspicion down and server load down when scraping the Zacks site
timeDelay = random.randrange(2, 8)


# Start scraping Yahoo
print('Beginning to scrape Yahoo Finance site for information ...')
tickerlist = len(opening_dataframe['Symbols']) # for progress bar


# Create a progress counter to display how far along in the zacks rank scraping it is
zackscounter = 1

# For every ticker in the stocklist dataframe
for ticker in opening_dataframe['Symbols']:

    maindf = pd.DataFrame(columns=['Symbols'])

    maindf.loc[len(maindf)] = ticker

# Print the progress
    print(zackscounter, ' of ', tickerlist, ' - ', ticker) # for seeing which stock it's currently on

# The list of URL's for the stock's different pages to scrape the information from
    summaryurl = 'https://ca.finance.yahoo.com/quote/' + ticker
    statsurl = 'https://ca.finance.yahoo.com/quote/' + ticker + '/key-statistics'
......
......
......

Note the name change to "opening_dataframe = ...", and the

maindf = pd.DataFrame(columns=['Symbols'])
maindf.loc[len(maindf)] = ticker

part. I also make use of .loc to add to the next available row in the dataframe. Hope this helps someone!
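The accepted pattern can be sketched end to end without the network calls; `fake_scrape` below is a hypothetical stand-in for the BeautifulSoup loops in the answer, and the numbers are invented:

```python
import os
import tempfile

import pandas as pd

def fake_scrape(ticker):
    # Hypothetical stand-in for the Yahoo scraping; returns column -> value.
    return {"Open": "14.02" if ticker == "SNAP" else "51.39"}

opening_dataframe = pd.DataFrame({"Symbols": ["SNAP", "KO"]})
file_name = os.path.join(tempfile.mkdtemp(), "data_raw.csv")

for counter, ticker in enumerate(opening_dataframe["Symbols"], start=1):
    # Fresh one-row frame for this ticker only, as in the answer
    maindf = pd.DataFrame(columns=["Symbols"])
    maindf.loc[len(maindf)] = ticker           # .loc fills the next free row
    for column_name, data_point in fake_scrape(ticker).items():
        maindf[column_name] = data_point       # broadcasts onto the single row

    # Header only on the first write, then append rows
    if counter == 1:
        maindf.to_csv(file_name, index=False)
    else:
        maindf.to_csv(file_name, index=False, header=False, mode="a")

print(open(file_name).read())
```

Each iteration now writes exactly one row, so the CSV ends up with one line per ticker instead of a repeated full table.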