Using a pandas DataFrame (insert new column) in a Python script

Date: 2017-11-16 19:02:12

Tags: python pandas beautifulsoup

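I'm trying to add a 'Game ID' column to the table I'm scraping (see the script below). I don't know where to add pd.DataFrame, or what to call it on (in my script), so that I can insert a new column named 'Game ID' before the script writes to the CSV file (so that the file is written with the new Game ID column).

(Some background: the 'Game ID' is the i in the loop that iterates over the URLs being scraped.)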

I tried entering

  • df.insert(0, 'GameID', range(1, 1 + len(df))) or
  • df['GameID'] = (df.index / 18 + 1).astype(int)

but I don't know what to use as my DataFrame (I tried pd.Dataframe[table, columns='cols] but it won't read it).

#ALL HOME GOALIES GAME STATS

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv


f = open('HOME_GOALIES_ALL.csv', 'a', newline = '')
writer = csv.writer(f)

GameID = i
for i in range (400961844,400961845):
    url = requests.get("http://www.espn.com/nhl/boxscore?gameId={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find_all('table', {'class' : 'mod-data'})[8].find_all('tr')[2:]
    for row in table:
        cols = row.findChildren(recursive=False)
        cols = [ele.text.strip() for ele in cols]
        writer.writerow(cols)

1 answer:

Answer 0 (score: 0)

There is no DataFrame in your code; however, you can do it as follows:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Collects one (game ID, table rows) pair per successfully fetched game
table = []
for i in range (400961844,400961848):
    url = requests.get("http://www.espn.com/nhl/boxscore?gameId={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    # Store the game ID with its rows so that every player row can be tagged with the right game later
    table.append((i,soup.find_all('table', {'class' : 'mod-data'})[8].find_all('tr')[2:]))


data = []
game_id = []
for i, t in table:
    # The .contents attribute turns each row tag into a list of its child cells
    soups = [j.contents for j in t]
    for s in soups:
        # The .string attribute extracts the text of each cell in the row
        data.append([a.string for a in s])
        # Append the game ID once per row so data and game_id stay the same length
        game_id.append(i)

In [58]:
data
Out[58]:
[['H. Lundqvist', '25', '3', '22', '.880', '58:19', '0'],
 ['C. Anderson', '28', '4', '24', '.857', '65:00', '0'],
 ['J. Howard', '39', '2', '37', '.949', '60:00', '0'],
 ['C. Crawford', '29', '1', '28', '.966', '59:56', '0'],
 ['J. Gibson', '30', '4', '26', '.867', '59:53', '10'],
 ['J. Quick', '35', '0', '35', '1.000', '59:59', '0'],
 ['S. Bobrovsky', '29', '0', '29', '1.000', '59:53', '0'],
 ['A. Vasilevskiy', '36', '3', '33', '.917', '60:00', '0'],
 ['K. Lehtonen', '11', '2', '9', '.818', '15:00', '0'],
 ['B. Bishop', '19', '0', '19', '1.000', '43:58', '0'],
 ['F. Andersen', '35', '5', '30', '.857', '60:00', '0']]


# Create a DataFrame from the extracted data
df = pd.DataFrame(data)

In [59]:
df

Out[59]:
    0               1   2  3   4      5      6
0   H. Lundqvist    25  3  22  .880   58:19  0
1   C. Anderson     28  4  24  .857   65:00  0
2   J. Howard       39  2  37  .949   60:00  0
3   C. Crawford     29  1  28  .966   59:56  0
4   J. Gibson       30  4  26  .867   59:53  10
5   J. Quick        35  0  35  1.000  59:59  0
6   S. Bobrovsky    29  0  29  1.000  59:53  0
7   A. Vasilevskiy  36  3  33  .917   60:00  0
8   K. Lehtonen     11  2  9   .818   15:00  0
9   B. Bishop       19  0  19  1.000  43:58  0
10  F. Andersen     35  5  30  .857   60:00  0

Column names can be set with: df.columns = [list_of_columns_names]
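As a minimal sketch of setting the column names, the snippet below rebuilds a two-row frame from the sample output and assigns labels to it. The labels themselves (Goalie, SA, GA and so on) are assumptions chosen for illustration; they are not taken from the scraped page.

```python
import pandas as pd

# Two rows copied from the sample output.
data = [
    ['H. Lundqvist', '25', '3', '22', '.880', '58:19', '0'],
    ['C. Anderson', '28', '4', '24', '.857', '65:00', '0'],
]
df = pd.DataFrame(data)

# Hypothetical column names; the list needs one entry per column,
# otherwise pandas raises a ValueError.
column_names = ['Goalie', 'SA', 'GA', 'SV', 'SV%', 'TOI', 'PIM']
df.columns = column_names

print(df.columns.tolist())
```

The same labels could also be passed at construction time through the columns argument, as in pd.DataFrame(data, columns=column_names).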

Now the important part: to add the 'Game ID' column, you can use the game_id list we created earlier: df['Game ID'] = game_id
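A short sketch of this step, assuming a game_id list with exactly one entry per row of data. The IDs paired with each row here are illustrative only; the real pairing comes from the scraping loop. The second variant uses DataFrame.insert, which the question attempted, to place the column first instead of last.

```python
import pandas as pd

# Three rows copied from the sample output.
data = [
    ['H. Lundqvist', '25', '3', '22', '.880', '58:19', '0'],
    ['C. Anderson', '28', '4', '24', '.857', '65:00', '0'],
    ['J. Howard', '39', '2', '37', '.949', '60:00', '0'],
]
# Illustrative IDs (assumed pairing): one entry per row in `data`.
game_id = [400961844, 400961844, 400961845]

# Variant 1: plain assignment appends 'Game ID' as the last column.
df = pd.DataFrame(data)
df['Game ID'] = game_id

# Variant 2: insert() places 'Game ID' at position 0, i.e. as the first column.
df_first = pd.DataFrame(data)
df_first.insert(0, 'Game ID', game_id)

print(df.columns.tolist())
print(df_first.columns.tolist())
```

Both variants require len(game_id) to equal the number of rows; a mismatch raises a ValueError, which is why the ID is appended once per row in the parsing loop.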

Finally, write the DataFrame to a CSV file: df.to_csv('path_of_file')
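A minimal sketch of the write step, under a few assumptions: the output file name and the temporary directory are hypothetical, and the small frame stands in for the scraped data. Passing index=False keeps the row index out of the file, and reading the file back confirms that the 'Game ID' column was written.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; column names are illustrative assumptions.
df = pd.DataFrame(
    [['H. Lundqvist', '25', 400961844], ['C. Anderson', '28', 400961844]],
    columns=['Goalie', 'SA', 'Game ID'],
)

# Hypothetical output location inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), 'home_goalies.csv')

# index=False omits the 0, 1, 2, ... row index from the CSV.
df.to_csv(out_path, index=False)

# Read the file back to check what was written.
round_trip = pd.read_csv(out_path)
print(round_trip)
```

To keep the append behaviour of the original script, which opened the file in 'a' mode, to_csv also accepts mode='a'; in that case header=False avoids repeating the header row on every run.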