Question

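Below is a scraper that loops over two websites, scrapes each team's roster information, puts the information into arrays, and exports the arrays to a CSV file. Everything works, but the only problem is that the header row is written to the CSV file again every time the scraper moves on to the second website. Is it possible to adjust the CSV part of the code so that the header appears only once when the scraper loops over multiple websites? Thanks in advance!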

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

for team in team_list:
    page = requests.get('http://m.{}.mlb.com/roster/'.format(team))
    soup = BeautifulSoup(page.text, 'html.parser')

    soup.find(class_='nav-tabset-container').decompose()
    soup.find(class_='column secondary span-5 right').decompose()

    roster = soup.find(class_='layout layout-roster')
    names = [n.contents[0] for n in roster.find_all('a')]
    ids = [n['href'].split('/')[2] for n in roster.find_all('a')]
    number = [n.contents[0] for n in roster.find_all('td', index='0')]
    handedness = [n.contents[0] for n in roster.find_all('td', index='3')]
    height = [n.contents[0] for n in roster.find_all('td', index='4')]
    weight = [n.contents[0] for n in roster.find_all('td', index='5')]
    DOB = [n.contents[0] for n in roster.find_all('td', index='6')]
    team = [soup.find('meta',property='og:site_name')['content']] * len(names)

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))

Answer 1

Using a variable to track whether the header has been added may help. Once the header has been written, it is not written a second time.

header_added = False
for team in team_list:
    # ... scrape and parse the roster as in the question ...

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        if not header_added:
            f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])
            header_added = True
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
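
One caveat with the flag: it only lives for a single run, and because the file is opened in append mode, running the script again would add a second header. A minimal sketch of one way around that, assuming it is acceptable to decide from the file's state instead of a flag, is shown below. The `append_rows` helper, the `HEADER` columns, and the sample rows are hypothetical stand-ins, not part of the original code.

```python
import csv
import os
import tempfile

HEADER = ['Name', 'Number', 'Team']  # hypothetical, shortened column list

def append_rows(path, rows):
    """Append rows to a CSV, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as fp:
        writer = csv.writer(fp)
        if needs_header:
            writer.writerow(HEADER)
        writer.writerows(rows)

# Two separate calls simulate two teams (or two runs of the script).
demo_path = os.path.join(tempfile.mkdtemp(), 'roster_demo.csv')
append_rows(demo_path, [('Player A', '1', 'Team X')])
append_rows(demo_path, [('Player B', '2', 'Team Y')])

with open(demo_path, newline='') as fp:
    written = list(csv.reader(fp))
```

Here `written` holds the header once, followed by the two data rows, regardless of how many times `append_rows` is called on the same file.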

Answer 2

Another approach is simply to write the header before the for loop, so you don't have to check whether it has already been written.

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])

for team in team_list:
    # ... do your bs4 scraping and parsing as in the question ...

    with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
        f = csv.writer(fp)
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))

You can also open the file once instead of three times:

import requests
import csv
from bs4 import BeautifulSoup

team_list={'yankees','redsox'}

with open('MLB_Active_Roster.csv', 'w', newline='') as fp:
    f = csv.writer(fp)
    f.writerow(['Name','ID','Number','Hand','Height','Weight','DOB','Team'])

    for team in team_list:
        # ... do your bs4 scraping and parsing as in the question ...

        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))
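
To make the open-once pattern easy to try without hitting the network, here is a small self-contained sketch. It assumes the scraping step can be wrapped in a function; `fetch_roster`, its sample data, and the shortened `HEADER` are hypothetical placeholders for the requests/BeautifulSoup code in the question, and an in-memory `io.StringIO` buffer stands in for the real file.

```python
import csv
import io

HEADER = ['Name', 'ID', 'Team']  # hypothetical, shortened column list

def fetch_roster(team):
    """Hypothetical placeholder for the requests/BeautifulSoup scraping step."""
    sample = {
        'yankees': [('Player A', '100')],
        'redsox': [('Player B', '200'), ('Player C', '300')],
    }
    return [(name, player_id, team) for name, player_id in sample[team]]

def write_all(fp, teams):
    """Write the header once, then every team's rows, to an already-open file."""
    writer = csv.writer(fp)
    writer.writerow(HEADER)
    for team in teams:
        writer.writerows(fetch_roster(team))

# StringIO stands in for open('MLB_Active_Roster.csv', 'w', newline='').
buffer = io.StringIO(newline='')
write_all(buffer, ['yankees', 'redsox'])
lines = buffer.getvalue().splitlines()
```

The sketch passes the teams as a list rather than a set so the row order is deterministic; with a set, as in the question, the teams may be visited in any order.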

Answer 3

Just write the header before the loop, and put the loop inside the with context manager:

import requests
import csv
from bs4 import BeautifulSoup

team_list = {'yankees', 'redsox'}

headers = ['Name', 'ID', 'Number', 'Hand', 'Height', 'Weight', 'DOB', 'Team']

# 1. wrap everything in context manager
with open('MLB_Active_Roster.csv', 'a', newline='') as fp:
    f = csv.writer(fp)

    # 2. write headers before anything else
    f.writerow(headers)

    # 3. now process the loop
    for team in team_list:
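        # Do everything else (scrape and parse as in the question), then:
        f.writerows(zip(names, ids, number, handedness, height, weight, DOB, team))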

You can also define the headers outside the loop, just like team_list, which makes the code cleaner.
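
Once the headers live in a single list, a related option is the standard library's `csv.DictWriter`, which writes the header through an explicit `writeheader()` call. This is a different technique from the one in the answer, shown only as a sketch; the `FIELDNAMES` list and the `players` records are made-up sample data, and `io.StringIO` stands in for the output file.

```python
import csv
import io

FIELDNAMES = ['Name', 'Number', 'Team']  # hypothetical, shortened column list

# Made-up records; in the real scraper these would come from the parsed pages.
players = [
    {'Name': 'Player A', 'Number': '1', 'Team': 'Team X'},
    {'Name': 'Player B', 'Number': '2', 'Team': 'Team Y'},
]

out = io.StringIO(newline='')
writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
writer.writeheader()          # called once, before any data rows
for player in players:
    writer.writerow(player)   # values are matched to columns by key

result = out.getvalue().splitlines()
```

Because each row is a dict keyed by column name, the column order is fixed by `FIELDNAMES` rather than by the order of the arguments to `zip`.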

Writing the header once in a Python CSV writer loop
