Question

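I have thousands of html files spread across a few folders, and I want to extract the data from their comments and put it into a csv file. This will let me format and clean up the project. For example, I have 640 html files in this folder: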

D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA

Here is my code, which extracts the comments from one file and puts them into a CSV file:

# import libraries and files
from bs4 import BeautifulSoup, Comment
import re
import csv

# Get Page, Make Soup
soup = BeautifulSoup(open("D:/My Web Sites/baseball 2/www.baseball-reference.com/boxes/ANA/ANA201806180.html"), 'lxml')

# Get Description
game_description = soup.findAll("div", {"scorebox_meta"})
print (game_description)

# Get Comment Data
Player_Data = soup.find_all(string=lambda text: isinstance(text, Comment))
for c in Player_Data:
    print(c)
    print("===========")

# Results to CSV
csvfile = "C:/Users/Benny/Desktop/anatest.csv"

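with open(csvfile, "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    # One row per comment; passing the strings directly would split each into single characters
    writer.writerows([c] for c in Player_Data)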

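I need to extract the game data from each html file (all of the data is nested in comments within the html code) and then put the individual scrapes of each game file into a single CSV. Any help with the code would be greatly appreciated.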

Thanks, Benny

Answer 1

You can use the os.listdir function to iterate over all the files in a directory. Alternatively, you can use the glob module.

For example, with os.listdir:

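import os
from bs4 import BeautifulSoup

path = r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA"

for filename in os.listdir(path):
    # Skip anything in the folder that is not an html file
    if filename.endswith(".html"):
        fullpath = os.path.join(path, filename)

        # Get Page, Make Soup (the with block closes the file afterwards)
        with open(fullpath) as page:
            soup = BeautifulSoup(page, 'lxml')
        # .....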
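
The glob alternative mentioned above, combined with the comment extraction and CSV writing from the question, could be sketched roughly as follows. This is only a sketch under some assumptions: the helper names extract_comments and scrape_folder and the filename/comment column layout are made up for illustration, the files are opened in binary mode so that BeautifulSoup guesses the encoding, and the parser defaults to the built-in "html.parser" rather than the 'lxml' used in the question (pass parser="lxml" to match it).

```python
import csv
import glob
import os

from bs4 import BeautifulSoup, Comment


def extract_comments(html_path, parser="html.parser"):
    """Return the text of every HTML comment in one file (hypothetical helper)."""
    # Binary mode: let BeautifulSoup detect the encoding itself.
    with open(html_path, "rb") as page:
        soup = BeautifulSoup(page, parser)
    found = soup.find_all(string=lambda text: isinstance(text, Comment))
    return [str(c) for c in found]


def scrape_folder(folder, csv_path, parser="html.parser"):
    """Write one (filename, comment) row per comment in folder/*.html.

    All files go into a single CSV; returns the number of data rows written.
    """
    pattern = os.path.join(folder, "*.html")
    rows_written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as output:
        writer = csv.writer(output)
        writer.writerow(["filename", "comment"])  # assumed column layout
        for html_path in sorted(glob.glob(pattern)):
            name = os.path.basename(html_path)
            for comment in extract_comments(html_path, parser):
                writer.writerow([name, comment.strip()])
                rows_written += 1
    return rows_written
```

With the paths from the question, the call would look like scrape_folder(r"D:\My Web Sites\baseball 2\www.baseball-reference.com\boxes\ANA", "C:/Users/Benny/Desktop/anatest.csv"). Keeping the filename in the first column makes it possible to tell the games apart once everything sits in one CSV.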

Repeat BeautifulSoup scrape for all files in a local folder
