Question

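I have more than 5,000 XML files spread across several subdirectories named f1, f2, f3, f4, ... Each folder contains more than 200 files. I want to parse all of them with BeautifulSoup. I have already tried lxml, ElementTree, and minidom, but I am struggling to do the same thing with BeautifulSoup.

I can parse a single file in a subdirectory, but I cannot get all the files with BeautifulSoup.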

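I have looked at the following posts:

XML parsing in Python using BeautifulSoup (parses a single file)

Parsing all XML files in directory and all subdirectories (this one uses minidom)

Reading 1000s of XML documents with BeautifulSoup (I could not get the files using this post)

Here is the code I wrote to parse a single file: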

from bs4 import BeautifulSoup

file = BeautifulSoup(open('./Folder/SubFolder1/file1.XML'),'lxml-xml') 

print(file.prettify())

When I try to get all the files in all the folders, I use the following code:

from bs4 import BeautifulSoup

file = BeautifulSoup('//Folder/*/*.XML','lxml-xml') 

print(file.prettify())

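With this I only get the XML version declaration and nothing else. I know I have to use a for loop, but I do not know how to use one to parse all the files.

I know it will be very slow, but for the sake of learning I want to parse all the files with BeautifulSoup. If a for loop is not advisable, I would appreciate a better solution, as long as it uses only BeautifulSoup.

Regards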

Answer 1

If I understand you correctly, you need to loop over the files, as you already suspected:

from bs4 import BeautifulSoup
from pathlib import Path

for filepath in Path('./Folder').glob('*/*.XML'):
    with filepath.open() as f:
        soup = BeautifulSoup(f,'lxml-xml')
    print(soup.prettify())

pathlib is just a higher-level, object-oriented way of working with paths. You can achieve the same result with glob and string paths.
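As a rough illustration of that equivalence, the sketch below builds a throwaway directory tree and checks that Path.glob and glob.glob find the same files. The helper names and the f1/f2 layout are made up for the demo and are not part of the original answer.

```python
import glob
import tempfile
from pathlib import Path


def find_with_pathlib(root):
    """Return sorted string paths of .XML files one level below root, via pathlib."""
    return sorted(str(p) for p in Path(root).glob('*/*.XML'))


def find_with_glob(root):
    """Return the same listing using glob.glob and a plain string pattern."""
    return sorted(glob.glob(str(Path(root) / '*' / '*.XML')))


# Hypothetical layout for the demo: two subfolders with one XML file each.
with tempfile.TemporaryDirectory() as tmp:
    for sub in ('f1', 'f2'):
        folder = Path(tmp) / sub
        folder.mkdir()
        (folder / 'file1.XML').write_text('<root/>')
    via_pathlib = find_with_pathlib(tmp)
    via_glob = find_with_glob(tmp)

# Both approaches should report the same two files.
print(via_pathlib == via_glob)
```

Either listing can be fed to the loop shown earlier. For subfolders nested more than one level deep, Path(root).rglob('*.XML') is the recursive pathlib counterpart.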

Answer 2

Use glob.glob to find the XML documents:

import glob

from bs4 import BeautifulSoup

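# Match every .XML file one level below ./Folder.
for filename in glob.glob('./Folder/*/*.XML'):
    # Pass an open file handle; a bare filename string would be parsed as markup.
    with open(filename) as f:
        content = BeautifulSoup(f, 'lxml-xml')
    print(content.prettify())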

Note: do not shadow the built-in function/class file (a built-in in Python 2).

Read the BeautifulSoup Quick Start.
