Question

I'm parsing Wikipedia metadata files using bs4 and Python 3.5.

This works for extracting data from a test slice of the (much larger) file:

from bs4 import BeautifulSoup

with open("Wikipedia/test.xml", 'r') as xml_file:
    xml = xml_file.read()

print(BeautifulSoup(xml, 'lxml').select("timestamp"))

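The problem is that the metadata files are all 12 GB or larger, so rather than slurping the whole file in as a string before ensoupification, I'd like BeautifulSoup to read the data as an iterator (possibly even from gzcat, to avoid keeping the data around in uncompressed files).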

However, my attempts to hand BS anything other than a string cause it to choke. Is there a way to get BS to read data as a stream instead of a string?

Answer 1

You can give BS a file handle object.

with open("Wikipedia/test.xml", 'r') as xml_file:
    soup = BeautifulSoup(xml_file, 'lxml')

This is the first example in the Making the Soup documentation.

Answer 2

BeautifulSoup has no streaming option, even with the lxml parser, but you can use iterparse() (available in xml.etree.ElementTree) to read large XML files in chunks.

Read more here or here.
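As a rough sketch of the iterparse() approach, the snippet below assumes the standard library's xml.etree.ElementTree and pulls out timestamp elements one at a time. The helper name iter_timestamps is made up for illustration, and stripping the "{namespace}" prefix from tag names is an assumption about how the dump's elements are namespaced.

```python
import xml.etree.ElementTree as ET


def iter_timestamps(source):
    """Yield the text of each <timestamp> element without loading the whole file.

    `source` may be a filename or a binary file object, such as the handle
    returned by gzip.open(path, "rb").
    """
    # Request "start" events as well so the first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)

    for event, elem in context:
        if event != "end":
            continue
        # Tags may look like "{namespace-uri}timestamp"; compare the local name.
        if elem.tag.rsplit("}", 1)[-1] == "timestamp":
            yield elem.text
            # Drop the subtrees parsed so far so memory use stays roughly flat.
            root.clear()
```

Because iterparse() accepts a file object, a compressed dump could probably be read directly with something like `with gzip.open("Wikipedia/test.xml.gz", "rb") as f: for ts in iter_timestamps(f): print(ts)` (the .gz path here is hypothetical), which avoids writing an uncompressed copy to disk.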

Using BeautifulSoup with an iterator instead of a string?
