Python: writing several files at a time when running a script

Asked: 2012-06-21 17:25:37

Tags: python xml performance beautifulsoup

I'm new to programming. I wrote a Python script (with help from Stack Overflow) that reads the XML files in a directory and, using BeautifulSoup, copies an element's text and writes it to an output directory. Here is my code:

from bs4 import BeautifulSoup
import os, sys

source_folder='/Python27/source_xml'

for article in os.listdir(source_folder):
    with open(source_folder + '/' + article) as f:
        soup = BeautifulSoup(f)

    #I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text("  ",strip=True)

    #This grabs an ID which I used as the output file name
    article_id=soup.article_id.get_text("  ",strip=True)

    with open(article_id,"wb") as f:
        f.write(clean_text.encode("UTF-8"))

The problem is that I need this to run over 100,000 XML files. I started the program last night, and by my estimate it will take 15 hours to finish. Is there any way to, for example, read/write 100 or 1,000 files at a time instead of one at a time?

2 answers:

Answer 0 (score: 3)

The multiprocessing module is your friend. You can use a process pool to automatically scale the code across all your CPUs, like this:

from bs4 import BeautifulSoup
import os, sys
from multiprocessing import Pool


source_folder='/Python27/source_xml'

def extract_text(article):
    with open(source_folder + '/' + article) as f:
        soup = BeautifulSoup(f)

    #I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text("  ",strip=True)

    #This grabs an ID which I used as the output file name
    article_id=soup.article_id.get_text("  ",strip=True)

    with open(article_id,"wb") as f:
        f.write(clean_text.encode("UTF-8"))

def main():
    pool = Pool()
    pool.map(extract_text, os.listdir(source_folder))

if __name__ == '__main__':
    main()
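With 100,000 small jobs, the per-task dispatch overhead of `pool.map` can also matter. `map` and `imap_unordered` accept a `chunksize` argument that hands each worker a batch of filenames per round trip instead of one at a time. A minimal sketch of the idea, where `work` is just a picklable stand-in for `extract_text` and `chunksize=100` is an illustrative guess rather than a tuned value:

```python
from multiprocessing import Pool

def work(item):
    # Stand-in for extract_text: pool workers need a picklable,
    # top-level function.
    return item * item

def run_batched(items, chunksize=100):
    # chunksize batches items per worker dispatch, cutting
    # inter-process communication overhead for many small tasks.
    pool = Pool()
    try:
        # imap_unordered yields results as workers finish, in any order,
        # so we sort here only to get a deterministic return value.
        return sorted(pool.imap_unordered(work, items, chunksize=chunksize))
    finally:
        pool.close()
        pool.join()
```

Since the real job writes each result to its own output file, the order in which tasks complete does not matter, which is exactly the case `imap_unordered` is meant for.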

Answer 1 (score: 0)

As much as I hate the idea, you can use threads. The GIL in Python is released during I/O operations, so you could run a thread pool and just consume jobs.
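To make the thread-pool idea concrete: parsing with BeautifulSoup is CPU-bound and stays serialized by the GIL, but the file reads and writes release it, so threads can still overlap the I/O. A minimal sketch using `concurrent.futures` (standard library since Python 3.2; on Python 2 this was the `futures` backport), with `process` as a placeholder for the real parse-and-write job:

```python
from concurrent.futures import ThreadPoolExecutor

def process(article):
    # Placeholder job: in the real script this would open one XML file,
    # extract the text, and write it to the output directory.
    return article.upper()

def run_threaded(articles, workers=8):
    # All threads share one process; the GIL is released during file I/O,
    # so reads and writes overlap even though parsing itself does not.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves input order and waits for all jobs to finish.
        return list(pool.map(process, articles))
```

The `workers=8` default here is an arbitrary starting point; for I/O-heavy work it is common to use more threads than CPUs and measure.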