Question

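I have downloaded and stored a lot of HTML files on my machine. Now I read their contents and extract the data I need to persist to MySQL. I load the files the traditional way, one by one, which is not efficient: it takes nearly 8 minutes.

Any advice is welcome.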

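import glob
import mmap

from pyquery import PyQuery as pq

# g_xml2csv_file, g_filter, delete() and cost_time are defined elsewhere in the script.

g_fields = [
    'name',
    'price',
    'productid',
    'site',
    'link',
    'smallImage',
    'bigImage',
    'description',
    'createdOn',
    'modifiedOn',
    'size',
    'weight',
    'wrap',
    'material',
    'packagingCount',
    'stock',
    'location',
    'popularity',
    'inStock',
    'categories',
]


@cost_time
def batch_xml2csv():
    """Batch-import the XML files into a single CSV file."""
    delete(g_xml2csv_file)
    with open(g_xml2csv_file, "a") as f:
        for path in glob.glob(g_filter):
            print("reading %s" % path)
            # Map the file read-only and parse its whole content.
            with open(path, "rb") as ff:
                data = mmap.mmap(ff.fileno(), 0, access=mmap.ACCESS_READ)
                s = pq(data[:])
                data.close()
            # s = pq(open(path, "r").read())
            line = []
            for field in g_fields:
                # Standard CSS attribute selector: <field name="...">
                r = s("field[name='%s']" % field).text()
                if not r:
                    # Missing or empty field: \N is read as NULL by MySQL's LOAD DATA
                    line.append("\\N")
                else:
                    # Escape embedded double quotes with a backslash
                    line.append('"%s"' % r.replace('"', '\\"'))
            f.write(",".join(line) + "\n")
    print("done!")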

I tried mmap, but it does not seem to help much.

Answer 1

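If you have 25,000 text files on disk, "you're doing it wrong". Depending on how you store them on disk, the slowness could literally be the disk seeking to find the files.

If you have 25,000 of anything, it will be faster to put it in a database with an intelligent index. Even if you make the index field the filename, it will be faster.

If you have multiple directories that descend N levels deep, a database would still be faster.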
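
A minimal sketch of that idea, assuming SQLite from Python's standard library as the database and the file name as the indexed key. The table name `pages`, its columns, and the helper names `store_files` and `load_content` are made up for illustration; the answer does not prescribe a particular database or schema.

```python
import os
import sqlite3


def store_files(conn, paths):
    """Copy each file's content into one table, keyed by file name."""
    # PRIMARY KEY makes SQLite maintain an index on filename.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "filename TEXT PRIMARY KEY, content BLOB)"
    )
    rows = []
    for path in paths:
        with open(path, "rb") as fh:
            # Assumption: base names are unique across directories.
            rows.append((os.path.basename(path), fh.read()))
    # One transaction for all rows avoids a commit per file.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO pages (filename, content) VALUES (?, ?)",
            rows,
        )


def load_content(conn, filename):
    """Look a page up through the primary-key index; None if absent."""
    row = conn.execute(
        "SELECT content FROM pages WHERE filename = ?", (filename,)
    ).fetchone()
    return row[0] if row else None
```

With something like this, the files would be loaded once via `store_files(sqlite3.connect("pages.db"), glob.glob(g_filter))`, and later runs would read from the single database file instead of opening 25,000 separate files. Whether that beats the file system in practice depends on the disk and directory layout, so it is worth timing both.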

Answer 2

If you use Scrapy, you can scan the files while they are being downloaded in multiple threads.
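
Scrapy's own pipeline is not reproduced here. As a stand-in for the same idea of overlapping work instead of handling files strictly one after another, here is a sketch that reads files already on disk through a standard-library thread pool. The function names `read_file` and `read_all` and the `max_workers` value are illustrative assumptions, and threads mainly help when the time goes into disk I/O rather than into parsing.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Return (path, raw bytes) for one file."""
    with open(path, "rb") as fh:
        return path, fh.read()


def read_all(paths, max_workers=8):
    """Read many small files concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))
```

The returned `(path, content)` pairs could then be fed to the parsing loop from the question. If profiling shows that parsing dominates, a process pool may be the better fit, since threads in CPython do not run Python bytecode in parallel.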

Answer 3

If the algorithm is correct, using the psyco module can sometimes help quite a lot. However, it does not work with Python 2.7 or Python 3+.

Fast reading of the contents of 25k small text files with Python
