Question

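I have a lot of text files (about 6000), each containing a list of IDs (one ID per line). Each file may contain anywhere from 10,000 to 10 million IDs.

How can I get a set of unique IDs from all of these files?

My current code looks like this: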

import csv
import glob
kk=glob.glob('C://Folder_with_all_txt_files/*')
ID_set=set()
for source in kk:
    a=[]
    csvReader = csv.reader(open(source, 'rt'))
    for row in csvReader:
        a.append(row)
    for i in xrange(len(a)):
        a[i]=a[i][0]
    s=set(a)
    ID_set=ID_set.union(s)
    del a,s

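Problems with the current code:

1) It consumes too much memory
2) It is too slow

Is there a more efficient way to do this?

Also, is it possible to use all CPU cores for this task?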

Answer 1

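Some thoughts:

Skip the creation of the set s. Just update ID_set directly.
Depending on what the files look like, you may be able to use read() and str.split() instead of the CSV reader.

Something like this might work for your dataset: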

import glob

id_set = set()
for filename in glob.glob('C://Folder_with_all_txt_files/*'):
    with open(filename) as f:
        ids = f.read().split()
        id_set.update(ids)
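
To illustrate the first suggestion, here is a small sketch of the difference between union() and update() on a built-in set. The variable names are made up for this example; the point is only that union() builds a new set on every call, whereas update() grows the existing one in place.

```python
ids = {"a", "b"}
same_object = ids

# union() returns a brand-new set and leaves the original untouched.
merged = ids.union({"b", "c"})
print(merged is ids)        # False

# update() adds the new elements to the existing set in place.
ids.update({"b", "c"})
print(ids is same_object)   # True
print(sorted(ids))          # ['a', 'b', 'c']
```

With thousands of files and a result set that keeps growing, avoiding a fresh copy of the accumulated set on every iteration is likely to save both time and memory.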

Answer 2

This approach may be a little slower than Raymond's, but it avoids loading each whole file into memory at once.
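
A minimal sketch of such an approach, assuming one ID per line and iterating over each file object line by line rather than calling read(). The function name collect_unique_ids and the choice to skip blank lines are assumptions made for this illustration, not part of the original answer.

```python
import glob


def collect_unique_ids(pattern):
    """Return the set of unique IDs found in all files matching pattern.

    Assumes one ID per line; blank lines are skipped.
    """
    id_set = set()
    for filename in glob.glob(pattern):
        with open(filename) as f:
            # Iterating over the file object yields one line at a time,
            # so a whole file is never held in memory as a single string.
            for line in f:
                line = line.strip()
                if line:
                    id_set.add(line)
    return id_set
```

It could be called as collect_unique_ids('C://Folder_with_all_txt_files/*'). The resulting set still has to hold every unique ID, so this only reduces the temporary memory needed per file, not the size of the final result.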


How to efficiently get a set of unique values from many lists (Python)

2 answers: