Question

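I have a big file of numbers, for example:

Every day I extract some of the numbers from the big file and save that day's numbers in a second file. New numbers are added to the source data in my big file every day. I need to create a filter for the extraction job that makes sure I don't extract numbers I have already extracted. How can I do this with a bash or Python script?

Note: I can't delete numbers from the source data (the "big file"). I need it to stay intact, because once I have finished extracting numbers from the file I need the original plus the updated data for the next day's job. If I made a copy of the file and deleted the numbers from the copy, the newly added numbers would not be taken into account.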

Answer 1

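Read all the numbers from the big file into a set, then test the new numbers against that set:

# Collect every number already present in the big file.
with open('bigfile.txt') as bigfile:
    existing_numbers = {n.strip() for n in bigfile}

# Open the big file in append mode ('a'); 'w' would truncate it.
with open('newfile.txt') as newfile, open('bigfile.txt', 'a') as bigfile:
    for number in newfile:
        number = number.strip()
        if number not in existing_numbers:
            bigfile.write(number + '\n')

This appends the numbers that are not already in bigfile to the end of it, as efficiently as possible.

If bigfile grows too large for this to run efficiently, you may need to use a database.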
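
If a database does become necessary, one possible shape is sketched below with Python's built-in sqlite3 module. The table name `extracted`, the function name `record_new_numbers` and the in-memory database are assumptions made for illustration and are not part of the answer above. The primary key lets SQLite reject numbers that were already stored.

```python
import sqlite3


def record_new_numbers(conn, numbers):
    """Store numbers that were not seen before and return them in input order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted (number TEXT PRIMARY KEY)"
    )
    fresh = []
    for number in numbers:
        # INSERT OR IGNORE skips rows that would violate the primary key,
        # so rowcount is 1 only when the number was not stored yet.
        cur = conn.execute(
            "INSERT OR IGNORE INTO extracted (number) VALUES (?)", (number,)
        )
        if cur.rowcount == 1:
            fresh.append(number)
    conn.commit()
    return fresh


# Hypothetical usage: ":memory:" keeps the demo self-contained;
# a file path would persist the numbers between daily runs.
conn = sqlite3.connect(":memory:")
first = record_new_numbers(conn, ["3120987654", "3106982658"])
second = record_new_numbers(conn, ["3106982658", "3420787642"])
print(first)
print(second)
```

With this layout the set of known numbers lives on disk rather than in memory, so the daily job only has to feed candidate numbers through the function.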

Answer 2

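You can save sorted versions of your source file and of the extracted data to temporary files, and use a standard POSIX tool such as comm to show the lines (records) they have in common. Those common lines are the basis of the "filter" you use in subsequent extraction jobs. If you extract records from the source.txt file with $SHELL commands, then something like grep -v [list of common lines] would be part of your script, along with whatever other criteria you use for extracting records. For best results, the source.txt and extracted.txt files should be sorted.

Here is a quick cut and paste of typical comm output. The sequence shows the "big file", then the extracted data, and then the final comm command, which shows the lines unique to the source.txt file (see man comm, i.e. comm(1), for how comm works). After that comes an example of searching with an arbitrary pattern using grep while also applying a "filter" that excludes the common lines.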

% cat source.txt
3120987654
3106982658
3420787642
3210957659
3320987654
3520987654
3520987754
3520987954
3520988654
3520987444

% cat extracted.txt 
3120987654
3106982658
3420787642
3210957659
3320987654

% comm -2 -3 source.txt extracted.txt  # show lines only in source.txt
3520987654
3520987754
3520987954
3520988654
3520987444

comm selects or rejects lines common to two files. The utility conforms to IEEE Std 1003.2-1992 ("POSIX.2"). We can save its output for use with grep:

% comm -1 -2 source.txt extracted.txt | sort > common.txt
% grep -v -f common.txt source.txt | grep -E ".*444$"

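This greps the source.txt file and excludes the lines common to source.txt and extracted.txt; it then pipes (|) these "filtered" results to a second grep to extract a new record (in this case, the line or lines ending in "444"). If the files are very large, or if you want to preserve the order of the numbers in the original file and in the extracted data, the problem is more complex and the response would need to be more elaborate.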
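
For the order-preserving case, a minimal Python sketch of the same filter might look like the following. It is an alternative to the comm and grep pipeline, not a translation of it: it keeps the already extracted numbers in a set and walks the source in its original order, so neither input has to be sorted. The function name `unextracted` and its parameters are hypothetical.

```python
import re


def unextracted(source_lines, extracted_lines, pattern=None):
    """Return source numbers not yet extracted, in their original order."""
    seen = {line.strip() for line in extracted_lines}
    regex = re.compile(pattern) if pattern else None
    result = []
    for line in source_lines:
        number = line.strip()
        if not number or number in seen:
            continue  # skip blank lines and numbers that were already extracted
        if regex is None or regex.search(number):
            result.append(number)
    return result


source = [
    "3120987654", "3106982658", "3420787642", "3210957659", "3320987654",
    "3520987654", "3520987754", "3520987954", "3520988654", "3520987444",
]
extracted = source[:5]

# Same selection as the grep -E ".*444$" step, applied to the unextracted numbers.
print(unextracted(source, extracted, r"444$"))
```

Passing open file objects instead of lists works the same way, since the function only iterates over lines.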

See my other response for the beginnings of a simple alternative approach using perl.

Answer 3

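I don't think you are asking for unique values; rather, you want all the new values that were added since the last time you looked at the file?

Assume BigFile is getting new data all the time.

We want DailyFilemm_dd_yy to contain the new numbers received during the past 24 hours.

This script will do what you want. Run it every day.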

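BigFile=bigfile
DailyFile=dailyfile
today=$(date +"%m_%d_%Y")
# Get the month, day, year for yesterday.
# (BSD/macOS date syntax; with GNU date use: date -d yesterday +"%m_%d_%Y")
yesterday=$(date -jf "%s" $(($(date +"%s") - 86400)) +"%m_%d_%Y")

# Snapshot today's big file, then keep the lines that were not in yesterday's snapshot.
cp "$BigFile" "$BigFile$today"
comm -23 "$BigFile" "$BigFile$yesterday" > "$DailyFile$today"
rm "$BigFile$yesterday"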

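comm -23 shows the lines that appear only in the first file, that is, the lines that are not in both files. Both inputs are expected to be sorted.

An example of comm: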

#values added to big file
echo '111
222
333' > big

cp big yesterday

# New values added to big file over the day
echo '444
555' >> big

# Find out what values were added.
comm -23 big yesterday > today
cat today

Output

444
555
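
To see why comm wants sorted input, here is a rough Python sketch of the merge-style comparison that comm -23 performs. It is an illustration of the idea rather than comm's actual implementation, and the function name `only_in_first` is made up for this example.

```python
def only_in_first(first, second):
    """Mimic comm -23: return lines of sorted list `first` absent from sorted list `second`."""
    result = []
    i = j = 0
    while i < len(first) and j < len(second):
        if first[i] == second[j]:
            # Line is in both files: suppressed by -3.
            i += 1
            j += 1
        elif first[i] < second[j]:
            # Line sorts before anything left in `second`, so it is unique to `first`.
            result.append(first[i])
            i += 1
        else:
            # Line is unique to `second`: suppressed by -2.
            j += 1
    # Whatever remains in `first` has no counterpart in `second`.
    result.extend(first[i:])
    return result


yesterday = ["111", "222", "333"]
big = ["111", "222", "333", "444", "555"]
added = only_in_first(big, yesterday)
print(added)
```

Because each list is walked once from front to back, a line that is out of order can be compared against the wrong neighbour, which is why unsorted input can give misleading results.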

Answer 4

A lazy perl approach.

Just write your own selection() subroutine to replace grep {/.*444$/} ;-)

#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0;

use Tie::File;
use Array::Utils qw(:all);

# Tie each file to an array, one element per line.
tie my @source, 'Tie::File', 'source.txt';
tie my @extracted, 'Tie::File', 'extracted.txt';

# Find the intersection
my @common = intersect(@source, @extracted);

say "Numbers already extracted";
say for @common;

untie @source;
untie @extracted;

Once the source.txt file has been updated, you can select from it:

#!/usr/bin/env perl
use strict; use warnings; use autodie;
use 5.16.0;

use Tie::File;
use Array::Utils qw(:all);

# Tie each file to an array, one element per line.
tie my @source, 'Tie::File', 'source.txt';
tie my @extracted, 'Tie::File', 'extracted.txt';

# Find the intersection
my @common = intersect(@source, @extracted);

# Select from source.txt excluding numbers already selected:
my @newselect = array_minus(@source, @common);
say "new selection:";
# grep returns a list, so $selection needs "()" for list context.
my ($selection) = grep {/.*444$/} @newselect;
# Appending to the tied array writes the number to extracted.txt.
push @extracted, $selection if defined $selection;
say "updated extracted.txt";

untie @source;
untie @extracted;

This uses two modules... succinct and idiomatic versions are welcome!
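
For comparison, the same select-then-record workflow could be sketched in Python with only the standard library. The names `extract_once` and `selection`, and the temporary files used for the demonstration, are assumptions for illustration. The sketch also assumes one number per line and that extracted.txt ends with a newline.

```python
import os
import tempfile


def extract_once(source_path, extracted_path, selection):
    """Append newly selected numbers to extracted_path and return them."""
    with open(extracted_path) as f:
        done = {line.strip() for line in f}
    chosen = []
    with open(source_path) as f:
        for line in f:
            number = line.strip()
            # Keep a number only if it was never extracted and passes selection().
            if number and number not in done and selection(number):
                chosen.append(number)
                done.add(number)
    # The source file is only read; the extraction log is opened in append mode.
    with open(extracted_path, "a") as f:
        for number in chosen:
            f.write(number + "\n")
    return chosen


# Hypothetical demo data written to a temporary directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "source.txt")
extracted_path = os.path.join(workdir, "extracted.txt")
with open(source_path, "w") as f:
    f.write("3520987444\n3520988654\n3120987654\n")
with open(extracted_path, "w") as f:
    f.write("3120987654\n")

first_run = extract_once(source_path, extracted_path, lambda n: n.endswith("444"))
second_run = extract_once(source_path, extracted_path, lambda n: n.endswith("444"))
print(first_run, second_run)
```

Running the job twice with the same selection returns nothing the second time, because the first run already recorded its result in the extraction log.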

Extract lines from a file only once using shell / python / perl
