Question

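I have to revive an old question, with a modification for long files.

I have the ages of two stars in two files (File1 and File2). The age of the star is in column $1, and the remaining columns, up to $13, hold information that I need to print at the end.

I am trying to find an age at which the stars have the same age or the closest age. Since the files are large (~25000 lines), I don't want to search the whole array, for speed reasons. Also, the files can differ a lot in number of lines (say ~10000 in some cases).

I am not sure this is the best way to solve the problem, but for lack of a better one, this is my idea. (If you have a faster and more efficient method, please suggest it.)

All the values have 12 decimals of precision. For now I am only concerned with the first column (where the age is).

And I need different loops.

Let's use this value from File1: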

2.326062371284e+05

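First, the routine should search File2 for all the matches that contain

2.3260e+05

(This loop will probably search the whole array, but if there is a way to stop the search as soon as it reaches 2.3261, that would save some time.)

If it finds just one, the output should be that value.

Usually it will find several lines, maybe even up to 1000. In that case, it should search again for

2.32606e+05

among the lines found before. (I think this is a nested loop.) The number of matches will then drop to ~200.

At that point, the routine should search for the best difference, within a certain tolerance X, between

2.326062371284e+05

and all of these 200 lines.

So, given these files:

File1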

1.833800650355e+05 col2f1 col3f1 col4f1
1.959443501406e+05 col2f1 col3f1 col4f1
2.085086352458e+05 col2f1 col3f1 col4f1
2.210729203510e+05 col2f1 col3f1 col4f1
2.326062371284e+05 col2f1 col3f1 col4f1
2.441395539059e+05 col2f1 col3f1 col4f1
2.556728706833e+05 col2f1 col3f1 col4f1

File2

2.210729203510e+05 col2f2 col3f2 col4f2
2.354895663228e+05 col2f2 col3f2 col4f2
2.499062122946e+05 col2f2 col3f2 col4f2
2.643228582664e+05 col2f2 col3f2 col4f2
2.787395042382e+05 col2f2 col3f2 col4f2
2.921130362004e+05 col2f2 col3f2 col4f2
3.054865681626e+05 col2f2 col3f2 col4f2

Output File3 (with a tolerance of 3000)

2.210729203510e+05 2.210729203510e+05 col2f1 col2f2 col4f1 col3f2
2.326062371284e+05 2.354895663228e+05 col2f1 col2f2 col4f1 col3f2

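Important condition:

The output should not contain repeated lines (star 1 can't have, at a fixed age, several different ages for star 2, just the closest one).

How would you solve this?

Many thanks!

P.S. I've completely changed the question, since it was shown to me that my reasoning had some errors. Thanks!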

Answer 1

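Not an awk solution, but there comes a time when other solutions are great too, so here is an answer using R.

New answer with different data, this time not reading from a file, to build an example: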

library(data.table) # install.packages('data.table') if it is not installed
# Sample data for the code; use fread to read from a file and setnames to name the columns accordingly
set.seed(123)
data <- data.table(age=runif(20)*1e6,name=sample(state.name,20),sat=sample(mtcars$cyl,20),dens=sample(DNase$density,20))
data2 <- data.table(age=runif(10)*1e6,name=sample(state.name,10),sat=sample(mtcars$cyl,10),dens=sample(DNase$density,10))

setkey(data,'age') # Set the key for joining to the age column
setkey(data2,'age') # Set the key for joining to the age column

# get the result
result=data[ # To get all the data from file 1 and file 2 at the end
         data2[ 
           data, # Search for each star of list 1
           .SD, # return columns of file 2
           roll='nearest',by=.EACHI, # Join on each line (left join) and find nearest value
          .SDcols=c('age','name','dens')]
       ][!duplicated(age) & abs(i.age - age) < 1e3,.SD,.SDcols=c('age','i.age','name','i.name','dens','i.dens') ] # filter duplicates in first file and on difference
# Write results to a file (change the separator as desired):
write.table(format(result,digits=15,scientific=TRUE),"c:/test.txt",sep=" ")

Code:

# A nice package to have, install.packages('data.table') if it's not present
library(data.table)
# Read the data (the text can be file names)
stars1 <- fread("1.833800650355e+05
1.959443501406e+05
2.085086352458e+05
2.210729203510e+05
2.326062371284e+05
2.441395539059e+05
2.556728706833e+05")

stars2 <- fread("2.210729203510e+05
2.354895663228e+05
2.499062122946e+05
2.643228582664e+05
2.787395042382e+05
2.921130362004e+05
3.054865681626e+05")

# Name the columns (not needed if the file has a header)
colnames(stars1) <- "age"
colnames(stars2) <- "age"

# Key the data tables (for a fast join with binary search later)
setkey(stars1,'age')
setkey(stars2,'age')

# Get the result (more details below on what is happening here :))
result=stars2[ stars1, age, roll="nearest", by=.EACHI]

# Rename the columns so we can filter the whole result
setnames(result,make.unique(names(result)))

# Final filter on difference
result[abs(age.1 - age) < 3e3]

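So the interesting part is the first "join" on the two lists of star ages, which searches, for each age in stars1, the nearest one in stars2.

This gives (after renaming the columns):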

> result
        age    age.1
1: 183380.1 221072.9
2: 195944.4 221072.9
3: 208508.6 221072.9
4: 221072.9 221072.9
5: 232606.2 235489.6
6: 244139.6 249906.2
7: 255672.9 249906.2

Now that we have the nearest value for each, filter those that are close enough (absolute difference below 3,000 here):

> result[abs(age.1 - age) < 3e3]
        age    age.1
1: 221072.9 221072.9
2: 232606.2 235489.6
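For readers without R, the same two steps (nearest match, then tolerance filter) can be sketched in Python with the standard `bisect` module. This is only an illustration, not a port of the data.table rolling join: the function name `nearest_within` is made up for this example, and it assumes the File2 ages are already sorted in ascending order and not empty.

```python
from bisect import bisect_left

def nearest_within(ages1, ages2, tolerance):
    """Pair each age in ages1 with its nearest age in sorted ages2,
    keeping only pairs whose absolute difference is below tolerance."""
    pairs = []
    for a in ages1:
        i = bisect_left(ages2, a)  # binary search for the insertion point
        # The nearest value is the neighbour just below or just above it.
        candidates = ages2[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda b: abs(b - a))
        if abs(best - a) < tolerance:
            pairs.append((a, best))
    return pairs

stars1 = [1.833800650355e+05, 1.959443501406e+05, 2.085086352458e+05,
          2.210729203510e+05, 2.326062371284e+05, 2.441395539059e+05,
          2.556728706833e+05]
stars2 = [2.210729203510e+05, 2.354895663228e+05, 2.499062122946e+05,
          2.643228582664e+05, 2.787395042382e+05, 2.921130362004e+05,
          3.054865681626e+05]

matches = nearest_within(stars1, stars2, 3e3)
for a, b in matches:
    print(f"{a:.12e} {b:.12e}")
```

Each lookup is a binary search, so the cost is roughly proportional to the number of File1 rows times the logarithm of the number of File2 rows, with no need for the prefix-narrowing loops described in the question.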

Answer 2

Perl to the rescue. This should be very fast, as it does a binary search in the given range.

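#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use List::Util qw{ max min };
use constant { SIZE      => 100,   # half-width of the search window, in lines
               TOLERANCE => 3000,  # maximal accepted age difference
           };


# Load all of file2 into memory (one age per line, sorted ascending).
my @times2;
open my $F2, '<', 'file2' or die $!;
while (<$F2>) {
    chomp;
    push @times2, $_;
}

my $num = 0;  # current line number in file1, used as the window centre
open my $F1, '<', 'file1' or die $!;
while (my $time = <$F1>) {
    chomp $time;

    # Only search file2 within SIZE lines around the current position.
    my $from = max(0, $num - SIZE);
    my $to   = min($#times2, $num + SIZE);
    my $between;
    # Binary search inside the window.
    while (1) {
        $between = int(($from + $to) / 2);

        if ($time < $times2[$between] && $to != $between) {
            $to = $between;

        } elsif ($time > $times2[$between] && $from != $between) {
            $from = $between;

        } else {
            last
        }
    }
    $num++;
    # The search stopped between two neighbours: pick the closer one.
    if ($from != $to) {
        my $f = $time - $times2[$from];
        my $t = $times2[$to] - $time;
        $between = ($f > $t) ? $to : $from;
    }
    # Print the pair only if it is within the tolerance.
    say "$time $times2[$between]" if TOLERANCE >= abs $times2[$between] - $time;
}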
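Because both files are sorted by age, a single forward pass over the two lists is another option, and it does not depend on a fixed window size. The Python sketch below is an illustration rather than a translation of the Perl script. The names `parse` and `closest_rows` are hypothetical, and it reads the "no repeated lines" condition as "each File2 row is paired with at most one File1 row, the closest one", which is an assumption about what the question intends.

```python
def parse(lines):
    """Turn 'age col2 col3 ...' lines into (age, [other columns]) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        rows.append((float(fields[0]), fields[1:]))
    return rows

def closest_rows(rows1, rows2, tolerance):
    """Two-pointer scan over rows sorted by age.

    Returns (row1, row2) pairs within tolerance; each row of rows2 is
    kept only for the row of rows1 closest to it (assumed reading of
    the 'no repeated lines' condition)."""
    if not rows2:
        return []
    best = {}  # index in rows2 -> (difference, index in rows1)
    j = 0
    for i, (age, _) in enumerate(rows1):
        # Ages in rows1 only grow, so the pointer into rows2 never moves back.
        while (j + 1 < len(rows2)
               and abs(rows2[j + 1][0] - age) <= abs(rows2[j][0] - age)):
            j += 1
        diff = abs(rows2[j][0] - age)
        if diff <= tolerance and (j not in best or diff < best[j][0]):
            best[j] = (diff, i)
    return [(rows1[i], rows2[j]) for j, (_, i) in sorted(best.items())]

file1 = parse([
    "1.833800650355e+05 col2f1 col3f1 col4f1",
    "1.959443501406e+05 col2f1 col3f1 col4f1",
    "2.085086352458e+05 col2f1 col3f1 col4f1",
    "2.210729203510e+05 col2f1 col3f1 col4f1",
    "2.326062371284e+05 col2f1 col3f1 col4f1",
    "2.441395539059e+05 col2f1 col3f1 col4f1",
    "2.556728706833e+05 col2f1 col3f1 col4f1",
])
file2 = parse([
    "2.210729203510e+05 col2f2 col3f2 col4f2",
    "2.354895663228e+05 col2f2 col3f2 col4f2",
    "2.499062122946e+05 col2f2 col3f2 col4f2",
    "2.643228582664e+05 col2f2 col3f2 col4f2",
])

result = closest_rows(file1, file2, 3000)
for (age1, rest1), (age2, rest2) in result:
    print(f"{age1:.12e} {age2:.12e}", *rest1, *rest2)
```

Keeping the other columns alongside each age means the final print can pick whichever of the 13 columns are needed. The scan touches each row of either file at most once, so the running time grows linearly with the combined number of lines.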

Closest value in different files, with different numbers of lines and other conditions (bash awk other)
