Question

I have two files formatted like this:

File1:

word token occurrence

File2:

token occurrence

What I want is a third file with this output:

word token occurrence1/occurrence2

Here is my code:

while read token pos count
do
    #get pos counts
    poscount=$(grep "^$pos" $2 | cut -f 2)
    #calculate probability
    prob=$(echo "scale=5;$count / $poscount" | bc -l)
    #print token, pos-tag & probability
    echo -e "$token\t$pos\t$prob"
done < $1

The problem is that my output looks like this:

-   :   .25000
:   :   .75000
'   ''  1.00000
0   CD  .00396
1000    CD  .00793
13  CD  .00793
13th    JJ  .00073
36
29
16  CD  .00396
17  CD  .00396

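Some lines contain only a number, and I don't know where they come from; they are not in either input file.

Why do these numbers appear? Is there a way to remove these lines? Thanks in advance!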

Answer 1

A method using paste, cut and dc:

echo "5 k $(paste file[12] | cut -f 3,5) / p" | dc | \
paste file1 - | cut --complement -f 3

A method using bash, paste and dc:

paste <(join -1 2 file1 -2 1 file2 -o 1.1,1.2)  \
  <(echo "5 k $(join -1 2 file1 -2 1 file2 -o 1.3,2.2) / p" | dc)
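As for where the stray numbers come from: the output in the question is consistent with grep "^$pos" matching every tag that starts with the same prefix (for example JJ, JJR and JJS), so $poscount holds several lines and bc prints each extra line back as its own result. That is an inference from the posted output, not something checked against the real files. A minimal sketch that avoids it by looking the tag up by exact key is below. It swaps the grep/bc loop for a single awk pass, assumes both files are tab-separated, and tag_ratio is only an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch only: tag_ratio is a hypothetical helper name, and both inputs are
# assumed to be tab-separated (file1: word, tag, count; file2: tag, total).
tag_ratio() {
    # First pass (NR == FNR) reads the tag totals from file2 into an array.
    # Second pass looks up each file1 tag by exact key, so "JJ" cannot also
    # pick up "JJR" or "JJS". Tags with a missing or zero total are skipped.
    awk -F'\t' '
        NR == FNR { total[$1] = $2; next }
        ($2 in total) && total[$2] != 0 {
            printf "%s\t%s\t%.5f\n", $1, $2, $3 / total[$2]
        }
    ' "$2" "$1"
}
```

Called as tag_ratio file1 file2 > file3. Unlike bc with scale=5, printf "%.5f" rounds instead of truncating and prints a leading zero, so the digits can differ slightly from the loop's output.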
