Question

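I have a large text file with 2 columns and more than 2 million rows. Each value represents an id, and ids may be repeated. There are about 100,000 unique ids.

I want to identify each id and renumber them starting from 1. Every occurrence of the same id should get the same new id. If possible, this should be done in bash.

The output could look something like: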


Answer 1

Using bash and sort:

#!/bin/bash

shopt -s lastpipe
declare -A hash    # declare associative array
index=1

# read file and fill associative array
while read -r a b; do
  echo "$a"
  echo "$b"
done <file | sort -nu | while read -r x; do
  hash[$x]="$((index++))"
done

# read file and print values from associative array
while read -r a b; do
  echo "${hash[$a]} ${hash[$b]}"
done < file

Output:

See: man bash and man sort

Answer 2

Doing this in pure bash will be very slow. I suggest:

tr -s '[:blank:]' '\n' <file |
  sort -un |
  awk '
    NR == FNR {id[$1] = FNR; next}
    {for (i=1; i<=NF; i++) {$i = id[$i]}; print}
  ' - file
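
To see what this pipeline produces, here is a minimal sketch on a made-up three-line input. The sample values and the function name renumber_sorted are assumptions for illustration, not part of the answer. Because awk reads the sorted id list from stdin (-) first, NR == FNR is true only for that list, so each old id gets its rank in numeric order as its new id.

```shell
#!/bin/sh
# Hypothetical helper wrapping the pipeline above; $1 is the input file.
renumber_sorted() {
  tr -s '[:blank:]' '\n' < "$1" |
    sort -un |
    awk '
      NR == FNR { id[$1] = FNR; next }                    # stdin: old id -> rank
      { for (i = 1; i <= NF; i++) $i = id[$i]; print }    # file: rewrite each id
    ' - "$1"
}

# Made-up sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' '40 10' '10 30' '30 40' > "$sample"

renumbered=$(renumber_sorted "$sample")
echo "$renumbered"
# prints:
# 3 1
# 1 2
# 2 3

rm -f "$sample"
```

The new numbers follow the numeric order of the old ids (10, 30, 40 become 1, 2, 3), not the order in which the ids appear in the file.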

Answer 3

Pure Bash, reading the file only once:

declare -A hash
index=1
while read -r a b; do
  [[ ${hash[$a]} ]] || hash[$a]=$((index++))   # assign index only if not set already
  [[ ${hash[$b]} ]] || hash[$b]=$((index++))   # assign index only if not set already
  printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed

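Notes:

Indices are assigned in the order the ids are read (not in sorted order)
Only one pass is made over the file (not two, as in the other solutions)
Bash's read is slower than awk; however, the same logic implemented in Perl or Python would be faster
because of the hash lookups

Output: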

Answer 4

awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in

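This merges the two columns into one, sorts that into numeric order (-n) and makes it unique (-u), then has awk use the resulting sequence to build an array that maps the old ids to the new ones.

Then, for each line of the input, it swaps the old ids for the new ones and prints the line.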
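
The two stages described above can also be looked at separately. A rough sketch on a made-up input (the file contents and the variable names unique_ids and mapping are illustrative assumptions):

```shell
#!/bin/sh
# Made-up sample data, written to a temporary file.
sample=$(mktemp)
printf '%s\n' '40 10' '10 30' '30 40' > "$sample"

# Stage 1: merge both columns into one, sort numerically, drop duplicates.
unique_ids=$( { cut -d' ' -f1 "$sample"; cut -d' ' -f2 "$sample"; } | sort -nu )
echo "$unique_ids"
# prints:
# 10
# 30
# 40

# Stage 2: the line number of each unique id becomes its new id.
mapping=$(echo "$unique_ids" | awk '{ print $1, "->", NR }')
echo "$mapping"
# prints:
# 10 -> 1
# 30 -> 2
# 40 -> 3

rm -f "$sample"
```

In the answer itself the mapping is never printed; awk keeps it in the ids array while reading the first input and applies it while reading the second.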

Answer 5

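Just keep a monotonic counter and a table of the numbers seen so far; when you see a new id, give it the counter's value and increment the counter: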

awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
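
A quick check on a made-up input, with the one-liner wrapped in a hypothetical function renumber_first_seen (the name and the sample values are assumptions for illustration). Ids are numbered in the order they are first seen, reading left to right and top to bottom, so no sorting pass is needed.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner; reads the named files or stdin.
renumber_first_seen() {
  awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' "$@"
}

# Made-up sample data: 40 is seen first, then 10, then 30.
result=$(printf '%s\n' '40 10' '10 30' '30 40' | renumber_first_seen)
echo "$result"
# prints:
# 1 2
# 2 3
# 3 1
```

Because the counter N starts at 1, a stored value is never 0, which is why the !a[...] test reliably means "not seen yet".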

Renumbering the numbers in a text file based on a unique mapping
