I have a TSV file of chemical names, which can contain parentheses, commas, + and - signs, spaces, [], {}, and so on.
$ cat random.txt
ACETYLTHIOCHOLINE CHLORIDE CDRD-00117030-01
ACETYLTRYPTOPHANAMIDE CDRD-00118894-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYLTRYPTOPHAN CDRD-00117996-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETAMINOPHEN CDRD-00116365-01
ACETAMIDE CDRD-00116997-01
ACETYLSALICYLIC ACID CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01
What I want is the file sorted on the first column:
$ cat correct.txt
ACETAMIDE CDRD-00116997-01
ACETAMINOPHEN CDRD-00116365-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01
ACETYLSALICYLIC ACID CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE CDRD-00117030-01
ACETYLTRYPTOPHAN CDRD-00117996-01
ACETYLTRYPTOPHANAMIDE CDRD-00118894-01
What I get instead:
$ sort -k1,1 -t $'\t' -f -n random.txt > wrong.txt
$ cat wrong.txt
ACETAMIDE CDRD-00116997-01
ACETAMINOPHEN CDRD-00116365-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYLSALICYLIC ACID CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE CDRD-00117030-01
ACETYLTRYPTOPHANAMIDE CDRD-00118894-01
ACETYLTRYPTOPHAN CDRD-00117996-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01
Note that ACETYL TYROSINE ETHYL ESTER should come right after ACETYL ISOGAMBOGIC ACID, and ACETYLTRYPTOPHAN should come before ACETYLTRYPTOPHANAMIDE.
This matters because join complains that ACETYL TYROSINE ETHYL ESTER is not sorted (and, once that is fixed, it complains about ACETYLTRYPTOPHAN):
The second file for join:
$ cat test_data.txt
Acetamide 0.904 0.146 0.134 -0.196
Acetyltryptophan -0.558 -0.471 -0.13 -0.332
The result of join with wrong.txt:
$ join -a1 -1 1 -2 1 -t $'\t' -i test_data.txt wrong.txt
Acetamide 0.904 0.146 0.134 -0.196 CDRD-00116997-01
Acetyltryptophan -0.558 -0.471 -0.13 -0.332
join: wrong.txt:9: is not sorted: ACETYLTRYPTOPHAN CDRD-00117996-01
Of course, join with correct.txt works:
$ join -a1 -1 1 -2 1 -t $'\t' -i test_data.txt correct.txt
Acetamide 0.904 0.146 0.134 -0.196 CDRD-00116997-01
Acetyltryptophan -0.558 -0.471 -0.13 -0.332 CDRD-00117996-01
This sort invocation does not give me the desired output either:
$ sort -k1,1 -t $'\t' -f -V random.txt
ACETAMIDE CDRD-00116997-01
ACETAMINOPHEN CDRD-00116365-01
ACETYLSALICYLIC ACID CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE CDRD-00117030-01
ACETYLTRYPTOPHAN CDRD-00117996-01
ACETYLTRYPTOPHANAMIDE CDRD-00118894-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01
How can I get sort to output what I want?
Answer 0: (score: 3)
Your locale is probably getting in the way. Try:
LANG=C sort -k1,1 -t $'\t' -f random.txt
Credit: https://superuser.com/questions/625223/sort-tab-delimited-text-fields-involving-spaces
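A minimal sketch of why the locale matters, assuming GNU coreutils and that mktemp is available. In the C locale, sort compares bytes, so a space (0x20) sorts before any letter and ACETYL TYROSINE ETHYL ESTER lands ahead of ACETYLSALICYLIC ACID. On glibc systems, locales such as en_US.UTF-8 typically give spaces and punctuation little or no weight, which is what pushes the names containing spaces out of place. Note that LC_ALL takes precedence over LC_COLLATE, which in turn takes precedence over LANG, so LANG=C alone has no effect if one of the other two is already set. The file name demo.tsv and the three sample rows are only for illustration.

```shell
# Work in a throwaway directory (demo.tsv is a hypothetical file name).
dir=$(mktemp -d)
tab=$(printf '\t')

printf 'ACETYLTRYPTOPHAN\tCDRD-00117996-01\n'            >  "$dir/demo.tsv"
printf 'ACETYL TYROSINE ETHYL ESTER\tCDRD-00118256-01\n' >> "$dir/demo.tsv"
printf 'ACETYLSALICYLIC ACID\tCDRD-00117028-01\n'        >> "$dir/demo.tsv"

# LC_ALL=C forces byte-order comparison regardless of LANG or LC_COLLATE.
# cut -f1 keeps only the name column so the ordering is easy to read.
sorted=$(LC_ALL=C sort -t "$tab" -k1,1 "$dir/demo.tsv" | cut -f1)
printf '%s\n' "$sorted"

rm -r "$dir"
```

Running `locale` shows which of LANG, LC_COLLATE, and LC_ALL are currently set in your shell.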
Answer 1: (score: 1)
Remove -n (numeric sort) and it should work:
sort -k1,1 -t $'\t' random.txt
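To see why -n gets in the way: a numeric sort reads each key as a number, and a name with no leading digits counts as 0, so every key ties. GNU sort then breaks the ties by comparing whole lines under the current locale, which is not the first-column order that was asked for. The sketch below assumes GNU sort (the -s flag, which switches off that tie-breaking, is not required by POSIX) and uses made-up sample lines.

```shell
tab=$(printf '\t')
# Three made-up rows, deliberately out of order.
input=$(printf 'B NAME\tx\nA NAME\ty\nC NAME\tz')

# With -n every key evaluates to 0. Adding -s (stable) disables the
# whole-line tie-break, so the rows come back in their original order.
numeric=$(printf '%s\n' "$input" | LC_ALL=C sort -s -n -t "$tab" -k1,1)

# Without -n the key is compared as text and the rows are ordered by name.
textual=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$tab" -k1,1)

printf '%s\n---\n%s\n' "$numeric" "$textual"
```

The first result is identical to the input, which shows that -n never looked at the names at all.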
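This assumes, of course, that the locale you are using has the collation order you need. You can change it to test (if needed, and provided the locale has been compiled on your system):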
LC_COLLATE=en_US.utf8
Which country/region/language are you working in?
This works correctly:
LC_COLLATE=C sort -k1,1 -t $'\t' random.txt
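As a rough end-to-end sketch: join expects both inputs to be ordered by the collation it runs under, so sorting and joining in the same locale should avoid the "is not sorted" complaint. This assumes GNU coreutils and mktemp; the file names and the shortened data rows are made up for illustration. The -f on sort pairs with -i on join, since both treat upper and lower case alike.

```shell
tab=$(printf '\t')
dir=$(mktemp -d)

# Hypothetical inputs: an upper-case name list and a mixed-case data file.
printf 'ACETYLTRYPTOPHAN\tCDRD-00117996-01\n'            >  "$dir/names.tsv"
printf 'ACETYL TYROSINE ETHYL ESTER\tCDRD-00118256-01\n' >> "$dir/names.tsv"
printf 'ACETAMIDE\tCDRD-00116997-01\n'                   >> "$dir/names.tsv"
printf 'Acetyltryptophan\t-0.558\nAcetamide\t0.904\n'    >  "$dir/data.tsv"

# Sort both files with the same collation (C) and the same case folding.
LC_ALL=C sort -f -t "$tab" -k1,1 "$dir/names.tsv" > "$dir/names.sorted"
LC_ALL=C sort -f -t "$tab" -k1,1 "$dir/data.tsv"  > "$dir/data.sorted"

# Join under the same locale; -a1 also keeps unmatched lines from the data file.
joined=$(LC_ALL=C join -i -a1 -t "$tab" "$dir/data.sorted" "$dir/names.sorted")
printf '%s\n' "$joined"

rm -r "$dir"
```

If join still complains, `LC_ALL=C sort -c -f -t "$tab" -k1,1 FILE` reports the first out-of-order line in FILE without producing any output.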