Sort order with space characters

Time: 2016-03-22 22:08:14

Tags: bash python-2.7 sorting awk

I have a TSV file of chemical names, which can contain parentheses, commas, + and - signs, space characters, [], {}, and so on.

$ cat random.txt 
ACETYLTHIOCHOLINE CHLORIDE  CDRD-00117030-01
ACETYLTRYPTOPHANAMIDE   CDRD-00118894-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYLTRYPTOPHAN    CDRD-00117996-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETAMINOPHEN   CDRD-00116365-01
ACETAMIDE   CDRD-00116997-01
ACETYLSALICYLIC ACID    CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01

What I want is the file sorted on the first column:

$ cat correct.txt 
ACETAMIDE       CDRD-00116997-01
ACETAMINOPHEN       CDRD-00116365-01
ACETYL ISOALLOGAMBOGIC ACID     CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID     CDRD-00119007-01
ACETYL TYROSINE ETHYL ESTER     CDRD-00118256-01
ACETYLSALICYLIC ACID        CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID     CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE      CDRD-00117030-01
ACETYLTRYPTOPHAN    CDRD-00117996-01
ACETYLTRYPTOPHANAMIDE   CDRD-00118894-01

What I get:

$ sort -k1,1 -t $'\t' -f -n random.txt > wrong.txt
$ cat wrong.txt          
ACETAMIDE   CDRD-00116997-01
ACETAMINOPHEN    CDRD-00116365-01
ACETYL ISOALLOGAMBOGIC ACID     CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID  CDRD-00119007-01
ACETYLSALICYLIC ACID     CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID  CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE   CDRD-00117030-01
ACETYLTRYPTOPHANAMIDE    CDRD-00118894-01
ACETYLTRYPTOPHAN    CDRD-00117996-01
ACETYL TYROSINE ETHYL ESTER  CDRD-00118256-01

Note that ACETYL TYROSINE ETHYL ESTER should come right after ACETYL ISOGAMBOGIC ACID, and ACETYLTRYPTOPHAN should come before ACETYLTRYPTOPHANAMIDE.

This matters because join complains that ACETYL TYROSINE ETHYL ESTER is not sorted (and, once that is fixed, it complains about ACETYLTRYPTOPHAN):

The second file for join:

$ cat test_data.txt
Acetamide   0.904   0.146   0.134   -0.196
Acetyltryptophan    -0.558  -0.471  -0.13   -0.332

Result of join with wrong.txt:

$ join -a1 -1 1 -2 1 -t $'\t' -i test_data.txt wrong.txt 
Acetamide   0.904   0.146   0.134   -0.196  CDRD-00116997-01
Acetyltryptophan    -0.558  -0.471  -0.13   -0.332
join: wrong.txt:9: is not sorted: ACETYLTRYPTOPHAN  CDRD-00117996-01

Of course, join with correct.txt works:

$ join -a1 -1 1 -2 1 -t $'\t' -i test_data.txt correct.txt
Acetamide   0.904   0.146   0.134   -0.196  CDRD-00116997-01
Acetyltryptophan    -0.558  -0.471  -0.13   -0.332  CDRD-00117996-01

This sort call does not give me the desired output either:

$ sort -k1,1 -t $'\t' -f -V random.txt 
ACETAMIDE   CDRD-00116997-01
ACETAMINOPHEN   CDRD-00116365-01
ACETYLSALICYLIC ACID    CDRD-00117028-01
ACETYLSALICYLSALICYLIC ACID CDRD-00115640-01
ACETYLTHIOCHOLINE CHLORIDE  CDRD-00117030-01
ACETYLTRYPTOPHAN    CDRD-00117996-01
ACETYLTRYPTOPHANAMIDE   CDRD-00118894-01
ACETYL ISOALLOGAMBOGIC ACID CDRD-00118740-01
ACETYL ISOGAMBOGIC ACID CDRD-00119007-01
ACETYL TYROSINE ETHYL ESTER CDRD-00118256-01

How can I get sort to output what I want?

2 answers:

Answer 0: (score: 3)

Your locale is probably interfering with the sort order. Try:

LANG=C sort -k1,1 -t $'\t' -f random.txt

Credit: https://superuser.com/questions/625223/sort-tab-delimited-text-fields-involving-spaces
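
To see why the locale matters, here is a minimal sketch, assuming a POSIX shell and a sort that supports -f. It uses three rows from the question; the `sample` function and the `tab` variable are names made up for this illustration. It sets LC_ALL rather than LANG because LC_ALL takes precedence over both LC_COLLATE and LANG, so the result does not depend on what else is set in the environment.

```shell
tab=$(printf '\t')   # portable stand-in for bash's $'\t'

# Three rows from the question's random.txt, deliberately out of order.
sample() {
    printf 'ACETYLTRYPTOPHANAMIDE\tCDRD-00118894-01\n'
    printf 'ACETYL TYROSINE ETHYL ESTER\tCDRD-00118256-01\n'
    printf 'ACETYLTRYPTOPHAN\tCDRD-00117996-01\n'
}

# Force byte-order comparison on the first tab-separated field,
# folding lowercase to uppercase (-f) as in the original command.
sorted=$(sample | LC_ALL=C sort -t "$tab" -k1,1 -f)
printf '%s\n' "$sorted"
```

In the C locale the keys are compared byte by byte, and the space character (0x20) sorts before every letter, so the name containing spaces comes first. Many UTF-8 locales typically give spaces little or no weight when collating, which is consistent with the order seen in wrong.txt.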

Answer 1: (score: 1)

Remove -n (numeric sort) and it should work:

sort -k1,1 -t $'\t' random.txt
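
The effect of -n can be checked in isolation. The sketch below assumes a sort that supports -s (GNU and BSD sort both do); the variable names are illustrative only, and the two rows are taken from the question.

```shell
tab=$(printf '\t')   # portable stand-in for bash's $'\t'

input=$(printf 'ACETYLTRYPTOPHAN\tCDRD-00117996-01\nACETAMIDE\tCDRD-00116997-01')

# -n reads each key as a number; a key with no leading digits counts as 0,
# so every chemical name ties. -s (stable) turns off the whole-line
# tie-break, which makes the tie visible: the input order comes back unchanged.
numeric=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$tab" -k1,1 -n -s)

# Without -n the key is compared as text, and ACETAMIDE moves to the front.
textual=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$tab" -k1,1 -s)

printf 'numeric:\n%s\ntextual:\n%s\n' "$numeric" "$textual"
```

The question's command has no -s, so the ties fall through to a comparison of the whole line in the current locale. That is probably how the second column ended up influencing the relative order of ACETYLTRYPTOPHAN and ACETYLTRYPTOPHANAMIDE.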

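All of this assumes that the locale you are using has the sort order you need. You can change it to test (if needed, and provided the locale has been compiled on your system):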

LC_COLLATE=en_US.utf8

Which country/locale/language are you working in?

This works correctly:

LC_COLLATE=C sort -k1,1 -t $'\t' random.txt
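
join checks its inputs against the current collating order, so it seems safest to run sort and join under the same locale. Below is a sketch under that assumption, using GNU join (for -i), hypothetical file names in a temporary directory, and a shortened version of the question's data.

```shell
tab=$(printf '\t')   # portable stand-in for bash's $'\t'
dir=$(mktemp -d)     # scratch directory; the file names below are made up

{
    printf 'ACETYLTRYPTOPHAN\tCDRD-00117996-01\n'
    printf 'ACETYL TYROSINE ETHYL ESTER\tCDRD-00118256-01\n'
    printf 'ACETAMIDE\tCDRD-00116997-01\n'
} > "$dir/names.txt"

{
    printf 'Acetyltryptophan\t-0.558\n'
    printf 'Acetamide\t0.904\n'
} > "$dir/data.txt"

# Sort both inputs on field 1 in byte order, ignoring case (-f),
# because join is called with -i (case-insensitive) below.
LC_ALL=C sort -t "$tab" -k1,1 -f "$dir/names.txt" > "$dir/names.sorted"
LC_ALL=C sort -t "$tab" -k1,1 -f "$dir/data.txt"  > "$dir/data.sorted"

# Run join under the same locale so its order check agrees with sort.
joined=$(LC_ALL=C join -i -a 1 -t "$tab" "$dir/data.sorted" "$dir/names.sorted")
printf '%s\n' "$joined"

rm -rf "$dir"
```

If the two commands run under different locales, join may still report an unsorted file even though sort finished without errors.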