How can I extract some information from text

Asked: 2018-12-20 19:30:16

Tags: bash

I have a text file like this

sp|O15304|SIVA_HUMAN MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET IGPDGR
tr|A0A1B1L9R9|A0A1B1L9R9_BACTU MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL NKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM

I am trying to lowercase the third part of both records. I tried the following, but it does not work

awk '{ gsub($3, tolower($3)); print $1"\t"$2}'

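I am on a Mac. Is there another way?

5 answers:

Answer 0 (score: 1)

You are splitting on awk's default separator to get $1 and $2. So you then need to split $1 on "|" and lowercase the third part of $1?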

$ awk '{split($1,a,"|") ; print a[1] "|" a[2] "|" tolower(a[3]) "\t" $2 "\t" $3}' test.txt

sp|O15304|siva_human    MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET
tr|A0A1B1L9R9|a0a1b1l9r9_bactu  MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL
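A minimal sketch of the same idea that keeps every field on the line, in case that is useful. The function name lower_id and the shortened sample record are made up for illustration; only split() and tolower() come from the command above.

```shell
# Lowercase the third "|"-separated piece of field 1 and keep all other fields.
# lower_id is a hypothetical helper name, not part of the original answer.
lower_id() {
  awk 'BEGIN { OFS = "\t" }
  {
    split($1, a, "|")                        # a[1]="sp", a[2]="O15304", a[3]="SIVA_HUMAN"
    $1 = a[1] "|" a[2] "|" tolower(a[3])     # assigning $1 rebuilds the line with OFS
    print
  }'
}

# Shortened, made-up record in the same layout as the question's file.
printf '%s\n' 'sp|O15304|SIVA_HUMAN MPKRSC IGPDGR' | lower_id
# prints: sp|O15304|siva_human<TAB>MPKRSC<TAB>IGPDGR
```

Because $1 is reassigned, awk rejoins the fields with OFS, so the output is tab-separated regardless of how many fields follow.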

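Answer 1 (score: 1)

Use read with a variable that has been declared lowercase.

In all of these examples I print the sections wrapped in square brackets ([]) so you can see how the line was parsed, and I simply put spaces between them. You can change all of that. The important part is understanding what defines the separators, and getting the right piece into the variable that lowercases it.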

declare -l three
while IFS='|' read -r one two three
do echo "[$one] [$two] [$three]"
done < infile
[sp] [O15304] [siva_human mpkrscpfadvaplqlkvrvsqrelsrgvcaerysqevfektkrllflgaqayldhvwdegcavvhlpespkpgptgapraargqmligpdgrlirslgqaseadpsgvasiacsscvravdgkavcgqceralcgqcvrtcwgcgsvactlcglvdcsdmyekvlctscamfet igpdgr]
[tr] [A0A1B1L9R9] [a0a1b1l9r9_bactu mnkqlflaslketqksilsyacgaalylwlliwifpsmvsakglneliaampdsvkkivgmespiqnvmdflageyysllfiiiltifcvtvathliarhvdkgamayllatpvsrvqiaitqatvlilglliivsvtyvaglvgaewflqdnnlnkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewmknlslftlfrpkeiaegayniwpvsigliagalcifivaivvfkkrdlpl nkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewm]

If you only want the part after the pipes but before the space, and the format is consistent:

declare -l three
while IFS='| ' read -r one two three four
do echo "[$one] [$two] [$three] [$four]"
done < infile
[sp] [O15304] [siva_human] [MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET IGPDGR]
[tr] [A0A1B1L9R9] [a0a1b1l9r9_bactu] [MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL NKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWM]

If you only need the LAST piece after the spaces lowercased, the default separators are fine.

declare -l three
while read -r one two three
do echo "[$one] [$two] [$three]"
done < infile
[sp|O15304|SIVA_HUMAN] [MPKRSCPFADVAPLQLKVRVSQRELSRGVCAERYSQEVFEKTKRLLFLGAQAYLDHVWDEGCAVVHLPESPKPGPTGAPRAARGQMLIGPDGRLIRSLGQASEADPSGVASIACSSCVRAVDGKAVCGQCERALCGQCVRTCWGCGSVACTLCGLVDCSDMYEKVLCTSCAMFET] [igpdgr]
[tr|A0A1B1L9R9|A0A1B1L9R9_BACTU] [MNKQLFLASLKETQKSILSYACGAALYLWLLIWIFPSMVSAKGLNELIAAMPDSVKKIVGMESPIQNVMDFLAGEYYSLLFIIILTIFCVTVATHLIARHVDKGAMAYLLATPVSRVQIAITQATVLILGLLIIVSVTYVAGLVGAEWFLQDNNLNKELFLKINIVGGLIFLVVSAYSFFFSCICNDERKALSYSASLTILFFVLDMVGKLSDKLEWMKNLSLFTLFRPKEIAEGAYNIWPVSIGLIAGALCIFIVAIVVFKKRDLPL] [nkelflkinivggliflvvsaysfffscicnderkalsysasltilffvldmvgklsdklewm]
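One caveat for the Mac mentioned in the question: declare -l was added in bash 4.0, and the /bin/bash that ships with macOS is, as far as I know, still 3.2, so the loops above may fail there with an "invalid option" error. A sketch of the second variant that uses tr instead of declare -l follows; the function name lower_third and the shortened sample record are assumptions for illustration.

```shell
# Same parsing as the IFS='| ' loop above, but lowercasing with tr so that it
# also runs on bash 3.2. lower_third is a hypothetical helper name.
lower_third() {
  while IFS='| ' read -r one two three four; do
    # tr reads from the printf pipe, not from the loop's input.
    three=$(printf '%s' "$three" | tr '[:upper:]' '[:lower:]')
    printf '[%s] [%s] [%s] [%s]\n' "$one" "$two" "$three" "$four"
  done
}

# Shortened, made-up record in the same layout as the question's file.
printf '%s\n' 'sp|O15304|SIVA_HUMAN MPKRSC IGPDGR' | lower_third
# prints: [sp] [O15304] [siva_human] [MPKRSC IGPDGR]
```

Spawning tr once per line is slow on large files; for those, an awk-based approach is probably a better fit.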

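Answer 2 (score: 1)

So the question is how to correctly use the third field as a pattern for a substitution in the rest of the string, and how to send the output of join to an awk command. Note that gsub should be given a target. Without one it operates on the whole line, so if field 3 is, for example, a single character, it will also match and replace occurrences in $1.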

join df1.txt df2.txt | awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}'

An example, with and without a target:

ian@orca:~/tmp$ cat t
sp|O15304|SIVA_HUMAN FALALALALA A

ian@orca:~/tmp$ awk '{gsub($3, tolower($3)) ; print $1 "\t" $2}' t
sp|O15304|SIVa_HUMaN    FaLaLaLaLa

ian@orca:~/tmp$ awk '{gsub($3, tolower($3), $2) ; print $1 "\t" $2}' t
sp|O15304|SIVA_HUMAN    FaLaLaLaLa
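gsub treats its first argument as a regular expression and replaces every match. If the intent is to lowercase only the first literal occurrence of field 3 inside field 2, index() and substr() are one way to do that. This is a sketch under that assumption; the function name lower_match and the sample record are made up for illustration.

```shell
# Lowercase the first literal occurrence of field 3 inside field 2.
# lower_match is a hypothetical helper name, not part of the original answer.
lower_match() {
  awk '{
    pos = index($2, $3)                  # 0 when $3 does not occur in $2
    if (pos > 0) {
      len = length($3)
      $2 = substr($2, 1, pos - 1) tolower($3) substr($2, pos + len)
    }
    print $1 "\t" $2
  }'
}

# Shortened, made-up record: field 3 occurs once inside field 2.
printf '%s\n' 'sp|O15304|SIVA_HUMAN MLIGPDGRLI IGPDGR' | lower_match
# prints: sp|O15304|SIVA_HUMAN<TAB>MLigpdgrLI
```

With the single-character record from the example above, this version changes only the first A in field 2 rather than all of them, which may or may not be what you want.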

Answer 3 (score: 0)

 sed -rn 's/(.*\s.*\s)(.*)$/\1 \L\2 /p' tmp.txt

Source:

Explanation:

I do not know awk very well; this can most likely be done with awk too. sed works on each line independently, and:

's/    substitutes
(      a group
  .*     containing any characters of any amount
  \s     a whitespace
  .*     again some characters
  \s     again a whitespace
)      and stores that group as \1
(.*)   and puts all the remaining characters in group \2
$      until the end of the line
/      Substitute all of this with:
\1     The first group
       a space (remove it if you do not want it)
\L\2   The second group in lowercase
/p     and print that

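The -r flag enables extended regular expressions, which is what lets the capture groups be written as plain ( ). The -n flag tells sed not to print each line automatically, so only the lines printed by /p appear.

Tested on Cygwin. You may need the -E flag instead of -r on your OS, and you may need the POSIX [[:space:]] instead of \s to match whitespace.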
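One more portability note: \L is a GNU sed extension, so the command above will probably not lowercase anything with the sed that ships with macOS. A sketch of an awk equivalent for lowercasing the last whitespace-separated field, which should behave the same on macOS and Linux, follows. The function name lower_last and the sample record are made up for illustration.

```shell
# Lowercase the last whitespace-separated field of every line.
# lower_last is a hypothetical helper name, not part of the original answer.
lower_last() {
  awk '{ $NF = tolower($NF); print }'    # assigning $NF rejoins fields with single spaces
}

# Shortened, made-up record in the same layout as the question's file.
printf '%s\n' 'sp|O15304|SIVA_HUMAN MPKRSC IGPDGR' | lower_last
# prints: sp|O15304|SIVA_HUMAN MPKRSC igpdgr
```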

Answer 4 (score: -2)

Try something like this:

cat text.txt | cut -d"|" -f3
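On its own, the command above prints everything after the second "|" (identifier plus sequences) and does not change case. A sketch that extends it to return just the lowercased identifier, assuming that is the piece wanted, follows. The function name third_piece_lower and the sample record are made up for illustration.

```shell
# Take the text after the second "|", keep only what precedes the first space,
# and lowercase it. third_piece_lower is a hypothetical helper name.
third_piece_lower() {
  cut -d'|' -f3 | cut -d' ' -f1 | tr '[:upper:]' '[:lower:]'
}

# Shortened, made-up record in the same layout as the question's file.
printf '%s\n' 'sp|O15304|SIVA_HUMAN MPKRSC IGPDGR' | third_piece_lower
# prints: siva_human
```

This discards the rest of the line, so it only fits when the identifier is all you need.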