Given a tab-delimited file:
$ head train.txt
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' . `` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD , `` IN AT JJ NN IN AT NN , AT NN IN NNS CC AT NN IN DT NN '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' . AT NN VBD PPS DOD VB CS AP IN NP$ NN CC NN NNS `` BER JJ CC JJ CC RB JJ '' .
It recommended that Fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' . PPS VBD CS NP NNS VB `` TO HV DTS NNS VBN CC VBN IN AT NN IN VBG CC VBG PPO '' .
The grand jury commented on a number of other topics , among them the Atlanta and Fulton County purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' . AT JJ NN VBD IN AT NN IN AP NNS , IN PPO AT NP CC NP-TL NN-TL VBG NNS WDT PPS VBD `` BER QL VBN CC VB RB VBN NNS WDT VB IN AT JJT NN IN ABX NNS '' .
Merger proposed NN-HL VBN-HL
However , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' . WRB , AT NN VBD PPS VBZ `` DTS CD NNS MD BE VBN TO VB JJR NN CC VB AT NN IN NN '' .
The City Purchasing Department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' . AT NN-TL VBG-TL NN-TL , AT NN VBD , `` BEZ VBG IN VBN JJ NNS CS AT NN IN NN NNS NNS '' .
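Only the first (tab-delimited) column matters. I want to extract the list of unique words (including punctuation) from that column and write it to a file. Assume that words are separated by spaces, i.e.: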
$ head train.txt | cut -f1
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
`` Only a relative handful of such reports was received '' , the jury said , `` considering the widespread interest in the election , the number of voters and the size of this city '' .
The jury said it did find that many of Georgia's registration and election laws `` are outmoded or inadequate and often ambiguous '' .
It recommended that Fulton legislators act `` to have these laws studied and revised to the end of modernizing and improving them '' .
The grand jury commented on a number of other topics , among them the Atlanta and Fulton County purchasing departments which it said `` are well operated and follow generally accepted practices which inure to the best interest of both governments '' .
Merger proposed
However , the jury said it believes `` these two offices should be combined to achieve greater efficiency and reduce the cost of administration '' .
The City Purchasing Department , the jury said , `` is lacking in experienced clerical personnel as a result of city personnel policies '' .
$ head train.txt | cut -f2
AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN .
AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN .
AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP .
`` RB AT JJ NN IN JJ NNS BEDZ VBN '' , AT NN VBD , `` IN AT JJ NN IN AT NN , AT NN IN NNS CC AT NN IN DT NN '' .
AT NN VBD PPS DOD VB CS AP IN NP$ NN CC NN NNS `` BER JJ CC JJ CC RB JJ '' .
PPS VBD CS NP NNS VB `` TO HV DTS NNS VBN CC VBN IN AT NN IN VBG CC VBG PPO '' .
AT JJ NN VBD IN AT NN IN AP NNS , IN PPO AT NP CC NP-TL NN-TL VBG NNS WDT PPS VBD `` BER QL VBN CC VB RB VBN NNS WDT VB IN AT JJT NN IN ABX NNS '' .
NN-HL VBN-HL
WRB , AT NN VBD PPS VBZ `` DTS CD NNS MD BE VBN TO VB JJR NN CC VB AT NN IN NN '' .
AT NN-TL VBG-TL NN-TL , AT NN VBD , `` BEZ VBG IN VBN JJ NNS CS AT NN IN NN NNS NNS '' .
I can do it like this:
$ python
>>> fout = open('word.dict', 'w')
>>> fout.write('\n'.join({word for line in open('train.txt') for word in line.split('\t')[0].lower().split()}))
>>> fout.close()
>>> exit()
$ head word.dict
trenton
brevet
secondly
fig.
magnetic
doubts
monte
elisabeth
four
facilities
But is there a way to extract the same word list in shell / bash?
Answer 0 (score: 3)
Try this:
cut -f1 file | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort -u
cut -f1
extracts the first tab-delimited column.
tr -s '[:space:]' '\n'
replaces each run of whitespace with a single newline, which effectively produces a word list with one word per line.
tr '[:upper:]' '[:lower:]'
converts those lines to all lowercase.
sort -u
sorts the resulting word list and drops duplicates (-u).
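As a minimal sketch of how this pipeline behaves end to end, the following builds a tiny tab-separated sample and writes the unique word list to a file, as the question asks. The file names `sample.txt` and `word.dict` are assumptions for illustration, and `LC_ALL=C` is added only to make the sort order predictable; neither is part of the original answer.

```shell
# Hypothetical two-line sample: words in column 1, tags in column 2, separated by a tab.
printf 'The jury said .\tAT NN VBD .\nthe Jury commented .\tAT NN VBD .\n' > sample.txt

# Same pipeline, with the result redirected to a file.
# LC_ALL=C (an assumption, not in the answer) makes sort compare raw bytes.
cut -f1 sample.txt \
  | tr -s '[:space:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | LC_ALL=C sort -u > word.dict

cat word.dict
# .
# commented
# jury
# said
# the
```

Punctuation is kept as a word of its own because the split is on whitespace only.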
Answer 1 (score: 3)
I can't tell where the tabs are in the posted sample input, so this is untested, but it should do what you want:
awk '{sub(/\t.*/,""); for (i=1; i<=NF; i++) if (!seen[tolower($i)]++) print $i}' file
Or, if you want all of the output in lowercase:
awk '{sub(/\t.*/,""); $0=tolower($0); for (i=1; i<=NF; i++) if (!seen[$i]++) print $i}' file
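A small, hedged illustration of the lowercase variant, run against a made-up sample and redirected to a file (the names `sample.txt` and `word_awk.dict` are hypothetical):

```shell
# Hypothetical two-line sample: words in column 1, tags in column 2, separated by a tab.
printf 'The jury said .\tAT NN VBD .\nthe Jury commented .\tAT NN VBD .\n' > sample.txt

# Drop everything from the first tab onward, lowercase the line,
# then print each whitespace-separated word the first time it is seen.
awk '{sub(/\t.*/,""); $0=tolower($0); for (i=1; i<=NF; i++) if (!seen[$i]++) print $i}' sample.txt > word_awk.dict

cat word_awk.dict
# the
# jury
# said
# .
# commented
```

Unlike a `sort -u` pipeline, this keeps words in order of first appearance and needs no sorting pass; pipe the output through `sort` if a sorted list is wanted.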