How can awk read a dictionary and replace words in a file?

Time: 2019-01-20 20:28:31

Tags: awk

We have a source file ("source-A") that looks like this (any blue text you see comes from Stack Overflow's rendering, not from the text file):

The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences

Each sentence in "source-A" is on its own line and ends with a newline (\n).

We have a dictionary/conversion file ("converse-B") that looks like this:

aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes

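"converse-B" is a two-column, tab-delimited file. Each equivalence mapping (left-hand term <tab> right-hand term) is on its own line and is terminated by a newline (\n).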

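How can we read "converse-B" and replace terms in "source-A", so that each term found in column 1 of "converse-B" is replaced with the term in column 2, and then write the result to an output file ("output-C")?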

For example, "output-C" would look like this:

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.

The tricky part is the word "potato".

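If a "simple" awk solution cannot handle a singular term (potato) versus a plural term (potatoes), we will fall back to manual substitution for those. The awk solution can skip that use case.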

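In other words, the awk solution may stipulate that it only works for unambiguous words, or for terms made up of unambiguous words separated by spaces.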

An awk solution will get us 90% of the way there; we will do the remaining 10% by hand.

1 answer:

Answer 0 (score: 1)

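sed may be a better fit, since this is just phrase/word replacement. Note that if the same word appears in more than one phrase, the rules are applied first come, first served, so order the dictionary accordingly.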

$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences

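The sed statement inside the process substitution converts the dictionary entries into sed substitution expressions, and the main sed uses them to replace the content.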
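Since the question asked for awk, the same two steps (load the dictionary, then substitute in dictionary order) could be sketched in awk roughly as follows. This is a minimal sketch and not part of the original answer: the file names dict and content follow the command above, the temp directory and sample data are there only to make the example self-contained, and it assumes the dictionary terms contain no regex metacharacters and the replacements contain no & or backslash, because gsub treats its first argument as a regular expression.

```shell
#!/bin/sh
# Sketch only: build the sample files from the question in a temp directory.
tmp=$(mktemp -d)
printf '%s\t%s\n' \
    'aluminium' 'aluminum' \
    'analyse' 'analyze' \
    'white spirit' 'mineral spirits' \
    'stag night' 'bachelor party' \
    'savoury' 'savory' \
    'potato crisp' 'potato chip' \
    'mashed potato' 'mashed potatoes' > "$tmp/dict"
printf '%s\n' \
    'The container of white spirit was made of aluminium.' \
    'We will use an aromatic method to analyse properties of white spirit.' \
    'No one drank white spirit at stag night.' \
    'Many people think that a potato crisp is savoury, but some would rather eat mashed potato.' \
    > "$tmp/content"

awk -F'\t' '
    # First file (dict): store each pair, keeping the file order.
    NR == FNR { if (NF >= 2) { from[++n] = $1; to[n] = $2 }; next }
    # Second file (content): apply every pair in dictionary order, then print.
    { for (i = 1; i <= n; i++) gsub(from[i], to[i]); print }
' "$tmp/dict" "$tmp/content" > "$tmp/output"

cat "$tmp/output"
```

Like the sed version, this applies the rules first come, first served and does no word-boundary or case handling.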

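Note: a production-quality script should take letter case and word boundaries into account to avoid unwanted substring replacements; both are ignored here.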
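As one possible illustration of the word-boundary point, the generated rules could wrap each term in \b. This is a sketch that assumes GNU sed (both \t in the pattern and \b are GNU extensions); the temp directory, the sample sentence, and the file names plain.sed and bounded.sed are made up for the example.

```shell
#!/bin/sh
# Sketch only; assumes GNU sed.
tmp=$(mktemp -d)
printf '%s\t%s\n' 'mashed potato' 'mashed potatoes' > "$tmp/dict"
printf 'Some eat mashed potato; others eat mashed potatoes.\n' > "$tmp/content"

# Plain rules, as in the answer. Generates: s/mashed potato/mashed potatoes/g
sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$tmp/dict" > "$tmp/plain.sed"
# Bounded rules. Generates: s/\bmashed potato\b/mashed potatoes/g
sed -E 's_(.+)\t(.+)_s/\\b\1\\b/\2/g_' "$tmp/dict" > "$tmp/bounded.sed"

plain=$(sed -f "$tmp/plain.sed" "$tmp/content")
bounded=$(sed -f "$tmp/bounded.sed" "$tmp/content")

# The plain rules also rewrite the already-plural form ("potatoeses");
# the bounded rules leave it alone.
printf '%s\n%s\n' "$plain" "$bounded"
```

GNU sed also accepts an I flag on the s command for case-insensitive matching, but the replacement text is inserted as written, so the original capitalization would not be preserved.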