Question

I run Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:

labelling	labeling
flavour	flavor
colour	color
organisations	organizations
végétales	v&#x00E9;g&#x00E9;tales
contrôlée	contr&#x00F4;l&#x00E9;e
"	&#x0022;

The separator between the two columns is a TAB (\t).

The dictionary file is encoded as UTF-8.

I want to replace the words and symbols in the first column with the words and HTML entities in the second column.

My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.

The sample text looks like this:

Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system

I run the following sed one-liner in a shell script (./3-script.sh):

sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt

Replacing the British English (en-GB) words with American English (en-US) words in 3-translation.txt succeeds.

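However, replacing ASCII symbols (such as the quotation mark) and UTF-8 words produces the following result:

vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)

If I put only the specific symbols (rather than whole words) in the dictionary, I get a result like this:

vé#x00E9;gé#x00E9;tales "#x0022cultivated"#x0022 contrô#x00F4;lé#x00E9;e

The ASCII quotation mark gets &#x0022; appended to it; it is not replaced.

Similarly, each UTF-8 symbol gets its HTML entity appended instead of being replaced by it.

The expected output looks like this:

v&#x00E9;g&#x00E9;tales &#x0022;cultivated&#x0022; contr&#x00F4;l&#x00E9;e

How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with the equivalent HTML entities defined in the dictionary file?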

Answer 1

I tried it: simply replacing every & with \& in 1-dictionary.txt solves your problem.
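As a minimal sketch of that fix, assuming GNU sed (as shipped with Cygwin): rather than editing the dictionary by hand, an extra expression can escape & while the script is generated. The miniature dictionary and source line below are made up for illustration, and the generated script goes to a temporary file instead of a process substitution so the sketch also runs in a plain sh.

```shell
# Hypothetical miniature inputs, written to a temp dir for illustration.
tmp=$(mktemp -d)
printf '%s\t%s\n' 'colour' 'color' '"' '&#x0022;' 'é' '&#x00E9;' > "$tmp/1-dictionary.txt"
printf '%s\n' 'colour "cultivated" végétales' > "$tmp/2-source.txt"

# Original generator: "&" reaches the replacement unescaped, so sed expands
# it to the matched text and the entity ends up appended.
sed -E 's_(.+)\t(.+)_s/\1/\2/g_' "$tmp/1-dictionary.txt" > "$tmp/broken.sed"
broken=$(sed -f "$tmp/broken.sed" "$tmp/2-source.txt")

# Fixed generator: first escape every "&", then build the s/from/to/g lines.
sed -E -e 's/&/\\&/g' -e 's_(.+)\t(.+)_s/\1/\2/g_' "$tmp/1-dictionary.txt" > "$tmp/fixed.sed"
fixed=$(sed -f "$tmp/fixed.sed" "$tmp/2-source.txt")

printf 'broken: %s\nfixed:  %s\n' "$broken" "$fixed"
rm -rf "$tmp"
```

In the one-liner form from the question, the same idea would read: sed -f <(sed -E -e 's/&/\\&/g' -e 's_(.+)\t(.+)_s/\1/\2/g_' 1-dictionary.txt) 2-source.txt > 3-translation.txt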

sed's s command treats the from part as a regex, so when you use it this way, watch out for regex metacharacters and prefix them with \ so that they are escaped.
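A small illustration of the from part being a regex, using made-up strings rather than entries from your dictionary: an unescaped dot matches any character, while an escaped dot matches only a literal dot.

```shell
# Unescaped: "." is a regex wildcard, so the pattern also matches "axb".
unescaped=$(printf '%s\n' 'a.b axb' | sed 's/a.b/X/g')

# Escaped: "\." matches only a literal dot, so "axb" is left alone.
escaped=$(printf '%s\n' 'a.b axb' | sed 's/a\.b/X/g')

printf '%s\n%s\n' "$unescaped" "$escaped"
```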

The to part also has special characters, mainly \ and &; prefix those with an extra \ so that they are escaped too.
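Putting both rules together, here is a rough sketch of a generator that escapes each column separately before building the s commands. It assumes GNU sed and basic (non -E) regex syntax for the generated script; make_sed_script and the file names are hypothetical, and the dictionary entries are invented to exercise the escaping.

```shell
# make_sed_script is a hypothetical helper name. It reads a TAB-separated
# dictionary and prints one s/from/to/g command per entry.
make_sed_script() {
    while IFS="$(printf '\t')" read -r from to; do
        # from part: escape BRE metacharacters and the "/" delimiter.
        from_esc=$(printf '%s\n' "$from" | sed 's|[][\\.*^$/]|\\&|g')
        # to part: escape "\", "&" and the "/" delimiter.
        to_esc=$(printf '%s\n' "$to" | sed 's|[\\&/]|\\&|g')
        printf 's/%s/%s/g\n' "$from_esc" "$to_esc"
    done < "$1"
}

# Usage with made-up entries: a quote mapped to an entity, and a pair that
# contains ".", "/" and "&".
tmp=$(mktemp -d)
printf '%s\t%s\n' '"' '&#x0022;' 'a.b' 'a/b&c' > "$tmp/dict.tsv"
make_sed_script "$tmp/dict.tsv" > "$tmp/script.sed"
result=$(printf '%s\n' 'a.b axb "q"' | sed -f "$tmp/script.sed")
printf '%s\n' "$result"
rm -rf "$tmp"
```

The escaping commands use | as their own delimiter so that / can sit inside the bracket expression without extra quoting. Spawning two sed processes per entry is slow for a large dictionary, so this is meant as an illustration of the escaping rules rather than a tuned tool.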

The links above point to sed documentation; for other sed versions, you can also check man sed.

How can sed replace UTF-8 characters with HTML entities?
