How do I remove pronunciation guides from text?

Time: 2017-07-03 05:55:09

Tags: text awk sed normalization wikipedia

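I am working with a large amount of text from Wikipedia, and I would like to remove the various pronunciation guides that the entries contain. For example, given the following entries: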

Sigmund Freud (/ˈfrɔɪd/ FROYD; German: [ˈziːkmʊnt ˈfʁɔʏt]; born Sigismund Schlomo Freud; 6 May 1856 – 23 September 1939) was an…
Plato (/ˈpleɪtoʊ/; Greek: Πλάτων Plátōn, pronounced [plá.tɔːn] in Classical Attic; 428/427 or 424/423 – 348/347 BC) was a…
Napoleon Bonaparte (/nəˈpoʊliən ˈboʊnəpɑːrt/; French: [napɔleɔ̃ bɔnapaʁt]; 15 August 1769 – 5 May 1821) was a…
Michael Faraday FRS (/ˈfæ.rəˌdeɪ/; 22 September 1791 – 25 August 1867) was an…
Martin Luther (/ˈluːθər/; German: [ˈmaɐ̯tiːn ˈlʊtɐ]; 10 November 1483 – 18 February 1546), O.S.A., was a…
Louis Pasteur (/ˈluːi pæˈstɜːr/, French: [lwi pastœʁ]; December 27, 1822 – September 28, 1895) was a…

Ideally, I would like to end up with the following:

Sigmund Freud (born Sigismund Schlomo Freud; 6 May 1856 – 23 September 1939) was an…
Plato (428/427 or 424/423 – 348/347 BC) was a…
Napoleon Bonaparte (15 August 1769 – 5 May 1821) was a…
Michael Faraday FRS (22 September 1791 – 25 August 1867) was an…
Martin Luther (10 November 1483 – 18 February 1546), O.S.A., was a…
Louis Pasteur (December 27, 1822 – September 28, 1895) was a…

Is there a programmatic way to do this?

1 answer:

Answer 0: (score: 2)

A sed solution:

sed 's|/[^/]*/[^,;]*[,;]\(.*\[[^][]*\][^;]*;\)* *||g' file

Output:

Sigmund Freud (born Sigismund Schlomo Freud; 6 May 1856 – 23 September 1939) was an…
Plato (428/427 or 424/423 – 348/347 BC) was a…
Napoleon Bonaparte (15 August 1769 – 5 May 1821) was a…
Michael Faraday FRS (22 September 1791 – 25 August 1867) was an…
Martin Luther (10 November 1483 – 18 February 1546), O.S.A., was a…
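Louis Pasteur (December 27, 1822 – September 28, 1895) was a…

  • /[^/]*/[^,;]*[,;] - matches the /.../ pronunciation section, together with any optional words that follow it ([^,;]*), up to and including the terminating , or ;

  • \(.*\[[^][]*\][^;]*;\)* - matches a [...] pronunciation section, together with any optional surrounding words (covered by .* and [^;]*), and ending with ;. The whole group is made optional by \(...\)*

  • the trailing space followed by * - consumes any spaces left after the removed part, so the remaining text starts right after the opening parenthesis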
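
For repeated use, the command above can be wrapped in a small shell function. The following is only a minimal sketch: the name strip_pron is a hypothetical helper that does not appear in the original answer, and the pattern is the answer's own, unchanged. It has only been checked against the sample lines from the question. Because .* is greedy, a line with further bracketed text followed by a semicolon later on could lose more than intended, so it is worth checking the result on a sample of the real data first.

```shell
#!/bin/sh
# strip_pron is a hypothetical helper name (not from the original answer).
# It applies the answer's sed expression to stdin, or to any files passed
# as arguments, and writes the cleaned text to stdout.
strip_pron() {
    sed 's|/[^/]*/[^,;]*[,;]\(.*\[[^][]*\][^;]*;\)* *||g' "$@"
}

# Example with one of the sample lines from the question.
printf '%s\n' 'Michael Faraday FRS (/ˈfæ.rəˌdeɪ/; 22 September 1791 – 25 August 1867) was an…' | strip_pron
# prints: Michael Faraday FRS (22 September 1791 – 25 August 1867) was an…
```

With the function defined, strip_pron file behaves like the original command, and strip_pron < file > cleaned writes the result to a new file without touching the input.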