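I'm trying to "clean up" some Wikipedia text that I downloaded for a project. My problem is that the text is full of "noise", such as data tables, that I'd like to remove.
Here is part of the text I want to parse: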
test =
South Midland \'arm\' and \'barb\' rhyming with \'form\' and \'orb.\' Unique
words in Alabama English include: redworm (earthworm), peckerwood
(woodpecker), snake doctor and snake feeder (dragonfly), tow sack (burlap
bag), plum peach (clingstone), French harp (harmonica), and dog irons
(andirons).<ref name="city-data.com"/>',
'',
'{|class="wikitable sortable" style="margin-left:1em; float:center"',
"|+ '''Top 10 Non-English Languages Spoken in Alabama'''",
'|-',
'! Language !! Percentage of population<br /><small>({{as of|2010|lc=y}})
</small><ref>{{cite web|url=http://www.city-data.com/states/Alabama-
Languages.html"|title=Alabama – Languages|work=city-data.com|accessdate=July
21, 2015}}</ref>',
'|-',
'| Spanish|| 2.2%',
'|-',
'| German || 0.4%',
'|-',
'| French (incl. Patois, Cajun) || 0.3%',
'|-',
'| Chinese, [[Vietnamese language|Vietnamese]], [[Korean language|Korean]],
[[Arabic language|Arabic]], [[African languages]], Japanese, and Italian
(tied)|| 0.1%',
'|}',
'',
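I want to remove the data tables from the text; they are delimited by {| and |}.
I did some research, tried using a regular expression, and came up with the following: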
re.sub(r'\{|(.*?)\|}', '', test)
But this seems to remove only the delimiters themselves, not everything between them.
Could someone help me out here? I'm still learning :)
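
For reference, here is a minimal sketch of one likely fix. In the pattern above the first | is unescaped, so it acts as alternation rather than a literal pipe, and . does not match newlines by default, so a match can never span a multi-line table. The sketch assumes the text is a single string (a list of lines could be joined with "\n".join(...) first) and that tables are not nested. The names sample, TABLE_RE, and strip_tables are made up for illustration.

```python
import re

# Hypothetical sample: a shortened version of the wikitext above, as one string.
sample = (
    "dog irons (andirons).\n"
    "\n"
    '{|class="wikitable sortable"\n'
    "|+ '''Top 10 Non-English Languages Spoken in Alabama'''\n"
    "|-\n"
    "| Spanish|| 2.2%\n"
    "|-\n"
    "| German || 0.4%\n"
    "|}\n"
    "\n"
    "Text after the table.\n"
)

# \{\| and \|\} match the literal delimiters.
# .*? is lazy, so each table is matched on its own instead of everything
# from the first {| to the last |}.
# re.DOTALL lets "." match newlines, so a match can span several lines.
TABLE_RE = re.compile(r"\{\|.*?\|\}", flags=re.DOTALL)


def strip_tables(text):
    """Remove every non-nested {| ... |} block from text."""
    return TABLE_RE.sub("", text)


cleaned = strip_tables(sample)
print(cleaned)
```

A regex like this stops at the first |} it finds, so a table nested inside another table would leave leftovers behind; for that case a dedicated wikitext parser is probably the safer choice.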