Question

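I am trying to clean text from parliamentary protocols. Because the data come from PDF files, they contain footers with the legislative period and a page reference, for example: "18th legislative period page x of N". The total number of pages differs across the 600 protocols, so I cannot match one exact expression. Instead, I would like to use the gsub function to remove the beginning of the footer together with the n words that follow it.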

I have looked at many solutions to other questions that go in a similar direction, but I could not get any of them to work.

string <- "this is the first page. 18th legislative period page 1 of 44 
this is the second page. 18th legislative period page 2 of 44 and this is 
the third page"

gsub("18th legislative period page", "", string)

I would like the string to read:

"this is the first page. this is the second page. and this is the third page."

Edit: Thank you very much for your time and patience!

Answer 1

You can use

gsub("18th legislative period page \\d+ of \\d+", "", string)
# or, to also collapse the leftover whitespace (including the newline '\n') into single spaces
gsub('\\s{2,}', ' ', gsub("18th legislative period page \\d+ of \\d+", "", string))
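
If the words after the footer prefix are not always of the form "x of N", the same idea can be written as "prefix plus the next n whitespace-separated words", which is closer to what the question asks for. The following is only a sketch: `remove_footer` and its arguments are hypothetical names introduced here, and it assumes that each of the n words is separated by whitespace and that the footer text carries no trailing punctuation you want to keep.

```r
# Hypothetical helper (not part of base R): drop a fixed footer prefix
# plus the n whitespace-separated words that follow it.
remove_footer <- function(text, prefix, n) {
  # \Q...\E quotes the prefix literally; (?:\s+\S+){n} matches exactly
  # n "words", each preceded by at least one whitespace character.
  pattern <- sprintf("\\Q%s\\E(?:\\s+\\S+){%d}", prefix, n)
  cleaned <- gsub(pattern, "", text, perl = TRUE)
  # Collapse leftover whitespace runs (including newlines) into one space.
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

string <- "this is the first page. 18th legislative period page 1 of 44 
this is the second page. 18th legislative period page 2 of 44 and this is 
the third page"

result <- remove_footer(string, "18th legislative period page", 3)
print(result)
```

Since gsub is vectorized, the same call should also work on a character vector that holds all the protocols at once. Note that a word such as "44." would be removed together with its period, so the value of n and the pattern may need adjusting to the real footers.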

Replace a string and the following n words using gsub