Question

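I am working with an email corpus and trying to replace every HTML tag in it with the string '<html>'. How can I replace all HTML tags, using the fact that each one starts with '<' and ends with '>'?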

Example:

<html>
  <body>
  This is some random text.
 <p>This is some text in a paragraph.</p>
</body>
</html>

should be translated to:

<html>
  <html>
  This is some random text.
    <html>This is some text in a paragraph.<html>
  <html>
<html>

Thanks.

Answer 1

You should use the power of regular expressions with gsub. If you simply want to replace any <markup_name> with <html>, then gsub("<[^>]+>", "<html>", email_text) will do it.

The trick is that [^>]+ extends the match (+) only up to the first > ([^>] matches any character that is not >).
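
As a minimal sketch of that pattern in use: the variable name email_text comes from the answer, while the sample string (including the attribute-bearing <a> tag) is made up here for illustration.

```r
# Hypothetical sample input; the <a href=...> tag is included only to show
# that attributes inside a tag do not affect the match.
email_text <- "<html>\n<body>\nSome text.\n<p>A <a href=\"x.html\">link</a>.</p>\n</body>\n</html>"

# "<", then one or more characters that are not ">", then the closing ">"
replaced <- gsub("<[^>]+>", "<html>", email_text)
cat(replaced, "\n")
```

Because the character class excludes >, each match stops at the first closing bracket, so adjacent tags on the same line are replaced one by one rather than swallowed as a single match.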

Answer 2

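Here is another approach, offered only for completeness, since it is less general than @Math's solution, which I consider superior. One could also use the range quantifier {n,m}. It probably has a number of shortcomings. It also brings to mind a famous SO answer: RegEx match open tags except XHTML self-contained tags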

 dat <- "<html>
   <body>
   This is some random text.
  <p>This is some text in a paragraph.</p>
 </body>
 </html>"

 gsub("<.{1,5}>", "<html>", dat)
#[1] "<html>\n  <html>\n  This is some random text.\n <html>This is some text in a paragraph.<html>\n<html>\n<html>"

> cat( gsub("<.{1,5}>", "<html>", dat) )
<html>
  <html>
  This is some random text.
 <html>This is some text in a paragraph.<html>
<html>
<html>
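
One shortcoming of the range quantifier can be sketched as follows. The tag strings below are invented for illustration: any tag with more than five characters between the brackets is left untouched, whereas the negated character class from Answer 1 handles tags of any length.

```r
# Hypothetical tags of increasing length
tags <- c("<p>", "</body>", "</table>", "<a href=\"x.html\">")

# Range-quantified pattern: only 1 to 5 characters allowed between "<" and ">"
ranged <- gsub("<.{1,5}>", "<html>", tags)
print(ranged)

# Negated character class: any number of non-">" characters
general <- gsub("<[^>]+>", "<html>", tags)
print(general)
```

Here "</table>" and the <a> tag survive the first call unchanged, because they have six and fifteen characters between the brackets respectively.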

Replace all words that begin and end with a certain character in R
