Remove HTML tags from a text string and keep the text

Asked: 2019-01-17 04:47:21

Tags: r

I have a text string like the following:

^style>           
  p,span,li{font-family:Arial;font-size:10.5pt;}        
^/style>  
^p>
  ^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>  
^p>
  Dear Adam,
^/p>  
^p>
  Thank you for your query, the Reference ID for your query is 
  ^strong>^u> 28600 ^/u>^/strong>
  .  We will respond to you within the next 1-2 business days.
^/p>  
^p>For further correspondence with us, kindly reply by maintaining the 
   Reference ID number of this case in the subject line of your e-mail.
^/p>  
^p>
  Regards
^/p>

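My goal is to remove all HTML tags and other junk values and return text like this:

Output:

Dear Adam,

Thank you for your query, the Reference ID for your query is We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Please note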

I tried extractHTMLStrip from the tm.plugin.webmining package, but it does not remove the junk values:

library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)

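1 answer:

Answer 0 (score: 0)

If the less-than signs in your string are broken like this (they show up as ^), you can use a regular expression.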

yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>.  We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
# reproducible example of your string

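# remove every tag: a literal ^, then the shortest run of characters up to the next >
yourstring <- gsub("\\^.*?>", "", yourstring)
# remove the CSS rule left behind by the style block
yourstring <- gsub("p,span.*?}", "", yourstring)
# strip leading and trailing whitespace
yourstring <- trimws(yourstring)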

This gives you:

> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is  28600 .  We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
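
The same idea can be wrapped in a function and applied to a whole column such as df$text from the question. This is only a sketch in base R: the name strip_caret_tags and the sample data frame are made up for illustration, and it assumes every tag starts with ^ and ends with >, and that style blocks are closed with ^/style>. Unlike the two gsub calls above, it removes the whole style block in one step and also collapses repeated whitespace and line breaks.

```r
# Hypothetical helper, not part of the original answer.
strip_caret_tags <- function(x) {
  # Drop whole ^style>...^/style> blocks; (?s) lets "." match newlines too
  x <- gsub("(?s)\\^style>.*?\\^/style>", "", x, perl = TRUE)
  # Drop every remaining tag: a caret, then anything up to the next ">"
  x <- gsub("\\^[^>]*>", "", x, perl = TRUE)
  # Collapse runs of whitespace (including newlines) and trim both ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Made-up sample data standing in for the question's df
df <- data.frame(
  text = c(
    "^style>\n  p,span,li{font-family:Arial;}\n^/style>\n^p>\n  Dear Adam,\n^/p>",
    "^p>Reference ID ^strong>^u> 28600 ^/u>^/strong>.^/p>"
  ),
  stringsAsFactors = FALSE
)

df$text1 <- strip_caret_tags(df$text)
print(df$text1)
```

Because gsub is vectorized, the function works on a single string or on a full column without a loop.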

To make it tidier, you could use the stringr and magrittr libraries.
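
A minimal sketch of what that could look like, assuming both packages are installed. The shortened sample string and the variable name cleaned are only for illustration; the two patterns are the same ones used in the gsub calls, with the closing brace escaped.

```r
library(stringr)
library(magrittr)

# Shortened sample string, for illustration only
yourstring <- '^style> p,span,li{ font-family:Arial; } ^/style> ^p>Dear Adam,^/p> ^p>Regards'

cleaned <- yourstring %>%
  str_remove_all("\\^.*?>") %>%       # remove tags (same pattern as above)
  str_remove_all("p,span.*?\\}") %>%  # remove the leftover CSS rule
  str_squish()                        # trim ends and collapse repeated spaces

print(cleaned)
```

str_squish() replaces the trimws() call and also reduces the doubled spaces left inside the text to single spaces.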