Question

I have a text string that looks like this:

^style>           
  p,span,li{font-family:Arial;font-size:10.5pt;}        
^/style>  
^p>
  ^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>  
^p>
  Dear Adam,
^/p>  
^p>
  Thank you for your query, the Reference ID for your query is 
  ^strong>^u> 28600 ^/u>^/strong>
  .&nbsp; We will respond to you within the next 1-2 business days.
^/p>  
^p>For further correspondence with us, kindly reply by maintaining the 
   Reference ID number of this case in the subject line of your e-mail.
^/p>  
^p>
  Regards
^/p>

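My goal is to strip all the HTML tags and other junk values and get back text like this:

Output:

Dear Adam,

Thank you for your query, the Reference ID for your query is 28600. We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards

I tried extractHTMLStrip from tm.plugin.webmining, but it could not remove the junk values.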

library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)

Answer 1

If the less-than signs in your string are broken (here they show up as ^), you can use regular expressions.

yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>.  We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
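# (the assignment above is a reproducible example of your string)

# drop every broken tag: a literal ^ up to the next >
yourstring <- gsub("\\^.*?>", "", yourstring)
# drop the CSS rule left behind by the style block
yourstring <- gsub("p,span.*?}", "", yourstring)
# trim leading and trailing whitespace
yourstring <- trimws(yourstring)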

This gives you:

> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is  28600 .  We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
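
Since the question works on a data frame column (df$text), the same idea can be wrapped in a vectorized helper. This is only a sketch in base R: the function name strip_caret_tags, the sample data, and the extra steps (removing the whole style block, replacing &nbsp;, collapsing repeated spaces) are assumptions added for illustration, not part of the original answer.

```r
# Hypothetical helper: strip "^tag>" style markup from a character vector.
strip_caret_tags <- function(x) {
  # remove the whole ^style> ... ^/style> block, even across line breaks
  x <- gsub("(?s)\\^style>.*?\\^/style>", "", x, perl = TRUE)
  # remove every remaining broken tag: ^ followed by anything up to >
  x <- gsub("\\^[^>]*>", "", x)
  # turn the &nbsp; entity into a plain space
  x <- gsub("&nbsp;", " ", x, fixed = TRUE)
  # collapse runs of whitespace, then trim both ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Illustrative data, shortened from the string in the question
df <- data.frame(
  text = c(
    "^style> p{font-size:10pt;} ^/style> ^p>Dear Adam,^/p> ^p>ID is ^strong> 28600 ^/strong>.&nbsp; Regards^/p>",
    "^p>Hello^/p>"
  ),
  stringsAsFactors = FALSE
)

df$text1 <- strip_caret_tags(df$text)
print(df$text1)
```

Because gsub and trimws are vectorized, no loop over rows is needed.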

To make it prettier, you could use the stringr and magrittr libraries.
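
A piped version might look roughly like the following. It assumes stringr 1.3.0 or later (for str_remove_all and str_squish); the shortened sample string and the name clean_text are made up for illustration.

```r
library(stringr)
library(magrittr)

# Shortened, illustrative version of the string from the question
raw <- "^style> p,span,li{font-family:Arial;} ^/style> ^p>Dear Adam,^/p> ^p>Your ID is ^strong>^u> 28600 ^/u>^/strong>.&nbsp; Regards^/p>"

clean_text <- raw %>%
  str_remove_all("\\^.*?>") %>%              # broken tags: ^ up to the next >
  str_remove_all("p,span.*?\\}") %>%         # leftover CSS rule
  str_replace_all(fixed("&nbsp;"), " ") %>%  # non-breaking space entity
  str_squish()                               # trim and collapse whitespace

print(clean_text)
```

str_squish does the job of trimws and also reduces repeated inner spaces to one, which removes the double spaces left where tags used to be.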
