I have a text string that looks like this:
^style>
p,span,li{font-family:Arial;font-size:10.5pt;}
^/style>
^p>
^img src="https://app.keysurvey.com/" alt="image" width="462" />
^/p>
^p>
Dear Adam,
^/p>
^p>
Thank you for your query, the Reference ID for your query is
^strong>^u> 28600 ^/u>^/strong>
. We will respond to you within the next 1-2 business days.
^/p>
^p>For further correspondence with us, kindly reply by maintaining the
Reference ID number of this case in the subject line of your e-mail.
^/p>
^p>
Regards
^/p>
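My goal is to strip all the HTML tags and other junk values and return text like the following:
Output:
Dear Adam,
Thank you for your query, the Reference ID for your query is 28600. We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards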
I tried extractHTMLStrip from tm.plugin.webmining, but it does not remove the junk values:
library(tm.plugin.webmining)
df$text1 <- extractHTMLStrip(df$text)
Answer 0 (score: 0)
If your string has a caret (^) in place of the less-than sign (<), you can use regular expressions.
yourstring <- '^style> p,span,li{ font-family:Arial; font-size:10.5pt; } ^/style> ^p>^img src="https://app.keysurvey.com/" alt="image" width="462" />^/p> ^p>Dear Adam,^/p> ^p>Thank you for your query, the Reference ID for your query is ^strong>^u> 28600 ^/u>^/strong>. We will respond to you within the next 1-2 business days.^/p> ^p>For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail.^/p> ^p>Regards'
# reproducible example of your string
yourstring <- gsub("\\^.*?>", "", yourstring)
yourstring <- gsub("p,span.*?}", "", yourstring)
yourstring <- trimws(yourstring)
This gives you:
> yourstring
[1] "Dear Adam, Thank you for your query, the Reference ID for your query is 28600 . We will respond to you within the next 1-2 business days. For further correspondence with us, kindly reply by maintaining the Reference ID number of this case in the subject line of your e-mail. Regards"
To make it look nicer, you can use the stringr and magrittr libraries.
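A minimal sketch of what a stringr and magrittr version might look like, assuming both packages are installed. The shortened input string `raw` and the variable name `clean` are made up for illustration, and the regular expressions are one possible choice rather than the answer's exact patterns. Unlike the gsub() calls above, this variant removes the whole style block in one step, collapses repeated whitespace, and drops the stray space left before punctuation.

```r
library(stringr)
library(magrittr)

# Hypothetical, shortened sample using the same caret-mangled markup
raw <- '^style> p,span,li{ font-family:Arial; } ^/style> ^p>Dear Adam,^/p> ^p>Reference ID is ^strong>^u> 28600 ^/u>^/strong>. Regards^/p>'

clean <- raw %>%
  str_replace_all("\\^style>.*?\\^/style>", " ") %>%  # drop the style block and its CSS
  str_replace_all("\\^[^>]*>", " ") %>%               # drop the remaining caret-style tags
  str_squish() %>%                                    # collapse whitespace runs and trim the ends
  str_replace_all(" ([.,])", "\\1")                   # remove the space left before . or ,

clean
# [1] "Dear Adam, Reference ID is 28600. Regards"
```

For a data frame column such as df$text from the question, the same chain should work unchanged, since the stringr functions used here are vectorised over their input.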