Removing an image tag from a string in R

Time: 2016-02-01 17:46:34

Tags: regex r

This is part of a string I got after running xpathSApply() on an XML file in R.

test <-
"Nothing screams the holidays quite like a honey baked ham from the infamous < a href=\"http://www.honeybaked.com/\">Honey Baked Ham Co< /a>. This molasses-y monstrous brown sugar encrusted pork dream come true was a staple at my family Christmas dinner table. Did you know that this artery clogging bit of Americana makes plenty of trimmings to pair with that big ole ham? HBH Co. is now serving up some killer casseroles to pair with that sugar-soaked pig-pile you’ve had good and bad dreams about.\n\nNow I know what you’re thinking, what does honeyed ham or a bunch of beautiful casseroles have to do with wine? Quiet thy fluttering heart… I give you the FIVE wine pairings for the perfect Honey Baked Ham Co. Holiday feast!\n\n&nbsp;\n\n<strong>1)</strong>"

My boss asked me to count the number of words in this string, but I need to remove the image first.

I am trying to remove this image from the string:

< a href=\"http://www.honeybaked.com/\">Honey Baked Ham Co < /a>

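I am new to regular expressions. My first attempt was unlist(strsplit(test, split = " ")), then using grep to find the indices of "< a" and "a>", and then removing everything between those two indices. Is there a more efficient way to do this?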

1 answer:

Answer 0: (score: 0)

First, remove everything from "<" through "/a>", inclusive, with sub:

cleared <- sub("<.*/a>", "", test)
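Note that ".*" is greedy, so the pattern above spans from the first "<" to the last "/a>" in the string. That is fine for the test string, which contains a single link, but it would also swallow any text sitting between two links. A possible variant, sketched here on a made-up sample (sample_text is hypothetical and not from the question), uses a lazy quantifier with perl = TRUE and gsub so that each link is removed separately:

```r
# Hypothetical sample with two links; not taken from the question.
sample_text <- "Ham from < a href=\"http://www.honeybaked.com/\">Honey Baked Ham Co< /a> and wine from <a href=\"#\">a shop</a> today."

# Greedy pattern from the answer: removes everything from the
# first "<" to the last "/a>", including the text between the links.
greedy <- sub("<.*/a>", "", sample_text)

# Lazy variant: "<\\s*a\\b" matches an opening anchor tag (with or
# without a space after "<"), ".*?" stops at the nearest closing tag,
# and gsub() repeats the removal for every link.
lazy <- gsub("<\\s*a\\b.*?<\\s*/a\\s*>", "", sample_text, perl = TRUE)

print(greedy)
print(lazy)
```

With perl = TRUE, "." does not match a newline by default, so a link that spans several lines would need the "(?s)" flag at the start of the pattern.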

Then, simply count the words:

wordsCount <- length(unlist(strsplit(cleared," ")))
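One caveat: splitting on a single space treats words separated only by "\n\n" as one token, and the test string contains such breaks. A minimal sketch of a more tolerant count, assuming that any run of whitespace separates words (count_words and cleared_example are hypothetical names, not part of the original answer):

```r
# Hypothetical helper: count whitespace-separated tokens in a string.
count_words <- function(x) {
  # Trim leading/trailing whitespace, then split on runs of
  # spaces, tabs, or newlines.
  tokens <- unlist(strsplit(trimws(x), "\\s+"))
  # Ignore any empty tokens.
  sum(nzchar(tokens))
}

# Made-up example with a paragraph break between "heart" and "I".
cleared_example <- "Quiet thy fluttering heart\n\nI give you the FIVE wine pairings"

# Single-space split: "heart\n\nI" is counted as one token.
naive <- length(unlist(strsplit(cleared_example, " ")))

# Whitespace-run split: "heart" and "I" are counted separately.
robust <- count_words(cleared_example)

print(c(naive = naive, robust = robust))
```

Leftover markup such as "&nbsp;" or "<strong>1)</strong>" would still be counted as tokens by either approach, so it may need to be stripped first if an exact word count matters.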