I am trying to remove certain words from a data frame:
name age words
James 34 hello, my name is James.
John 30 hello, my name is John. Here is my favourite website https://stackoverflow.com
Jim 27 Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>
df<-structure(list(name = structure(c(1L, 3L, 2L), .Label = c("James",
"Jim", "John"), class = "factor"), age = c(34L, 30L, 27L), message = structure(1:3, .Label = c("hello, my name is James. ",
"hello, my name is John. Here is my favourite website https://stackoverflow.com",
"Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
), class = "factor")), .Names = c("name", "age", "message"), class = "data.frame", row.names = c(NA,
-3L))
I am trying to remove every word that contains `http` or `filter`.
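My idea is to iterate over each row, split the string on spaces, and check whether each word contains `http` or `<filter>` (or some other term). If it does, I want to replace that word with a space.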
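For reference, the split-and-check idea described above could be sketched in base R roughly as follows. This is only an illustration: `drop_words` is a hypothetical helper name, the pattern `http|<filter>|www\\.` is an assumed example, and matching words are dropped and the remaining words rejoined with single spaces (rather than replaced by a space), so the original spacing is not preserved.

```r
# Hypothetical helper: drop every whitespace-separated word that matches `pattern`.
drop_words <- function(x, pattern) {
  # One character vector of words per input string
  words <- strsplit(as.character(x), "\\s+")
  vapply(words, function(w) {
    # Keep only the words that do NOT match, then glue them back together
    paste(w[!grepl(pattern, w)], collapse = " ")
  }, character(1))
}

msgs <- c(
  "hello, my name is James. ",
  "hello, my name is John. Here is my favourite website https://stackoverflow.com",
  "Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
)

# Assumed example pattern: any word containing http, <filter> or www.
cleaned <- drop_words(msgs, "http|<filter>|www\\.")
print(cleaned)
```

Because the strings are split and rejoined, trailing whitespace (as in the first message) is lost; a single `gsub` call avoids that side effect.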
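There are a load of questions about removing words that exactly match another word or a list of words, but I could not find much on removing words that meet some condition (for example, words containing `http` or `www.`).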
I have tried `gsub`, `!grepl` and `tm_map` approaches (e.g. this), but I cannot get any of them to produce my expected output:
name age words
James 34 hello, my name is James.
John 30 hello, my name is John. Here is my favourite website
Jim 27 Hi! I'm another person whose name begins with a J! Here is something that should be filtered out:
Answer 0 (score: 2)
We can use `gsub`:
gsub("\\s(https:\\S+|<filter>)", "", df$message)
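A minimal sketch of applying this call to the data frame from the question, assuming the `message` column is stored as plain strings (the data frame is rebuilt here with `stringsAsFactors = FALSE`). Note that the pattern requires a whitespace character in front of the match, so a URL at the very start of a message would not be removed. The `https?` variant at the end is an assumption added here to also cover plain `http://` links; it is not part of the answer.

```r
# Rebuild the question's data with message as character, not factor
df <- data.frame(
  name = c("James", "John", "Jim"),
  age = c(34L, 30L, 27L),
  message = c(
    "hello, my name is James. ",
    "hello, my name is John. Here is my favourite website https://stackoverflow.com",
    "Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: one whitespace character followed by either
# an https: URL (up to the next whitespace) or the literal <filter>
df$message <- gsub("\\s(https:\\S+|<filter>)", "", df$message)
print(df$message)

# Assumed variant: `https?:` also matches plain http:// links
variant <- gsub("\\s(https?:\\S+|<filter>)", "", "see http://example.com now")
print(variant)
```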
Answer 1 (score: 2)
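To remove any non-whitespace chunk that contains `http` or `<filter>`, use `gsub` with the following PCRE regex (add the `perl=TRUE` argument):
(?:\s+|^)\S*(?<!\w)(?:https?|<filter>)(?!\w)\S*
See the regex demo.
Details
- `(?:\s+|^)` - 1+ whitespaces or the start of the string
- `\S*` - as many non-whitespace characters as possible
- `(?<!\w)` - no word character is allowed immediately to the left of the current position
- `(?:https?|<filter>)` - `http`, `https` or `<filter>`
- `(?!\w)` - no word character is allowed immediately to the right of the current position (after the word matched by the alternation group)
- `\S*` - as many non-whitespace characters as possible.
Result: the URL in the second message and the trailing `<filter>` in the third are removed, together with the whitespace before them; the first message is unchanged.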
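As a rough, self-contained sketch of how the PCRE pattern from this answer might be passed to `gsub` in R: the backslashes are doubled inside the R string literal, and `perl = TRUE` is needed for the lookbehind and lookahead. The `msgs` vector copies the messages from the question, and the `"the httpd daemon"` string is an assumed extra input added only to show the effect of the word-boundary checks.

```r
msgs <- c(
  "hello, my name is James. ",
  "hello, my name is John. Here is my favourite website https://stackoverflow.com",
  "Hi! I'm another person whose name begins with a J! Here is something that should be filtered out: <filter>"
)

# Pattern from the answer, with backslashes escaped for an R string
pattern <- "(?:\\s+|^)\\S*(?<!\\w)(?:https?|<filter>)(?!\\w)\\S*"

# perl = TRUE switches gsub to PCRE, which supports (?<!...) and (?!...)
cleaned <- gsub(pattern, "", msgs, perl = TRUE)
print(cleaned)

# Because of (?<!\w) and (?!\w), "http" embedded in a longer word is not
# treated as a match, so this string is left untouched
unchanged <- gsub(pattern, "", "the httpd daemon", perl = TRUE)
print(unchanged)
```

To filter additional terms, extra alternatives can be added inside the `(?:https?|<filter>)` group.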