Question

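I am trying to extract codes like '<9f><98><82>' with R, using a regular expression to pull the emoji codes out of a string, but I can't get it to work because the string contains multiple <> pairs. Can anyone suggest how to do the extraction with a regex? I am currently using the separate function.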

For example: for the string 'Sooooo.. this happened <9f><92><8d> \r\n(I said yes)' I want to get '<9f><92><8d>'.

I tried the expression "<*>", like this:

column1 <- separate(TwitterData2,text,into = c("text", "Emojicode"), sep = "<*>")

but I got the result attached below (usernames are masked for privacy reasons).

Answer 1

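You may be looking for something like the following. I added a few extra cases to show how general the solution is. (<.*?>){3,} matches any run of three or more consecutive <.*?> groups, where the ? tells R to match non-greedily. It is important to set perl = T, otherwise {3,} will not work: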

s1 <- 'Sooooo.. this happened <9f><92><8d> \r\n(I said yes) '
regmatches(s1, gregexpr("(<.*?>){3,}", s1, perl = T))

# [[1]]
# [1] "<9f><92><8d>"


s2 <- 'Sooooo.. this happened <9f><92><8d><93> \r\n(I said yes) '
regmatches(s2, gregexpr("(<.*?>){3,}", s2, perl = T))

# [[1]]
# [1] "<9f><92><8d><93>"


s3 <- 'Sooooo.. this happened <9f><92><8d> \r\n(I said yes) <9f><92><8d>'
regmatches(s3, gregexpr("(<.*?>){3,}", s3, perl = T))

# [[1]]
# [1] "<9f><92><8d>" "<9f><92><8d>"
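
To see what the ? changes, here is a small sketch comparing the greedy and non-greedy forms on a string with two separated runs. The string and variable names are made up for illustration and are not from the original answer. With the greedy form, a single <.*> can stretch across the text between the runs, so both runs should come back as one match. The non-greedy form should return them separately.

```r
# Illustrative input: two runs of codes separated by ordinary text
s <- "first <9f><92><8d> then <9f><98><82>"

# Greedy: `.*` is allowed to run across the text between the two runs
greedy <- regmatches(s, gregexpr("(<.*>){3,}", s, perl = TRUE))[[1]]

# Non-greedy: each `<.*?>` stops at the nearest `>`
lazy <- regmatches(s, gregexpr("(<.*?>){3,}", s, perl = TRUE))[[1]]

print(greedy)
print(lazy)
```

Note that even the non-greedy form can stretch a <.*?> group over ordinary text when that is the only way to reach three groups, so a stricter character class inside the brackets is safer on messy input.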

Answer 2

How about this?

I ran this regex on the same test cases used in @gersht's answer.

library(stringr)

tststr <- "Sooooo.. this happened <9f><92><8d> \r\n(I said yes)"
str_extract_all(tststr, "(<[0-9a-f]{2}>)+")
# [[1]]
# [1] "<9f><92><8d>"

tststr <- "Sooooo.. this happened <9f><92><8d><93> \r\n(I said yes)"
str_extract_all(tststr, "(<[0-9a-f]{2}>)+")
# [[1]]
# [1] "<9f><92><8d><93>"


tststr <- "Sooooo.. this happened <9f><92><8d> \r\n(I said yes) <9f><92><8d>"
str_extract_all(tststr, "(<[0-9a-f]{2}>)+")
# [[1]]
# [1] "<9f><92><8d>" "<9f><92><8d>"
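
Since the question was about filling a column, the same pattern could be applied to a whole data frame along these lines. This is only a sketch: the tweets data frame is a hypothetical stand-in for TwitterData2, and only a text column is assumed. Rows without any code get an empty string, and multiple runs in one row are joined with a space.

```r
library(stringr)

# Hypothetical stand-in for the asker's TwitterData2
tweets <- data.frame(
  text = c(
    "Sooooo.. this happened <9f><92><8d> \r\n(I said yes)",
    "no emoji here",
    "two runs <9f><98><82> and <9f><92><8d>"
  ),
  stringsAsFactors = FALSE
)

pattern <- "(<[0-9a-f]{2}>)+"

# str_extract_all() returns a list with one character vector per row
matches <- str_extract_all(tweets$text, pattern)

# Collapse each row's matches into one string; no match gives ""
tweets$Emojicode <- vapply(matches, paste, character(1), collapse = " ")

# Optionally strip the codes from the original text column
tweets$text <- str_remove_all(tweets$text, pattern)

print(tweets$Emojicode)
```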
