How to extract the strings before and after a slash

Time: 2017-06-08 11:39:45

Tags: r regex

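I am working on part-of-speech tagging in R. I have a string with its parts of speech as shown below (format: Word/POS_Tag). I want to extract the words into one column, the corresponding POS tags into a second column, and the frequency into a third column of the dataset. In addition, I need to remove any punctuation or special characters from the text before loading it into the dataset. I am not very familiar with regex. Could you help me with this?

(FYR, the sentence before applying POS tagging: "I like to play tennis tournament Next week, and I will participate on a play. I like playing guitar.")

Example: "I/PRP like/IN to/TO play/VB tennis/NN tournament/NN Next/JJ week/NN ,/, and/CC I/PRP will/MD participate/VB on/IN a/DT play/NN ./. I/PRP like/IN playing/VBG guitar/NN ./."

Note: in the example above, 'I' occurs 3 times and 'like' occurs 2 times. I need the words in the dataset together with their counts, as below.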

Word           POS_Tag    Count
I              PRP        3
like           IN         2
to             TO         1
play           VB         1
tennis         NN         1
tournament     NN         1
Next           JJ         1
week           NN         1
and            CC         1
will           MD         1
participate    VB         1
on             IN         1
a              DT         1
play           NN         1
playing        VBG        1
guitar         NN         1

Thanks.

1 answer:

Answer 0 (score: 2)

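We use str_extract_all (from stringr) with the pattern \\w+ to extract the words from the string, then create a data.table with two columns taken from the alternating elements of the vector ('v1'). Grouped by 'Word' and 'POS_Tag', we get the number of elements in each group (.N).

library(stringr)
library(data.table)
# \\w+ matches runs of word characters, so "/", spaces and the ",/," and "./." tokens are skipped
v1 <- str_extract_all(str1, "\\w+")[[1]]
# odd positions hold the words, even positions hold the tags
data.table(Word = v1[c(TRUE, FALSE)], POS_Tag = v1[c(FALSE, TRUE)])[
    , .(Count = .N), .(Word, POS_Tag)]
#           Word POS_Tag Count
# 1:           I     PRP     3
# 2:        like      IN     2
# 3:          to      TO     1
# 4:        play      VB     1
# 5:      tennis      NN     1
# 6:  tournament      NN     1
# 7:        Next      JJ     1
# 8:        week      NN     1
# 9:         and      CC     1
#10:        will      MD     1
#11: participate      VB     1
#12:          on      IN     1
#13:           a      DT     1
#14:        play      NN     1
#15:     playing     VBG     1
#16:      guitar      NN     1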
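The c(TRUE, FALSE) and c(FALSE, TRUE) subscripts work because R recycles a short logical index along the whole vector, so they pick the odd and even positions respectively. A minimal base R sketch of that step, using a shortened, made-up vector in place of the real 'v1':

```r
# Shortened stand-in for the vector returned by str_extract_all (illustrative only).
v1 <- c("I", "PRP", "like", "IN", "to", "TO")

# The two-element logical index is recycled: TRUE, FALSE, TRUE, FALSE, ...
words <- v1[c(TRUE, FALSE)]   # positions 1, 3, 5
tags  <- v1[c(FALSE, TRUE)]   # positions 2, 4, 6

# The same selection written with explicit positions.
words_by_position <- v1[seq(1, length(v1), by = 2)]

print(words)  # "I" "like" "to"
print(tags)   # "PRP" "IN" "TO"
```

This only lines up when every word is followed by exactly one tag, which is why the punctuation tokens have to be dropped by the pattern first.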

We can also do this with tidyverse, where str1 holds the tagged example string from the question:

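library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(string = str1) %>%
       # one row per token; the default separator drops "/", "," and spaces but keeps "."
       separate_rows(string) %>% 
       # label the rows alternately as Word and POS_Tag
       group_by(grp = rep(c("Word", "POS_Tag"), length.out = n())) %>% 
       mutate(i1 = row_number()) %>%
       spread(grp, string) %>% select(-i1) %>% 
       count(Word, POS_Tag) %>%
       # remove the "./." sentence terminators
       filter(Word != ".")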
# A tibble: 16 x 3
#          Word POS_Tag     n
#         <chr>   <chr> <int>
# 1           a      DT     1
# 2         and      CC     1
# 3      guitar      NN     1
# 4           I     PRP     3
# 5        like      IN     2
# 6        Next      JJ     1
# 7          on      IN     1
# 8 participate      VB     1
# 9        play      NN     1
#10        play      VB     1
#11     playing     VBG     1
#12      tennis      NN     1
#13          to      TO     1
#14  tournament      NN     1
#15        week      NN     1
#16        will      MD     1
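
Both approaches above rely on every word and tag being made of plain word characters. As a rough alternative that is not part of the original answer, the sketch below uses only base R and splits each token at its last slash, so a hyphenated word such as "well-known/JJ" would stay in one piece. The variable names (tokens, word, tag, keep, pairs, counts) are illustrative, and the rule "keep tokens whose word contains a letter or digit" is an assumption about what counts as punctuation.

```r
# Tagged string copied from the question.
str1 <- "I/PRP like/IN to/TO play/VB tennis/NN tournament/NN Next/JJ week/NN ,/, and/CC I/PRP will/MD participate/VB on/IN a/DT play/NN ./. I/PRP like/IN playing/VBG guitar/NN ./."

# Split into "word/tag" tokens on whitespace.
tokens <- strsplit(str1, "\\s+")[[1]]

# Everything before the last slash is the word, everything after it is the tag.
word <- sub("/[^/]*$", "", tokens)
tag  <- sub("^.*/", "", tokens)

# Assumption: a token is punctuation if its word part has no letter or digit.
keep  <- grepl("[[:alnum:]]", word)
pairs <- data.frame(Word = word[keep], POS_Tag = tag[keep],
                    stringsAsFactors = FALSE)

# Count each distinct (Word, POS_Tag) pair by summing a column of ones.
counts <- aggregate(Count ~ Word + POS_Tag,
                    data = transform(pairs, Count = 1L), FUN = sum)
print(counts)
```

The rows come back sorted by the grouping columns rather than in order of first appearance, so the layout differs from the table in the question even though the counts are the same.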