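I am working on part-of-speech tagging in R. I have a string with its parts of speech as shown below (format: Word/POS_Tag). I want to extract the words into one column, the corresponding POS tags into another column, and the frequency into a third column of the dataset. In addition, I need to remove any punctuation or special characters from the text before loading it into the dataset. I am not very familiar with RegEx. Could you help me with this?
(FYR, the sentence before applying POS tagging: "I like to play tennis tournament Next week, and I will participate on a play. I like playing guitar.")
Example: "I/PRP like/IN to/TO play/VB tennis/NN tournament/NN Next/JJ week/NN ,/, and/CC I/PRP will/MD participate/VB on/IN a/DT play/NN ./. I/PRP like/IN playing/VBG guitar/NN ./."
Note: in the example above, 'I' occurs 3 times and 'like' occurs 2 times. I need each word, its tag, and its count in the dataset, as below.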
Word POS_Tag Count
I PRP 3
like IN 2
to TO 1
play VB 1
tennis NN 1
tournament NN 1
Next JJ 1
week NN 1
and CC 1
will MD 1
participate VB 1
on IN 1
a DT 1
play NN 1
playing VBG 1
guitar NN 1
Thanks.
Answer 0 (score: 2)
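We extract the words from the string with the regular expression \\w+ (using str_extract_all from stringr), then create a data.table with two columns taken from the alternating elements of the vector ('v1'). Grouped by 'Word' and 'POS_Tag', we get the number of elements (.N). Because \\w+ matches only letters, digits, and underscores, the punctuation tokens ",/," and "./." are dropped at the extraction step.
library(stringr)
library(data.table)
# str1 is the tagged example string from the question
str1 <- "I/PRP like/IN to/TO play/VB tennis/NN tournament/NN Next/JJ week/NN ,/, and/CC I/PRP will/MD participate/VB on/IN a/DT play/NN ./. I/PRP like/IN playing/VBG guitar/NN ./."
# extract runs of word characters; the result alternates word, tag, word, tag, ...
v1 <- str_extract_all(str1, "\\w+")[[1]]
# odd positions are words, even positions are tags; .N counts each Word/POS_Tag pair
data.table(Word = v1[c(TRUE, FALSE)], POS_Tag = v1[c(FALSE, TRUE)])[
  , .(Count = .N), .(Word, POS_Tag)]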
# Word POS_Tag Count
# 1: I PRP 3
# 2: like IN 2
# 3: to TO 1
# 4: play VB 1
# 5: tennis NN 1
# 6: tournament NN 1
# 7: Next JJ 1
# 8: week NN 1
# 9: and CC 1
#10: will MD 1
#11: participate VB 1
#12: on IN 1
#13: a DT 1
#14: play NN 1
#15: playing VBG 1
#16: guitar NN 1
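The c(TRUE, FALSE) indexing in the data.table code relies on R recycling a short logical index along a longer vector. A minimal sketch of that step on its own, using a made-up three-pair token vector (the names tokens, words, and tags are illustrative and not from the answer):

```r
# Toy vector of alternating word/tag tokens (hypothetical, for illustration only)
tokens <- c("I", "PRP", "like", "IN", "guitar", "NN")

# The pairing only makes sense when every word has a tag, i.e. an even length
stopifnot(length(tokens) %% 2 == 0)

# c(TRUE, FALSE) is recycled along the vector: keeps positions 1, 3, 5, ...
words <- tokens[c(TRUE, FALSE)]
# c(FALSE, TRUE) is recycled the same way: keeps positions 2, 4, 6, ...
tags <- tokens[c(FALSE, TRUE)]

print(data.frame(Word = words, POS_Tag = tags))
```

If the extraction ever yields an odd number of tokens (for example, a tag that contains a non-word character), the two columns would drift out of step, so the length check is a cheap safeguard.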
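We can also use tidyverse:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
tibble(string = str1) %>%
  # the default separator splits on anything that is not alphanumeric or a dot
  separate_rows(string) %>%
  # label the rows alternately as Word and POS_Tag
  group_by(grp = rep(c("Word", "POS_Tag"), length.out = n())) %>%
  mutate(i1 = row_number()) %>%
  spread(grp, string) %>% select(-i1) %>%
  count(Word, POS_Tag) %>%
  # drop the "./." token, which survives the split as Word "."
  filter(Word != ".")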
# A tibble: 16 x 3
# Word POS_Tag n
# <chr> <chr> <int>
# 1 a DT 1
# 2 and CC 1
# 3 guitar NN 1
# 4 I PRP 3
# 5 like IN 2
# 6 Next JJ 1
# 7 on IN 1
# 8 participate VB 1
# 9 play NN 1
#10 play VB 1
#11 playing VBG 1
#12 tennis NN 1
#13 to TO 1
#14 tournament NN 1
#15 week NN 1
#16 will MD 1