I have data imported from a .csv file. The first column contains strings that include text in parentheses. The data looks like this:
symbol
___________________________________________
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)
...
I need to convert this column into a list such as:
symbol2
___________________________________________
1 | AFT
2 | AIF
3 | AIMC
4 | ALLE
5 | APH
6 | ARES
7 | ARR
8 | BANC
9 | BCX
10| BDC
...
My end goal is a single string in which the pieces of text from the parentheses are separated by ";". Like this:
"AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC;..."
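I can do the last step with
paste(symbol2, collapse = ";")
but I can't figure out how to isolate the text I need.
I tried everything listed here (extract a substring in R according to a pattern), replacing ":" with "(", but couldn't get it to work. I also tried: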
gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", symbol, perl=T)
as suggested here (Extract text in parentheses in R), but the output is
"c(4, 5, 2, 1, 3, 6, 7, 8, 17, 9,...)"
Any help?
Answer 0 (score: 1)
We can use str_extract from stringr to extract the contents of the parentheses:
library(stringr)
symbol2 <- str_extract(df$symbol, "(?<=\\().+?(?=\\))")
symbol2
#[1] "AFT" "AIF" "AIMC" "ALLE" "APH" "ARES"
The regex is taken from here.
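The pattern uses a lookbehind for "(" and a lookahead for ")", so only the text between them is matched. As a rough sketch, the same pattern can be applied in base R with regexpr and regmatches; the vector x and its contents are made-up sample input, not the asker's data. Note that regmatches drops elements without a match, whereas str_extract returns NA for them.

```r
# Hypothetical sample input; `x` is an illustrative name, not from the question
x <- c("Allegion plc (ALLE)", "Belden Inc (BDC)", "No ticker here")

# Same lookbehind/lookahead pattern as in the str_extract call
pattern <- "(?<=\\().+?(?=\\))"

# regexpr() locates the first match per element; perl = TRUE enables lookarounds
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns only the matched substrings (non-matches are dropped)
matched <- regmatches(x, m)

# To keep one result per input element, fill non-matches with NA
hit <- m != -1L
out <- rep(NA_character_, length(x))
out[hit] <- matched
out
```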
You can then paste them together:
paste(symbol2, collapse = ";")
#[1] "AFT;AIF;AIMC;ALLE;APH;ARES"
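A possible explanation for the "c(4, 5, 2, ...)" output in the question, offered as a guess because the question does not show what symbol is: if a whole data frame with a factor column is passed to gsub instead of a single column, it is coerced with as.character, which yields one deparsed string of factor codes per column. The sketch below uses a made-up two-row data frame, and stringsAsFactors = TRUE is set explicitly to mimic the read.csv default before R 4.0.0.

```r
# Hypothetical data; stringsAsFactors = TRUE mimics read.csv() before R 4.0.0
df <- data.frame(symbol = c("Allegion plc (ALLE)", "Belden Inc (BDC)"),
                 stringsAsFactors = TRUE)

# Whole data frame: coerced to one string per column, built from the
# integer factor codes rather than from the company names
whole_df <- as.character(df)
length(whole_df)

# Single column: one string per row, so the regex sees the real text
by_column <- gsub(".*\\(|\\).*", "", df$symbol)
by_column
```

If that is what happened, passing df$symbol (as the answer above does) rather than the data frame itself should be enough to fix it.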
Answer 1 (score: 1)
Here is an option using base R's sub with a capture group:
df$symbol2 <- sub(".+\\((\\w+)\\)$", "\\1", df$V1)
df
# V1 symbol2
#1 Apollo Senior Floating Rate Fund Inc. (AFT) AFT
#2 Apollo Tactical Income Fund Inc. (AIF) AIF
#3 Altra Industrial Motion Corp. (AIMC) AIMC
#4 Allegion plc (ALLE) ALLE
#5 Amphenol Corporation (APH) APH
#6 Ares Management Corporation (ARES) ARES
#7 ARMOUR Residential REIT, Inc. (ARR) ARR
#8 Banc of California, Inc. (BANC) BANC
#9 BlackRock Resources (BCX) BCX
#10 Belden Inc (BDC) BDC
df <- read.table(text =
"'Apollo Senior Floating Rate Fund Inc. (AFT)'
'Apollo Tactical Income Fund Inc. (AIF)'
'Altra Industrial Motion Corp. (AIMC)'
'Allegion plc (ALLE)'
'Amphenol Corporation (APH)'
'Ares Management Corporation (ARES)'
'ARMOUR Residential REIT, Inc. (ARR)'
'Banc of California, Inc. (BANC)'
'BlackRock Resources (BCX)'
'Belden Inc (BDC)'", header = F)
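One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input string unchanged, so a row without a trailing "(SYMBOL)" would end up with the full name in symbol2. A minimal sketch of one way to guard against that, using made-up input (the vector x and the "EX.B" symbol are hypothetical):

```r
# Hypothetical input: the second and third elements do not match the pattern
# (no parentheses; a dot in the symbol, which \\w does not cover)
x <- c("Allegion plc (ALLE)", "Company without ticker", "Example Co (EX.B)")

pattern <- ".+\\((\\w+)\\)$"

# Plain sub(): non-matching elements come back unchanged
unguarded <- sub(pattern, "\\1", x)

# Guarded version: return NA where the pattern does not match
guarded <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
guarded
```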
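Answer 2 (score: 1)
1) read.table Use read.table with the sep and comment values shown below to get a two-column data frame in which the first column holds the names and the second holds the symbols. Finally, select the second column and collapse it into a single string. No packages or regular expressions are used.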
DF2 <- read.table(text = unlist(DF), sep = "(", comment = ")")
paste(DF2[[2]], collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
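To make the intermediate two-column result visible, here is a small self-contained sketch of the same idea on made-up input. The arguments quote = "" and colClasses = "character" are additions of mine, on the assumption that real names may contain apostrophes and that symbols such as "T" should stay character; comment.char is the full name of the argument that comment partially matches.

```r
# Hypothetical sample input; `x` is an illustrative name
x <- c("Allegion plc (ALLE)", "Belden Inc (BDC)")

# "(" splits each line into name and symbol; ")" starts a "comment",
# so the closing parenthesis and anything after it are discarded
DF2 <- read.table(text = x, sep = "(", comment.char = ")",
                  quote = "", colClasses = "character")
DF2

# Second column holds the symbols
collapsed <- paste(DF2[[2]], collapse = ";")
collapsed
```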
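2) dplyr We can use separate from tidyr to split the column into names and symbols while dropping the names column. Then unlist the result and collapse it into a single string. tidyr 0.8.2 or later is required.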
library(dplyr)
library(tidyr)
DF %>%
separate(symbol, c(NA, "symbol2"), "[()]", extra = "drop") %>%
unlist %>%
paste(collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
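3) gsub We can match everything up to and including the opening parenthesis, i.e. ".*\\(", and everything from the closing parenthesis onward, i.e. "\\).*", and replace both with the empty string. Then collapse as before.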
paste(gsub(".*\\(|\\).*", "", DF$symbol), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
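Because ".*" is greedy, this pattern keeps the contents of the last parenthesized group when a string contains more than one. The sketch below illustrates that on made-up names and shows one possible alternative that keeps the first group instead; both inputs and the variable names are hypothetical.

```r
# Hypothetical names containing two parenthesized groups each
x <- c("Example Holdings (Class A) (EXA)", "Example Corp (EXC) (delisted)")

# The gsub pattern from above keeps the LAST group
last_group <- gsub(".*\\(|\\).*", "", x)
last_group

# One way to keep the FIRST group instead: anchor at the start and
# capture everything between the first "(" and the next ")"
first_group <- sub("^[^(]*\\(([^)]*)\\).*$", "\\1", x)
first_group
```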
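4) trimws This is another base solution. It requires R 3.6.0 or later (r-devel at the time of writing). We define whitespace as any character other than a parenthesis and use trimws to trim it away. Then we define whitespace as a parenthesis and trim that away too. What remains are the symbols, which we can now collapse.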
paste(trimws(trimws(DF$symbol, white = "[^()]"), white = "[()]"), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
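The two trimming passes are easier to follow when run one at a time. A small sketch on made-up input, with the argument spelled out in full as whitespace (white in the call above partially matches it):

```r
# Hypothetical sample input; requires R >= 3.6.0 for the whitespace argument
x <- c("Allegion plc (ALLE)", "Belden Inc (BDC)")

# Pass 1: treat every non-parenthesis character as "whitespace";
# trimming stops at the first "(" on the left and the last ")" on the right
step1 <- trimws(x, whitespace = "[^()]")
step1

# Pass 2: treat the parentheses themselves as "whitespace"
step2 <- trimws(step1, whitespace = "[()]")
step2
```

Since only the ends of each string are trimmed, this presumably works cleanly only when each string has a single parenthesized group at the end.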
The input in reproducible form is:
Lines <- "
symbol
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)"
DF <- read.table(text = Lines, sep = "|", strip.white = TRUE, as.is = TRUE)