Conditional string matching in an R character vector: collapsing selected elements

Time: 2019-01-15 23:48:44

Tags: r string data-cleaning stringr

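I have a character vector in which I want to match a specific string, collapse only the elements containing that match with the next element of the vector, and let the process continue until the end of the vector. For example, one such case: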

'"FundSponsor:Blackrock Advisors" "Category:"  "Tax-Free Income-Pennsylvania"  "Ticker:"  "MPA" "NAV Ticker:" "XMPAX"                          "Average Daily Volume (shares):" "26,000"                         "Average Daily Volume (USD):"    "$0.335M"                        "Inception Date:"  "10/30/1992" "Inception Share Price:" "$15.00"                         "Inception NAV:" "$14.18" "Tender Offer:" "No"                             "Term:" "No"'   

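It would be great to combine each element containing : with only the element that follows it, but I have been struggling with paste, because it generally collapses the entire vector into a single element, which is not the more targeted solution I am looking for.

Here is an example of what I would like part of the modified output to look like: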

"Inception Share Price:$15.00"

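2 Answers:

Answer 0: (score: 0)

I am not sure whether you want the result in a single key:value format, or whether you just want to clean up that long string into the form key1: value1 key2: value2 key3: value3. In the latter case, you can do it with the following code: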

char = '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA" "NAV Ticker:" "XMPAX" "Average Daily Volume (shares):" "26,000" "Average Daily Volume (USD):" "$0.335M" "Inception Date:" "10/30/1992" "Inception Share Price:" "$15.00" "Inception NAV:" "$14.18" "Tender Offer:" "No" "Term:" "No"'

char_tidy = gsub('\\" \\"', " ", char)

# output is below
> char_tidy
[1] "\"FundSponsor:Blackrock Advisors Category: Tax-Free Income-Pennsylvania Ticker: MPA NAV Ticker: XMPAX Average Daily Volume (shares): 26,000 Average Daily Volume (USD): $0.335M Inception Date: 10/30/1992 Inception Share Price: $15.00 Inception NAV: $14.18 Tender Offer: No Term: No\""
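If separate elements are wanted rather than one cleaned string, the quoted fields can also be pulled out individually. This is a minimal sketch, not part of the answer above: it assumes that no field contains an embedded double quote, it uses a shortened copy of the input, and the names quoted and fields are illustrative only.

```r
# Shortened copy of the input string (assumption: no field contains a double quote)
char <- '"FundSponsor:Blackrock Advisors" "Category:" "Tax-Free Income-Pennsylvania" "Ticker:" "MPA"'

# Match every double-quoted run, then strip the surrounding quotes
quoted <- regmatches(char, gregexpr('"[^"]*"', char))[[1]]
fields <- gsub('"', '', quoted, fixed = TRUE)
fields
```

The result is an ordinary character vector with one element per quoted field, which can then be paired up key by key.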

Answer 1: (score: 0)

The following might help:

First split the string with strsplit, then bind together the elements that belong together.

# split the string
vec <- unlist(strsplit(string, '(?=\")(?=\")', perl = TRUE))
vec <- vec[! vec %in% c(' ', '\"')]
# this is what vec looks like now
head(vec)
# [1] "FundSponsor:Blackrock Advisors" "Category:"                      "Tax-Free Income-Pennsylvania"   "Ticker:"                        "MPA"                           
# [6] "NAV Ticker:"    
#
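# now paste the elements
ind <- grepl(':.+', vec)   # TRUE for elements that are already complete key:value pairs
tmp <- vec[!ind]           # the remaining elements alternate key, value, key, value, ...
pairs <- paste0(tmp[seq(1, length(tmp), 2)], tmp[seq(2, length(tmp), 2)])
# assigning pairs into vec[!ind] would silently recycle them and leave duplicates at the end,
# so rebuild vec instead (already complete elements first, then the pasted pairs)
vec <- c(vec[ind], pairs)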
head(vec)
# [1] "FundSponsor:Blackrock Advisors"        "Category:Tax-Free Income-Pennsylvania" "Ticker:MPA"                            "NAV Ticker:XMPAX"                     
# [5] "Average Daily Volume (shares):26,000"  "Average Daily Volume (USD):$0.335M" 
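When the data is already a character vector, as in the question, the pairing can also be done in place so that the original order is preserved. The following is a sketch under two assumptions: every element ending in : is a bare key, and each bare key is followed by its value. The helper name collapse_keys is hypothetical and does not come from the answer.

```r
# Hypothetical helper: glue each element that ends in ':' onto the element
# immediately after it, leaving all other elements untouched.
collapse_keys <- function(x) {
  key_pos <- which(grepl(":$", x))          # positions of bare keys
  key_pos <- key_pos[key_pos < length(x)]   # a trailing key has nothing to absorb
  if (length(key_pos) == 0) return(x)
  x[key_pos] <- paste0(x[key_pos], x[key_pos + 1])
  x[-(key_pos + 1)]                         # drop the values that were absorbed
}

fields <- c("FundSponsor:Blackrock Advisors", "Category:",
            "Tax-Free Income-Pennsylvania", "Inception Share Price:", "$15.00")
collapse_keys(fields)
# elements: "FundSponsor:Blackrock Advisors",
#           "Category:Tax-Free Income-Pennsylvania",
#           "Inception Share Price:$15.00"
```

Unlike a single paste call with collapse, this touches only the key elements and their immediate successors, which is the targeted behaviour the question asks for.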

With the data

string = "\"FundSponsor:Blackrock Advisors\" \"Category:\" \"Tax-Free Income-Pennsylvania\" \"Ticker:\" \"MPA\" \"NAV Ticker:\" \"XMPAX\" \"Average Daily Volume (shares):\" \"26,000\" \"Average Daily Volume (USD):\" \"$0.335M\" \"Inception Date:\" \"10/30/1992\" \"Inception Share Price:\" \"$15.00\" \"Inception NAV:\" \"$14.18\" \"Tender Offer:\" \"No\" \"Term:\" \"No\""

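Explanation

  • The regex (?=\")(?=\") basically tells R to split the string at every \". The syntax (?=*something*) is a lookahead: it matches a position that is followed by *something* without consuming it. The two lookaheads here are identical, so the pattern simply reads: split the string at every position that comes right before a \".
  • Because of this, strsplit(...) above also creates elements that consist only of \" or of a single space ('\"Category:\" \"...' becomes the vector '\"'; 'Category:'; '\"'; ' '; '...'). Using ! vec %in% c(...), we can remove those unwanted elements.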
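The splitting behaviour described above can be checked on a tiny input. This is only an illustrative sketch: the string '"a:" "b"' and the name tokens are made up for the demonstration, and a single lookahead is used because repeating it does not change the match.

```r
# Zero-width lookahead: split at every position that is followed by a double quote
tokens <- strsplit('"a:" "b"', '(?=")', perl = TRUE)[[1]]
tokens
# the quotes and the separating space come back as elements of their own:
# '"', 'a:', '"', ' ', '"', 'b', '"'

# dropping the quote-only and space-only elements leaves the field contents
tokens[!tokens %in% c(' ', '"')]
```

The field contents sit at positions 2, 6, 10, ... of the raw result, which is why filtering out the quote and space elements recovers them.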

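Addendum

If the data contains elements of the form "string:" followed by " " (a blank value), then remove the line vec <- vec[! vec %in% c(' ', '\"')] from the code above and add these lines instead: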

vec <- vec[seq(2L, length(vec), 4L)]
vec[vec == ' '] <- NA_character_
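As a quick check of the addendum, here is a sketch on a made-up input in which the value of Term: is blank. The names string2 and vec2 are illustrative only, and the approach assumes that every field, including a blank one, is wrapped in double quotes and separated from the next by exactly one space.

```r
# Made-up input: the value that follows "Term:" is a single space
string2 <- '"Ticker:" "MPA" "Term:" " " "Tender Offer:" "No"'

vec2 <- unlist(strsplit(string2, '(?=\")(?=\")', perl = TRUE))
# each field contributes: quote, content, quote, separator, so contents sit at 2, 6, 10, ...
vec2 <- vec2[seq(2L, length(vec2), 4L)]
vec2[vec2 == ' '] <- NA_character_   # mark blank values as missing
vec2
# elements: "Ticker:", "MPA", "Term:", NA, "Tender Offer:", "No"
```

Selecting by position instead of filtering out every ' ' keeps the blank value in its slot, so keys and values still alternate afterwards.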