Extracting data from a string in R

Time: 2018-04-29 10:59:16

Tags: r stringr

I have a data frame with a column named category. The data in that column looks like this:

Application Platforms|Real Time|Social Network Media
Apps|Games|Mobile
Curated Web
Software
Games
Biotechnology
Analytics
Mobile
E-Commerce
Entertainment|Games|Software
Networking|Real Estate|Web Hosting

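Each category value is a list of sub-sectors separated by a pipe (vertical bar, "|"). I want to extract the primary sector, which is the first string before the first vertical bar. Values without a vertical bar should be kept as they are.

That is, the output should be: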

Application Platforms
Apps
Curated Web
Software
Games
Biotechnology
Analytics
Mobile
E-Commerce
Entertainment
Networking

How can I achieve this with any function? I have tried the stringr package functions.

3 answers:

Answer 0: (score: 2)

We can use sub here:

df$category <- sub("^([^|]+).*", "\\1", df$category)

Here is another variant that does not use a capture group:

df$category <- sub("\\|.*", "", df$category)
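As a quick check, the two sub calls can be run side by side on a few values from the question. This is only a sketch: the data frame below is rebuilt by hand from the sample data, and the names with_capture and without_capture are made up for illustration.

```r
# Sample values copied from the question; `df` is assumed to be a data frame
# with a character column named `category`.
df <- data.frame(
  category = c("Application Platforms|Real Time|Social Network Media",
               "Apps|Games|Mobile",
               "Curated Web",
               "E-Commerce",
               "Entertainment|Games|Software"),
  stringsAsFactors = FALSE
)

# Variant 1: capture the leading run of non-pipe characters and keep only that.
with_capture <- sub("^([^|]+).*", "\\1", df$category)

# Variant 2: delete the first pipe and everything after it.
without_capture <- sub("\\|.*", "", df$category)

print(with_capture)
print(identical(with_capture, without_capture))
```

On this data both variants give the same result. They can differ on edge cases: for a value that starts with a pipe, variant 1 finds no match and returns the value unchanged, while variant 2 returns an empty string.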


Answer 1: (score: 2)

Use strsplit:

category1 <- strsplit(df$category, "|", fixed = TRUE)
df$category <- sapply(category1, `[[`, 1)     # or, purrr::map_chr(category1, 1)

I think this solution expresses your intent more clearly than sub does. On the other hand, it takes an extra line.
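If the column may be a factor or contain missing or empty values, a slightly more defensive version of the strsplit approach could look like the sketch below. The input vector is hypothetical and is not taken from the question.

```r
# Hypothetical input: a factor column (read.csv() created factors by default
# before R 4.0) containing one missing value.
category <- factor(c("Apps|Games|Mobile", "Software", NA))

# strsplit() only accepts character vectors, so convert the factor first.
parts <- strsplit(as.character(category), "|", fixed = TRUE)

# p[1] yields NA for an empty result instead of raising an error, and
# vapply() verifies that every element is a single string.
primary <- vapply(parts, function(p) p[1], character(1))

print(primary)
```

Using `p[1]` rather than `[[` is a design choice here: an empty string splits into a zero-length vector, on which `[[` with index 1 would stop with a "subscript out of bounds" error.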

Answer 2: (score: 1)

Or use stringr ...

str_match("Application Platforms|Real Time|Social Network Media",
       "^(.+?)(?:\\||$)")[,2] #match start of string up to first | or end of string

[1] "Application Platforms"

... or

str_replace("Application Platforms|Real Time|Social Network Media",
       "\\|.*$","") #replace first | and everything after it with ""

[1] "Application Platforms"

... or

str_extract("Application Platforms|Real Time|Social Network Media",
       "[^|]+") #extract first sequence of characters that are not a |

[1] "Application Platforms"

... or

str_split_fixed("Application Platforms|Real Time|Social Network Media",
       "\\|",2)[,1] #split at first | and take the first section

[1] "Application Platforms"
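All of these stringr functions are vectorised, so they can be applied to a whole column at once. The sketch below runs the four approaches on a short vector that includes a value with no pipe. The vector and the primary_* names are made up for illustration. One thing to watch with the str_match approach: inside a character class, `$` is a literal dollar sign rather than an end-of-string anchor, so a pattern like `[|$]` will not match values without a pipe. The alternation `(?:\\||$)` is used here instead.

```r
library(stringr)

# Hypothetical sample: two piped values and one value with no pipe.
category <- c("Application Platforms|Real Time|Social Network Media",
              "Software",
              "Entertainment|Games|Software")

# First run of characters that are not a pipe.
primary_extract <- str_extract(category, "[^|]+")

# Split into at most two pieces at the first literal pipe, keep the first.
primary_split <- str_split_fixed(category, fixed("|"), 2)[, 1]

# Remove the first pipe and everything after it.
primary_replace <- str_replace(category, "\\|.*$", "")

# Capture lazily up to the first pipe or the end of the string.
primary_match <- str_match(category, "^(.+?)(?:\\||$)")[, 2]

print(primary_extract)
```

With a data frame, the same call would be assigned back to the column, for example df$category <- str_extract(df$category, "[^|]+"), assuming df$category is a character vector.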