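I have been trying to split one column of a data frame into 3 separate columns. I managed to split out 2 of the columns I need, but I cannot extract the date (year only). This is the code I used to create 2 of the columns I want: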
wines$Winery <- lapply(strsplit(as.character(wines$wine), "[0-9]{4}"), "[", 1)
wines$Name <- lapply(strsplit(as.character(wines$wine), "[0-9]{4}"), "[", 2)
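I tried using gsub to remove all non-numeric characters, but there are some numeric characters I do not want to capture. All I want is the 4-digit year in the middle of the column, and not all rows list a year.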
# winery wine
# 1 Charles Smith Charles Smith 2012 Royal City Syrah
# 2 K Vintners K Vintners 2012 Cattle King Syrah
# 3 K Vintners K Vintners 2012 Klein Syrah
# 4 Two Vintners Two Vintners 2013 Make Haste Cinsault
# 5 K Vintners K Vintners 2012 The Hidden Syrah
# 6 Kerloo Kerloo 2013 Stone Tree Malbec
# 7 Betz Family Betz Family 2012 Le Parrain Cabernet Sauvignon
# 8 Kerloo Kerloo 2013 Stone Tree Vineyard Cabernet Sauvignon
# 9 Efeste Efeste 2012 Big Papa Cabernet Sauvignon
# 10 Two Vintners Two Vintners 2013 Boushey Vineyard Orenache
# 11 K Vintners K Vintners 2012 Morrison Lane Syrah
# 12 K Vintners K Vintners 2012 The Creator Red
The data was collected by web scraping, so I have included a sample of what the data looks like; the full set has over 1,000 rows.
Data
wines <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "winery wine
'Charles Smith' 'Charles Smith 2012 Royal City Syrah'
'K Vintners' 'K Vintners 2012 Cattle King Syrah'
'K Vintners' 'K Vintners 2012 Klein Syrah'
'Two Vintners' 'Two Vintners 2013 Make Haste Cinsault'
'K Vintners' 'K Vintners 2012 The Hidden Syrah'
Kerloo 'Kerloo 2013 Stone Tree Malbec'
'Betz Family' 'Betz Family 2012 Le Parrain Cabernet Sauvignon'
Kerloo 'Kerloo 2013 Stone Tree Vineyard Cabernet Sauvignon'
Efeste 'Efeste 2012 Big Papa Cabernet Sauvignon'
'Two Vintners' 'Two Vintners 2013 Boushey Vineyard Orenache'
'K Vintners' 'K Vintners 2012 Morrison Lane Syrah'
'K Vintners' 'K Vintners 2012 The Creator Red'")
Answer 0 (score: 2)
To get the date, you can strip out all non-digit characters (this assumes the year is the only number in the string):
gsub('\\D', '', wines$wine)
# [1] "2012" "2012" "2012" "2013" "2012" "2013" "2012" "2013" "2012" "2013" "2012" "2012"
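If other numbers can appear in the string, or some rows have no year at all, removing every non-digit gives wrong or empty results. A minimal sketch using base R's regexpr and regmatches instead, assuming vintages are standalone 4-digit numbers beginning with 19 or 20 (that restriction and the sample strings below are assumptions, not taken from the question's data):

```r
# Hypothetical sample: one entry has no vintage, another has an extra number.
x <- c("K Vintners 2012 Klein Syrah",
       "Kerloo Stone Tree Malbec",
       "Efeste 2012 Big Papa No. 5 Cabernet Sauvignon")

# Match a standalone 4-digit number starting with 19 or 20
# (the century restriction is an assumption about plausible vintages).
m <- regexpr("\\b(19|20)\\d{2}\\b", x, perl = TRUE)

year <- rep(NA_integer_, length(x))             # default: no year found
year[m != -1] <- as.integer(regmatches(x, m))   # fill in rows that matched
year
# [1] 2012   NA 2012
```

Rows without a match stay NA, so the result always has one element per input row and can be assigned directly to a new column.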
Or split your string:
do.call('rbind', strsplit(wines$wine, ' (?=\\d{4})|(?<=\\d{4}) ', perl = TRUE))
# [,1] [,2] [,3]
# [1,] "Charles Smith" "2012" "Royal City Syrah"
# [2,] "K Vintners" "2012" "Cattle King Syrah"
# [3,] "K Vintners" "2012" "Klein Syrah"
# [4,] "Two Vintners" "2013" "Make Haste Cinsault"
# [5,] "K Vintners" "2012" "The Hidden Syrah"
# [6,] "Kerloo" "2013" "Stone Tree Malbec"
# [7,] "Betz Family" "2012" "Le Parrain Cabernet Sauvignon"
# [8,] "Kerloo" "2013" "Stone Tree Vineyard Cabernet Sauvignon"
# [9,] "Efeste" "2012" "Big Papa Cabernet Sauvignon"
# [10,] "Two Vintners" "2013" "Boushey Vineyard Orenache"
# [11,] "K Vintners" "2012" "Morrison Lane Syrah"
# [12,] "K Vintners" "2012" "The Creator Red"
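Note that the rbind call above assumes every string splits into exactly three pieces. A row with no year yields a single piece, which rbind would recycle across all three columns, possibly without a warning. A sketch that attaches the result as named columns and leaves such rows as NA (the column names Winery, Year and Name, and the third sample row, are assumptions for illustration):

```r
# Two rows from the question plus a hypothetical row with no vintage.
wines <- data.frame(
  wine = c("Charles Smith 2012 Royal City Syrah",
           "Kerloo 2013 Stone Tree Malbec",
           "Efeste Big Papa Cabernet Sauvignon"),
  stringsAsFactors = FALSE
)

pieces <- strsplit(wines$wine, " (?=\\d{4})|(?<=\\d{4}) ", perl = TRUE)
ok <- lengths(pieces) == 3   # TRUE where the split gave winery / year / name

# Fill only the rows that split cleanly; the others stay NA.
parts <- matrix(NA_character_, nrow = nrow(wines), ncol = 3)
if (any(ok)) parts[ok, ] <- do.call("rbind", pieces[ok])

wines$Winery <- parts[, 1]
wines$Year   <- as.integer(parts[, 2])
wines$Name   <- parts[, 3]
```

Inspecting wines[!ok, ] afterwards shows which rows need to be handled separately.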
Or do it all in one go (essentially the same as above):
read.csv(text = gsub(' (?=\\d{4})|(?<=\\d{4}) ', ',', wines$wine, perl = TRUE), header = FALSE)
# V1 V2 V3
# 1 Charles Smith 2012 Royal City Syrah
# 2 K Vintners 2012 Cattle King Syrah
# 3 K Vintners 2012 Klein Syrah
# 4 Two Vintners 2013 Make Haste Cinsault
# 5 K Vintners 2012 The Hidden Syrah
# 6 Kerloo 2013 Stone Tree Malbec
# 7 Betz Family 2012 Le Parrain Cabernet Sauvignon
# 8 Kerloo 2013 Stone Tree Vineyard Cabernet Sauvignon
# 9 Efeste 2012 Big Papa Cabernet Sauvignon
# 10 Two Vintners 2013 Boushey Vineyard Orenache
# 11 K Vintners 2012 Morrison Lane Syrah
# 12 K Vintners 2012 The Creator Red
Answer 1 (score: 1)
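You can use a workaround like this: replace every substring you want to split on with a custom delimiter string that does not occur in your data, e.g. gsub("\\s*(\\d{4}|\\bNV\\b)\\s*", "#-#\\1#-#", wines$wine), which matches 4-digit chunks and NV as a whole word (see the regex demo), and then split on that delimiter.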
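The code for the split step is not shown in this answer, so the following is only a minimal sketch of how both steps could fit together. The sample strings are assumptions (the second entry is made up to exercise the NV case), and perl = TRUE is added here as a precaution; the gsub call quoted above omits it.

```r
x <- c("K Vintners 2012 Klein Syrah",
       "Hypothetical Cellars NV Sparkling Rose")   # second entry is made up

# Step 1: wrap each 4-digit block, or NV as a whole word, in the delimiter "#-#".
marked <- gsub("\\s*(\\d{4}|\\bNV\\b)\\s*", "#-#\\1#-#", x, perl = TRUE)

# Step 2: split on the literal delimiter (fixed = TRUE disables regex matching).
parts <- do.call("rbind", strsplit(marked, "#-#", fixed = TRUE))
parts
```

As with the other rbind-based approaches, this assumes each string contains exactly one year or NV token; a delimiter such as "#-#" only works if it never appears in the scraped text.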