I have an external data file like the one below, with no delimiters between columns:
PLAYER TEAM STUFF1 STUFF2
Jim Smith NYY 100 200
Jerry Johnson Jr. PHI 100 200
Andrew C. James STL 200 200
A. J. Williams CWS 100 200
Felix Rodriguez BAL 100 100
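How can I read this file? I am considering using readLines
and splitting each string before any sequence of three consecutive capital letters. However, I do not know how to do that.
And what if only the first letter of the team name is capitalized?
Below is a similar file, in which a name is followed by columns of numbers. I can read those data with the code further down: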
TEAM STUFF1 STUFF2
New York Yankees 100 200
Philadelphia Phillies 100 200
Boston Red Sox 200 200
Los Angeles Angels 100 200
Chicago White Sox 100 100
Chicago Cubs 200 100
New York Mets 200 200
San Francisco Giants 100 300
Minnesota Twins 100 300
St. Louis Cardinals 200 300
Here is the code that reads the second data set:
setwd('c:/users/mmiller21/simple R programs/')
my.data3 <- readLines('team.names.with.spaces.txt')
# split between desired columns
my.data4 <- do.call(rbind, strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = T))
# returns string w/o leading or trailing whitespace
# This function is not mine and was found on Stack Overflow
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
my.data5 <- trim(my.data4)
# remove header
my.data6 <- my.data5[-1,]
# convert to data.frame
my.data6 <- data.frame(my.data6, stringsAsFactors = FALSE)
my.data6[,2] <- as.numeric(my.data6[,2])
my.data6[,3] <- as.numeric(my.data6[,3])
my.data6
                      X1  X2  X3
1       New York Yankees 100 200
2  Philadelphia Phillies 100 200
3         Boston Red Sox 200 200
4     Los Angeles Angels 100 200
5      Chicago White Sox 100 100
6           Chicago Cubs 200 100
7          New York Mets 200 200
8   San Francisco Giants 100 300
9        Minnesota Twins 100 300
10   St. Louis Cardinals 200 300
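Thanks for any advice. I would prefer a solution in base R.
Answer 0 (score: 1)
Here is a simple solution that meets your requirements. It tokenizes each line on spaces and then rebuilds the name. It assumes the name is the only field that contains more than one token. Note that the original spacing inside a name may not be preserved exactly, and the code may not work correctly if the file contains embedded tabs instead of spaces: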
library(stringr)
# read the player file (name, team, two numeric columns)
lines = readLines("player.names.with.spaces.txt");
for (line in lines[2:length(lines)]) {
toks = strsplit(str_trim(line), " +")[[1]];
ntoks = length(toks);
name = paste(toks[1:(ntoks-3)], collapse=' ');
team = toks[ntoks-2];
num1 = as.integer(toks[ntoks-1]);
num2 = as.integer(toks[ntoks]);
print(line)
print(name)
print(team)
print(num1)
print(num2)
}
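I recommend using str_trim() unless your file is always cleanly structured, in which case you could drop the stringr dependency. The output for the first two data lines looks like this: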
[1] "Jim Smith NYY 100 200"
[1] "Jim Smith"
[1] "NYY"
[1] 100
[1] 200
[1] "Jerry Johnson Jr. PHI 100 200"
[1] "Jerry Johnson Jr."
[1] "PHI"
[1] 100
[1] 200
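Since the question asks for base R and for a data set rather than printed values, the same token-based idea can be sketched without stringr. This is only a minimal sketch: the sample lines stand in for the file, the helper name parse_tokens is made up for illustration, and it assumes every data line ends with a team token followed by two integer tokens.

```r
# Sample lines standing in for the file contents (assumed layout:
# name tokens, one team token, two integer tokens)
lines <- c("PLAYER TEAM STUFF1 STUFF2",
           "Jim Smith NYY 100 200",
           "Jerry Johnson Jr. PHI 100 200",
           "Andrew C. James STL 200 200")

# Drop the header, trim with base R's trimws(), and tokenize on runs of spaces
toks <- strsplit(trimws(lines[-1]), " +")

# Hypothetical helper: everything before the last three tokens is the name
parse_tokens <- function(tk) {
  n <- length(tk)
  data.frame(PLAYER = paste(tk[seq_len(n - 3)], collapse = " "),
             TEAM   = tk[n - 2],
             STUFF1 = as.integer(tk[n - 1]),
             STUFF2 = as.integer(tk[n]),
             stringsAsFactors = FALSE)
}

# One row per line, stacked into a single data frame
players <- do.call(rbind, lapply(toks, parse_tokens))
players
```

As with the loop above, multiple spaces inside a name are collapsed to one when the name is rebuilt.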
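As an alternative, you can use str_locate(), which copes more robustly with multiple spaces or punctuation inside names (hyphenated names, names with commas):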
library(stringr)
x="Jerry Johnson Jr. PHI 100 200"
ndx = str_locate(x," +[A-Z]{3} +[0-9]+ +[0-9]+")[1]
name = substr(x,1,ndx-1);
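For readers who want to avoid stringr, base R's regexpr() also returns the start position of the first match, so the same idea might be sketched as follows. The sample strings and the variable rest are illustrative, and the pattern assumes a three-capital team code followed by two numeric fields at the end of the line.

```r
x <- c("Jerry Johnson Jr. PHI 100 200",
       "A. J. Williams CWS 100 200")

# regexpr() gives the 1-based start of the first match, or -1 if there is none
# (a line with no match would end up with an empty name here)
ndx <- regexpr(" +[A-Z]{3} +[0-9]+ +[0-9]+ *$", x)

# Everything before the match is the name; the match itself holds the other fields
name <- substr(x, 1, ndx - 1)
rest <- trimws(substring(x, ndx))

name
rest
```

The remaining fields in rest are separated by spaces only, so they can be split with strsplit(rest, " +").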
Answer 1 (score: 0)
This splits each string before three consecutive capital letters:
setwd('c:/users/mmiller21/simple R programs/')
my.data3 <- readLines('player.names.with.spaces.txt')
strsplit(my.data3, split = "(?<=[ ])(?=[A-Z]{3})", perl = T)
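I can take it from there. However, I am still interested in how to read the file when only the first letter of the team name is capitalized.
Here is the result of the code above: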
[[1]]
[1] "PLAYER " "TEAM " "STUFF1 " "STUFF2"
[[2]]
[1] "Jim Smith " "NYY 100 200"
[[3]]
[1] "Jerry Johnson Jr. " "PHI 100 200"
[[4]]
[1] "Andrew C. James " "STL 200 200"
[[5]]
[1] "A. J. Williams " "CWS 100 200"
[[6]]
[1] "Felix Rodriguez " "BAL 100 100"
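One possible way to finish from the split above is to bind the pieces into a matrix and let read.table() parse the second piece. This is a sketch rather than the only way to do it; the inline vector stands in for the file, and it assumes each data line splits into exactly two pieces.

```r
# Inline lines standing in for readLines('player.names.with.spaces.txt')
my.data3 <- c("PLAYER TEAM STUFF1 STUFF2",
              "Jim Smith NYY 100 200",
              "Jerry Johnson Jr. PHI 100 200",
              "A. J. Williams CWS 100 200")

# Split each data line once, just before the three-capital team code
parts <- strsplit(my.data3[-1], split = "(?<=[ ])(?=[A-Z]{3})", perl = TRUE)
parts <- do.call(rbind, parts)   # column 1: name, column 2: "TEAM n1 n2"

# Let read.table() parse the whitespace-separated remainder
rest <- read.table(text = parts[, 2], stringsAsFactors = FALSE,
                   col.names = c("TEAM", "STUFF1", "STUFF2"))

# Attach the trimmed names
players <- data.frame(PLAYER = trimws(parts[, 1]), rest,
                      stringsAsFactors = FALSE)
players
```

If a player's name itself contained a three-capital token after a space, that line would split into more than two pieces and rbind() would no longer produce a clean two-column matrix.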
If some team names have three capital letters and others have two, as in the following data set, the code below the data handles both cases:
PLAYER TEAM STUFF1 STUFF2
Jim Smith NYY 100 200
Jerry Johnson Jr. TB 100 200
Andrew C. James STL 200 200
A. J. Williams TB 100 200
Felix Rodriguez CWS 100 100
my.data3 <- readLines('player.names.with.spaces3.txt')
strsplit(my.data3, split = "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))", perl = T)
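Any run of three capital letters also begins with two capital letters, so the two-letter lookahead on its own should find the same split points as the alternation. The short check below compares the two patterns on a few sample lines; the variable names both and two are only illustrative.

```r
x <- c("PLAYER TEAM STUFF1 STUFF2",
       "Jim Smith NYY 100 200",
       "Jerry Johnson Jr. TB 100 200")

# Pattern from the code above: lookahead for two OR three capitals
both <- strsplit(x, "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))", perl = TRUE)

# Shorter pattern: lookahead for two capitals only
two <- strsplit(x, "(?<=[ ])(?=[A-Z]{2})", perl = TRUE)

# Compare the two results element by element
identical(both, two)
```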
If the team names are not all capital letters, as in this data set:
PLAYER TEAM STUFF1 STUFF2
Jim Smith NYY 100 200
Jerry Johnson Jr. Phi 100 200
Andrew C. James StL 200 200
A. J. Williams CWS 100 200
Felix Rodriguez Bal 100 100
the following code, which uses more than one split, seems to work:
setwd('c:/users/mmiller21/simple R programs/')
my.data3 <- readLines('player.names.with.spaces2.txt')
my.data4 <- strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = T)
my.data5 <- do.call(rbind, my.data4[])
my.data5 <- my.data5[-1,]
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
my.data6 <- trim(my.data5)
my.data7 <- strsplit(my.data6[,1], ' (?=[^ ]+$)', perl=TRUE)
my.data8 <- do.call(rbind, my.data7[])
my.data9 <- trim(my.data8)
my.data10 <- cbind(my.data9, my.data6[,2:3])
my.data10
The result is:
     [,1]                [,2]  [,3]  [,4]
[1,] "Jim Smith"         "NYY" "100" "200"
[2,] "Jerry Johnson Jr." "Phi" "100" "200"
[3,] "Andrew C. James"   "StL" "200" "200"
[4,] "A. J. Williams"    "CWS" "100" "200"
[5,] "Felix Rodriguez"   "Bal" "100" "100"
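The two splits above could probably be folded into a single regular expression with capture groups, using base R's regexec() and regmatches(). The sketch below assumes that every data line ends with one space-free team token followed by two integer fields, and that there is no leading whitespace; the pattern and the names pat, fields and players are illustrative.

```r
# Inline lines standing in for readLines('player.names.with.spaces2.txt')
my.data3 <- c("PLAYER TEAM STUFF1 STUFF2",
              "Jim Smith NYY 100 200",
              "Jerry Johnson Jr. Phi 100 200",
              "Andrew C. James StL 200 200")

# Groups: name (ends in a non-space), team (no spaces), two integer fields
pat <- "^(.*[^ ]) +([^ ]+) +([0-9]+) +([0-9]+) *$"

# regmatches() returns, per line, the full match followed by the four groups
# (a line that does not match yields character(0) and would break the rbind)
m <- regmatches(my.data3[-1], regexec(pat, my.data3[-1]))
fields <- do.call(rbind, m)[, -1, drop = FALSE]   # drop the full-match column

players <- data.frame(PLAYER = fields[, 1],
                      TEAM   = fields[, 2],
                      STUFF1 = as.integer(fields[, 3]),
                      STUFF2 = as.integer(fields[, 4]),
                      stringsAsFactors = FALSE)
players
```

Because the pattern is anchored at the end of the line, it does not depend on how the team name is capitalized.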