This is my text file:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
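All of these values run together, and each character stands for something. I can't figure out how to split each value into columns and assign headers to all of the new columns that would be created.
I used this code, but it doesn't seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))
Any help would be greatly appreciated.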
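Answer 0 (score: 0)
Assuming you have a clear definition of each column, you can solve this directly with a regular expression.
From the column names and the sample data, my guess is that the regular expressions matching each field are: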
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all of these into a single regular expression that captures each field:
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
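Note: in an R string literal the backslash "\" must itself be escaped, which is why we write \\d instead of \d.
That said, here is the code that solves the problem.
First, you need to read the file in as lines of text: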
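To make the escaping note concrete, here is a small base-R sketch. The test strings are made up for illustration; the point is only that the two backslashes in the source code become a single backslash in the string the regex engine receives.

```r
# A regex escape such as \d must be written "\\d" inside an R string literal.
pattern <- "\\d{3}"

# The string itself holds a single backslash, so \d{3} is 5 characters long.
pattern_length <- nchar(pattern)
pattern_length

# cat() prints the string as the regex engine sees it.
cat(pattern, "\n")

# The pattern matches any run of three consecutive digits.
has_three_digits <- grepl(pattern, c("abc123", "ab12c"))
has_three_digits
```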
lines <- readLines("birthweighthw1.txt")
Now we define the regular expression and use the function str_match from the stringr package to convert the data into a character matrix.
library(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
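You can verify that the first column of the matrix contains the full matched text and that the following columns contain the captured fields. So we can drop the first column: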
captured <- captured[,-1]
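Lines that the pattern cannot match come back from str_match() as rows of NA, so a quick check for them is cheap insurance. The sketch below performs the equivalent check in base R with regexec() and regmatches() instead of stringr, writing \d as [0-9]. The third input line is a made-up malformed row, not part of the original file.

```r
# Two lines from the sample data plus one hypothetical malformed line.
lines <- c("1241105.41129.97Y317052.03",
           "2282165.61187.63N364051.40",
           "not a data row")

# Same field layout as the pattern above, written with [0-9] for base R.
pattern <- paste0("([0-9])([0-9]{1,2})([0-9])([0-9]{3}\\.[0-9]{2})",
                  "([0-9]{3}\\.[0-9]{2})(Y|N)([0-9]{3})([0-9]{3}\\.[0-9]{1,2})")

# regmatches() yields character(0) for lines that do not match,
# and the full match plus 8 capture groups for lines that do.
matches <- regmatches(lines, regexec(pattern, lines))
unmatched <- lengths(matches) == 0

# Indices of the lines that need a closer look.
which(unmatched)
```

With stringr, the comparable check would be to look for NA in the first column of the matrix returned by str_match().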
and convert it to a data.frame with appropriate column names:
result <- as.data.frame(captured, stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now every column in result is of type character, and you can convert each one to a more suitable type. For example:
library(dplyr)
result <- result %>% mutate(ethnic    = as.factor(ethnic),
                            age       = as.integer(age),
                            smoke     = as.factor(smoke),
                            preweight = as.numeric(preweight),
                            delweight = as.numeric(delweight),
                            breastfed = as.factor(breastfed),
                            brthwght  = as.integer(brthwght),
                            brthlngth = as.numeric(brthlngth))
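One caveat, judging only from the sample rows: the line 328192.89117.68Y295050.51 has a two-digit integer part in its first weight, so the guessed pattern still matches it but splits it as age 2, smoke 8, preweight 192.89. The trailing digits also look more like a four-digit birth weight followed by a two-digit length (3170 and 52.03) than 317 and 052.03. The sketch below is a variant under those assumptions: a two-digit age, a preweight with two or three integer digits, and a four-digit birth weight. These widths are inferred from the data, not confirmed by the asker, and alt.pattern is a hypothetical name. It uses base R only, with [0-9] in place of \d.

```r
# Three rows from the sample data, including the two irregular ones.
lines <- c("1241105.41129.97Y317052.03",
           "328192.89117.68Y295050.51",
           "3221118.73161.03Y349349.5")

# Assumed layout: ethnic(1) age(2) smoke(1) preweight(2-3 digits . 2)
# delweight(3 . 2) breastfed(Y/N) brthwght(4) brthlngth(2 . 1-2).
# Anchoring with ^ and $ makes a partial or shifted match fail loudly.
alt.pattern <- paste0("^([0-9])([0-9]{2})([0-9])([0-9]{2,3}\\.[0-9]{2})",
                      "([0-9]{3}\\.[0-9]{2})([YN])([0-9]{4})",
                      "([0-9]{2}\\.[0-9]{1,2})$")

matches <- regmatches(lines, regexec(alt.pattern, lines))

# Every line should give the full match plus 8 captured fields.
stopifnot(all(lengths(matches) == 9))

# Stack the matches into a matrix and drop the full-match column.
captured <- do.call(rbind, matches)[, -1, drop = FALSE]
result <- as.data.frame(captured, stringsAsFactors = FALSE)
names(result) <- c("ethnic", "age", "smoke", "preweight", "delweight",
                   "breastfed", "brthwght", "brthlngth")

# Convert the numeric fields from character.
result$age       <- as.integer(result$age)
result$preweight <- as.numeric(result$preweight)
result$delweight <- as.numeric(result$delweight)
result$brthwght  <- as.integer(result$brthwght)
result$brthlngth <- as.numeric(result$brthlngth)
result
```

If the original file is truly fixed-width (for example, with a space in front of 92.89 that was lost when the data was pasted), reading it by column widths would be the more direct route. The regex variant is only useful when the widths vary as they do in the rows shown here.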