Question

This is my text file:

1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12

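All of these values run together, and each character stands for something. I can't figure out how to split each line into columns and assign headers to all of the new columns that would be created.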

I used this code, but it doesn't seem to work.

birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))

Any help would be greatly appreciated.

Answer 1

Assuming you have a clear definition for each column, you can solve this directly with a regular expression.

From the column names and the sample data, I guess that the regular expressions matching each field are:

ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}

We can put all of these together into a single regular expression that captures each field:

reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
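
As a minimal sketch of how the combined pattern relates to the per-field patterns listed above, it can be assembled with paste0 by wrapping each piece in a capture group. The name field_patterns is hypothetical and used only for illustration; the field definitions themselves are the guesses given above.

```r
# Per-field patterns, in the order the fields appear on each line.
# "field_patterns" is an illustrative name, not part of the original answer.
field_patterns <- c(
  ethnic    = "\\d{1}",
  age       = "\\d{1,2}",
  smoke     = "\\d{1}",
  preweight = "\\d{3}\\.\\d{2}",
  delweight = "\\d{3}\\.\\d{2}",
  breastfed = "Y|N",
  brthwght  = "\\d{3}",
  brthlngth = "\\d{3}\\.\\d{1,2}"
)

# Wrap every pattern in parentheses and concatenate without a separator.
combined <- paste0("(", field_patterns, ")", collapse = "")

# The hand-written expression from the answer, for comparison.
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
same_pattern <- identical(combined, reg.expression)
print(same_pattern)
```

Building the expression this way keeps each field definition in one place, so a single guess can be adjusted without retyping the whole pattern.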

Note: in R string literals we need to escape the backslash "\", which is why we write \\d instead of \d.
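
A small sketch of what the escaping means in practice: the string literal "\\d" holds only two characters, a backslash and the letter d, and that two-character string is what the regex engine receives.

```r
# "\\d" in R source code is the two-character string: backslash, d.
pattern <- "\\d"
n_chars <- nchar(pattern)

# The regex engine sees \d and matches any digit.
has_digit <- grepl(pattern, c("abc", "a1c"))

print(n_chars)
print(has_digit)
```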

That said, here is the code that solves the problem.

First, you need to read in your lines as strings:

lines <- readLines("birthweighthw1.txt")

Now we define the regular expression and use the function str_match from the package stringr to convert the data into a character matrix.

require(stringr)

reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"

captured <- str_match(string= lines, pattern= reg.expression)

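You can check that the first column of the matrix contains the full matched text and that the following columns contain the captured data. So we can get rid of the first column: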

captured <- captured[,-1]
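
To illustrate the "full match first, captured groups after" layout without depending on stringr, here is a rough base R sketch on the first line of the sample data. It uses regexec and regmatches as an assumed stand-in for str_match, and the variable names are illustrative.

```r
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
sample_line <- "1241105.41129.97Y317052.03"

# regexec returns match positions; regmatches extracts the matched substrings.
m <- regmatches(sample_line, regexec(reg.expression, sample_line, perl = TRUE))[[1]]

full_match <- m[1]   # the whole matched text
fields     <- m[-1]  # the eight captured groups

print(full_match)
print(fields)
```

Dropping the first element here plays the same role as dropping the first column of the matrix.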

and convert it to a data.frame with appropriate column names:

result <- as.data.frame(captured,stringsAsFactors = FALSE)

names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")

Now every column in result is of type character, and you can convert each of them to another type. For example:

require(dplyr)

result <- result %>% mutate(ethnic=as.factor(ethnic),
                            age=as.integer(age),
                            smoke=as.factor(smoke),
                            preweight=as.numeric(preweight),
                            delweight=as.numeric(delweight),
                            breastfed=as.factor(breastfed),
                            brthwght=as.integer(brthwght),
                            brthlngth=as.numeric(brthlngth)
                            )
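
Since the field definitions are guesses, it may be worth spot-checking lines that look different from the rest. As a sketch, the line 328192.89117.68Y295050.51 from the sample data has only two digits before the first decimal point. The pattern still matches it, but only by giving age a single digit. Whether that split is the intended one is an assumption that has to be checked against the file's specification. The example uses base R instead of stringr to stay self-contained.

```r
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
col_names <- c("ethnic", "age", "smoke", "preweight", "delweight",
               "breastfed", "brthwght", "brthlngth")

# A line from the sample data that is one character shorter than most others.
odd_line <- "328192.89117.68Y295050.51"

m <- regmatches(odd_line, regexec(reg.expression, odd_line, perl = TRUE))[[1]]
odd_fields <- setNames(m[-1], col_names)

# Inspect how the leading digits were distributed across the fields.
print(odd_fields)
```

If the split for such lines is not what the data dictionary says, the per-field patterns need adjusting before the type conversions are trusted.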
