When I run this read.table call, some values are not imported correctly:
hs.industry <- read.table("https://download.bls.gov/pub/time.series/hs/hs.industry", header = TRUE, fill = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE)
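Specifically, for a few rows the industry_code and industry_name end up concatenated as a single value in the industry_code column (I'm not sure why). Given that every industry_code is 4 digits, my approach to splitting and correcting them is: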
for (i in 1:nrow(hs.industry)) {
  if (isTRUE(nchar(hs.industry$industry_code[i]) > 4)) {
    hs.industry$industry_name[i] <- gsub("[[:digit:]]", "", hs.industry$industry_code[i])
    hs.industry$industry_code[i] <- gsub("[^0-9]", "", hs.industry$industry_code[i])
  }
}
This feels overly elaborate, but I'm not sure what approach would be better.
Thanks!
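Answer 0 (score: 4)
The problem is that lines 29 and 30 of the file (lines 28 and 29 if we don't count the header) are malformed. They use 4 spaces instead of the proper tab character between the code and the name, so read.table sees a single field there. Some extra data cleaning is needed.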
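One way to confirm which lines are malformed is to count the tab-separated fields on each raw line and flag the lines whose count differs from the header's. The sketch below runs on a small made-up sample rather than the real file; the `raw_lines` values and the names `n_fields` and `bad_lines` are assumptions for illustration only.

```r
# Hypothetical sample: the third line uses 4 spaces where a tab should be
raw_lines <- c(
  "industry_code\tindustry_name",
  "0001\tFirst industry",
  "0002    Second industry"
)

# Number of tab-separated fields on each line
n_fields <- lengths(strsplit(raw_lines, "\t", fixed = TRUE))

# Lines whose field count differs from the header line
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

On the real data, `raw_lines` would presumably come from `readLines()` on the URL in the question. Because the question uses `fill = TRUE`, short lines are padded silently, which is why a check like this can be useful.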
Use readLines to read the raw text, fix the malformed lines, and then read in the cleaned table:
# read in each line of the file as one element of a character vector
hs.industry <- readLines('https://download.bls.gov/pub/time.series/hs/hs.industry')
# replace any run of 4 or more spaces with a tab character
# (a literal space is matched here; '\\W' would also match tabs and punctuation)
hs.industry <- gsub(' {4,}', '\t', hs.industry)
# collapse the vector into a single string, with lines separated by a newline (\n)
hs.industry <- paste(hs.industry, collapse = '\n')
# read in the cleaned table
hs.industry <- read.table(text = hs.industry, sep = '\t', header = TRUE, quote = '')
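The same clean-then-parse steps can be tried offline on a small inline sample. This is only a sketch: the sample rows, the `display_level` column, and the names `fixed` and `cleaned` are made up for illustration, and the pattern matches literal spaces only. The `colClasses` argument is an optional extra, assuming you want the 4-digit codes kept as text so that any leading zeros survive.

```r
# Hypothetical sample with one malformed row (4 spaces instead of a tab)
raw_lines <- c(
  "industry_code\tindustry_name\tdisplay_level",
  "0001\tFirst industry\t1",
  "0002    Second industry\t1"
)

# Turn any run of 4 or more spaces into a tab
fixed <- gsub(" {4,}", "\t", raw_lines)

# Parse the repaired text; keep industry_code as character, let other columns be inferred
cleaned <- read.table(
  text = paste(fixed, collapse = "\n"),
  sep = "\t", header = TRUE, quote = "",
  stringsAsFactors = FALSE,
  colClasses = c(industry_code = "character")
)
str(cleaned)
```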
Answer 1 (score: 1)
You don't have to loop over every row. Instead, identify only the problematic entries and apply gsub to just those:
replace_indx <- which(nchar(hs.industry$industry_code) > 4)
hs.industry$industry_name[replace_indx] <- gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
hs.industry$industry_code[replace_indx] <- gsub("\\D+", "", hs.industry$industry_code[replace_indx])
I also used "\\d+\\s+" to improve the string replacement, so that the whitespace after the code is removed as well:
gsub("[[:digit:]]","",hs.industry$industry_code[replace_indx])
# [1] " Dimension stone" " Crushed and broken stone"
gsub("\\d+\\s+", "", hs.industry$industry_code[replace_indx])
# [1] "Dimension stone" "Crushed and broken stone"
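A possible variation, assuming every fused value starts with exactly 4 digits followed by whitespace, is to anchor the pattern and use a capture group, so that digits appearing later in a name are left alone. The sample vector and the names `codes`, `is_fused`, `industry_code` and `industry_name` below are made up for illustration.

```r
# Hypothetical values of the industry_code column; the first and third are fused
codes <- c("0001    First industry", "0002", "0003 Third industry 2")

# Entries longer than 4 characters carry a name after the code
is_fused <- nchar(codes) > 4

# Pull out the text after the leading 4-digit code and the whitespace
industry_name <- rep(NA_character_, length(codes))
industry_name[is_fused] <- sub("^\\d{4}\\s+(.*)$", "\\1", codes[is_fused])

# Keep only the leading 4 digits as the code
industry_code <- sub("^(\\d{4}).*$", "\\1", codes)
```

Unlike gsub("\\D+", "", ...), which keeps every digit in the string, the anchored pattern keeps only the leading code, so a name such as "Third industry 2" would not leak its trailing digit into industry_code.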