How can I read this file, a .txt file on the web, and convert its contents into a .csv file? (Either Python or R is fine.)
page <- readLines('https://www.nass.usda.gov/Data_and_Statistics/County_Data_Files/Frequently_Asked_Questions/county_list.txt')
page <- page[13:4079]
df <- data.frame(matrix(ncol = 5, nrow = 4067))
col_names <- c("state", "district", "county", "state_county", "history")
colnames(df) <- col_names
for (row_count in 1:4067) {
  df[row_count, 1] <- unlist(strsplit(page[row_count], " "))[1]
  df[row_count, 2] <- unlist(strsplit(page[row_count], " "))[4]
  df[row_count, 3] <- unlist(strsplit(page[row_count], " "))[7]
  df[row_count, 4] <- unlist(strsplit(unlist(strsplit(page[row_count], " "))[10], "\t"))[1]
  df[row_count, 5] <- unlist(strsplit(unlist(strsplit(page[row_count], " "))[10], "\t"))[7]
}
Some of the names in the fourth column consist of multiple words, and the amount of whitespace between the fourth and fifth columns varies, which breaks my code!
Answer 0 (score: 2)
Here is a base R solution:
# locate the first real data row: three numeric codes, a one-word name, then a single history digit
dataStart <- min(which(grepl('^\\d+\\s+\\d+\\s+\\d+\\s+\\w+\\s+\\d$', page, perl = TRUE)))
pageDat <- page[dataStart:length(page)]
# collapse every run of two or more whitespace characters (spaces or tabs) into a single delimiter
pageDat <- gsub("\\s{2,}", ";", pageDat, perl = TRUE)
# split on that delimiter and stack the pieces row by row into a character matrix
pageDat <- do.call(rbind, strsplit(pageDat, split = ";"))
# yields
head(pageDat)
[,1] [,2] [,3] [,4] [,5]
[1,] "01" "00" "000" "Alabama" "1"
[2,] "01" "10" "033" "Colbert" "1"
[3,] "01" "10" "057" "Fayette" "2"
[4,] "01" "10" "059" "Franklin" "1"
[5,] "01" "10" "075" "Lamar" "2"
[6,] "01" "10" "077" "Lauderdale" "1"
where page is as defined above (the readLines() output from the question).
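If a data.frame with the column names from the question is wanted, the character matrix can be converted in one step (a small follow-up sketch; the object name df is only illustrative):
df <- as.data.frame(pageDat, stringsAsFactors = FALSE)  # pageDat is the matrix built above
colnames(df) <- c("state", "district", "county", "state_county", "history")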
Answer 1 (score: 2)
Another base solution, more verbose than @nate.edwinton's (the explanations are in the code comments):
# read the file as fixed field width file
page <- read.fwf( "https://www.nass.usda.gov/Data_and_Statistics/County_Data_Files/Frequently_Asked_Questions/county_list.txt",
widths = c( 5, 5, 6, 45 ), skip = 12, sep = "^" )
# remove the last line containing NAs
page <- page[ -length( page[,1] ), ]
# convert factors to character
page[ , 4 ] <- as.character( page[ , 4 ] )
# the last character of the fourth field contains the history, move it to a separate variable
page[ 5 ] <- substr( page[ , 4 ], nchar( page[ , 4 ] ), nchar( page[ , 4 ] ) )
# set the column names
colnames( page )[] <- c( "state", "district", "county", "state_county", "history" )
# remove the history info from the state_county field
page[ , 4 ] <- gsub( "[12]$", "", page[ , 4 ] )
# get rid of the tabs
page[ , 4 ] <- gsub( "\t", "", page[ , 4 ] )
# format the output as in the original file (may be necessary or not)
page[ , 1 ] <- sprintf( "%02i", page[ , 1 ] )
page[ , 2 ] <- sprintf( "%02i", page[ , 2 ] )
page[ , 3 ] <- sprintf( "%03i", page[ , 3 ] )
This gives us:
head( page, 15 )
state district county state_county history
1 01 00 000 Alabama 1
2 01 10 033 Colbert 1
3 01 10 057 Fayette 2
4 01 10 059 Franklin 1
5 01 10 075 Lamar 2
6 01 10 077 Lauderdale 1
7 01 10 079 Lawrence 1
8 01 10 083 Limestone 1
9 01 10 089 Madison 1
10 01 10 093 Marion 1
11 01 10 103 Morgan 1
12 01 10 133 Winston 1
13 01 10 888 D10 Combined Counties 1
14 01 10 999 D10 Northern Valley 1
15 01 20 009 Blount 1
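Since the question ultimately asks for a .csv file, the cleaned data.frame can be written out directly (a one-liner sketch; the output file name is just an example):
write.csv( page, "county_list.csv", row.names = FALSE )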
Answer 2 (score: 1)
Does something like this help?
library(tidyverse)
url <- "https://www.nass.usda.gov/Data_and_Statistics/County_Data_Files/Frequently_Asked_Questions/county_list.txt"
df <- read_lines(url, skip = 12) %>%
data.frame(col = .) %>%
separate(col, into = paste0("X", 1:5), sep = "\\s{2,}", extra = "drop") %>%
na.omit()
head(df)
# X1 X2 X3 X4 X5
#1 01 00 000 Alabama 1
#2 01 10 033 Colbert 1
#3 01 10 057 Fayette 2
#4 01 10 059 Franklin 1
#5 01 10 075 Lamar 2
#6 01 10 077 Lauderdale 1
Explanation:
readr::read_lines reads the file line by line (skipping the 12 header lines), and the lines are placed in the single column col of a data.frame; tidyr::separate then splits col into the five columns X1...X5 wherever two or more whitespace characters occur (extra pieces are dropped), and na.omit() removes any rows that could not be split into all five fields.
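If the descriptive column names from the question are preferred over X1...X5, they can be supplied via the into = argument of separate() or renamed afterwards, e.g. (a sketch; the new names simply mirror those used in the question):
df <- rename(df, state = X1, district = X2, county = X3, state_county = X4, history = X5)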