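I have an existing .R file that simply reads a CSV file and plots the data. I tried adding a few lines to grab and print the top 5 rows by frequency, but I am getting strange results.
Here is the code: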
require (stringr)
generate_names <-function ( gender, name) {
genderfn<-paste(gender,"_names.csv",sep="",collapse=NULL)
fn <- paste("../datasets/Ontario_names/", genderfn,sep="",collapse = NULL)
A <- read.csv(fn, skip=1, header=TRUE)
print(dim(A))
# Recode the Frequency measurement to be certain it is an integer
A$Frequency <- as.integer(A$Frequency)
#pdf(paste(name, ".pdf",sep="", collapse =NULL))
#generate a logical vector of matching names
g <- stringr::str_trim(A$Name)==toupper(name)
#use the logical vector to create a smaller data frame
name.df <- A[g,]
#my little addition
ordered <- name.df[order(A$Frequency, decreasing = F),]
top5 <- head( ordered, 50)
print(top5)
#plot the distribution of name registrations over years
plot(name.df$Year,name.df$Frequency,
type="p",
main=paste(toupper(name)," in Ontario"),
xlab="Birth Year", ylab = "Number",
xlim=c(min(name.df$Year),max(name.df$Year)),
ylim=c(0,max(name.df$Frequency)) )
#grid()
#dev.off()
}
# Replace the gender and names and try some different names
generate_names("male","grant")
generate_names("female","mary")
The output is a bit odd. Here are snippets from the two function calls:
> generate_names("male","grant")
[1] 66351 3
Year Name Frequency
26720 1917 GRANT 25
26729 1926 GRANT 36
26733 1930 GRANT 36
26734 1931 GRANT 33
26735 1932 GRANT 36
26737 1934 GRANT 47
26738 1935 GRANT 45
26740 1937 GRANT 43
26741 1938 GRANT 46
26743 1940 GRANT 51
26744 1941 GRANT 67
26765 1962 GRANT 157
26771 1968 GRANT 132
26774 1971 GRANT 93
26776 1973 GRANT 89
26783 1980 GRANT 69
NA NA <NA> NA
NA.1 NA <NA> NA
NA.2 NA <NA> NA
NA.3 NA <NA> NA
> generate_names("female","mary")
[1] 83035 3
Year Name Frequency
57032 1955 MARY 572
57060 1983 MARY 579
57063 1986 MARY 390
NA NA <NA> NA
NA.1 NA <NA> NA
NA.2 NA <NA> NA
NA.3 NA <NA> NA
The rows at the top of each output are not even the ones with the highest frequency.
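To narrow this down, here is a minimal sketch on a small made-up data frame (the values below are invented for illustration and are not from the Ontario CSV). It seems to reproduce the NA rows, which makes me suspect that order(A$Frequency) returns row positions for the whole of A while I use them to index the much smaller name.df. The "good" variant, which orders by the subset's own column with decreasing = TRUE and takes head(..., 5), is only my guess at what I should have written.

```r
# Toy data standing in for the CSV (made-up values, illustration only)
A <- data.frame(
  Year      = c(2001, 2001, 2002, 2002, 2003, 2003),
  Name      = c("GRANT", "MARY", "GRANT", "MARY", "GRANT", "MARY"),
  Frequency = c(10L, 500L, 30L, 400L, 20L, 600L)
)

# Subset for one name, as in the function (3 of the 6 rows)
name.df <- A[A$Name == "GRANT", ]

# Same pattern as my addition: positions are computed from all of A
# but are used to index the 3-row subset, so positions 4..6 yield NA rows
bad <- name.df[order(A$Frequency, decreasing = FALSE), ]

# Assumed fix: order by the subset's own column, largest first, keep 5 rows
good <- name.df[order(name.df$Frequency, decreasing = TRUE), ]
top  <- head(good, 5)

print(bad)
print(top)
```

If that guess is right, the same change inside generate_names would be to replace A$Frequency with name.df$Frequency in the order() call, set decreasing = TRUE, and pass 5 rather than 50 to head().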