Scraping multiple aspx pages with R

Asked: 2018-06-14 22:27:01

Tags: r web-scraping

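I am a linguistics student running experiments in R. I have been reading other questions and they have helped a lot, but I am now stuck because I cannot adapt the example functions to my case, and I would appreciate some help.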

First, I want to start from every semester listed here: http://registration.boun.edu.tr/schedule.htm, and every department listed here: http://registration.boun.edu.tr/scripts/schdepsel.asp

Generating the list is actually fairly easy, because the final links look like this: http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY
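
Since the link is just three query parameters glued together, building it can be sketched as a small helper. This is only an illustration: `build_schedule_url` is a made-up name, and the escaping rules (spaces become `+`, `&` becomes `%26`) are taken from the example link and from the code further down the thread rather than from any site documentation.

```r
# Hypothetical helper (not part of the original post): builds one schedule URL
# from a semester, a department short name and the full department name.
build_schedule_url <- function(semester, short_name, department) {
  # Escape literal ampersands first, then turn spaces into "+"
  dept <- gsub("&", "%26", department, fixed = TRUE)
  dept <- gsub(" ", "+", dept, fixed = TRUE)
  paste0(
    "http://registration.boun.edu.tr/scripts/sch.asp",
    "?donem=", semester,
    "&kisaadi=", short_name,
    "&bolum=", dept
  )
}

# Rebuild the example link quoted in the question
example_url <- build_schedule_url(
  "2017/2018-3", "ATA",
  "ATATURK INSTITUTE FOR MODERN TURKISH HISTORY"
)
```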

Second, I need to select each course's code, name, days and hours, and tag the semester, which I did. (I probably did it very badly, but I did it anyway!)

library("rvest")
library("dplyr")
library("magrittr")

# define the html
reg <- read_html("http://registration.boun.edu.tr/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY")

# make the html a list of tables
regtable <- reg %>% html_table(fill = TRUE) 

# tag their year
regtable[[4]][ ,15] <- regtable[[1]][1,2]
regtable[[4]][1,15] <- "Semester"

# Change the Days and Hours to something usable, but how and to what?
  # parse the dates, T and Th problem?
  # parse the hour 10th hour problem?

# get the necessary info
regtable <- regtable %>% .[4] %>%  as.data.frame() %>% select( . , X1 , X3 , X8 , X9 , V15)

# correct the names
names(regtable) <- regtable[1,]
regtable <- regtable[-1,]
View(regtable)
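
For the open question in the comments about Days and Hours, one possible approach is sketched below. It rests on assumptions that are not confirmed anywhere in this thread: that Days is a run of codes such as `M`, `T`, `W`, `Th`, `F`, `St`, `S`, and that Hours is a run of slot numbers with exactly one slot per day, where a slot can have two digits (the "10th hour problem"). `split_days`, `split_hours` and the `max_slot` limit are made-up names and values.

```r
# Hypothetical sketch: split a Days string into day codes.
# Two-letter codes come first so "Th" is not read as "T" followed by "h".
split_days <- function(days) {
  regmatches(days, gregexpr("Th|St|M|T|W|F|S", days, perl = TRUE))[[1]]
}

# Hypothetical sketch: split an Hours string into exactly n_slots numbers.
# Every way of cutting the digits into 1- or 2-digit pieces is tried; the
# result is NA when there is no valid split or more than one.
split_hours <- function(hours, n_slots, max_slot = 12L) {
  digits <- strsplit(hours, "", fixed = TRUE)[[1]]
  walk <- function(pos, left) {
    if (left == 0L) {
      # Valid only if every digit has been consumed
      if (pos > length(digits)) return(list(integer(0))) else return(list())
    }
    out <- list()
    for (width in 1:2) {
      end <- pos + width - 1L
      if (end > length(digits)) next
      piece <- paste(digits[pos:end], collapse = "")
      value <- as.integer(piece)
      if (startsWith(piece, "0") || value > max_slot) next
      for (rest in walk(end + 1L, left - 1L)) {
        out[[length(out) + 1L]] <- c(value, rest)
      }
    }
    out
  }
  candidates <- walk(1L, n_slots)
  if (length(candidates) == 1L) candidates[[1]] else NA_integer_
}

days <- split_days("TTh")
hours <- split_hours("910", length(days))
```

Strings that can be cut in more than one way (for example `"112"` over two days) come back as `NA` rather than being guessed, so those rows would still need a manual check.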

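The problem is that I want to write a function that does this for 20+ semesters and 50+ departments. Any help would be great! I am doing this so that I can optimize class hours for my department.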

I think I could do this better with the XML package, but I could not figure out how to use it.

Thanks for your help, Utku

2 answers:

Answer 0 (score: 2)

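This answer builds on what you have already done. There may be more efficient solutions, but it should be a good start. You also did not say how you want to store the data, so for now each combination of semester and department is assigned to its own data frame, which creates a very large number of data frames given the number of departments. That is not ideal, but I do not know how you plan to use the data once it is collected.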

library("rvest")
library("dplyr")
library("magrittr")

# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
# Take the read html and identify all objects of class menu2 and extract the
# href which will give you the final part of the url
dep_list <- dep_list %>% 
    html_nodes(xpath = '//*[@class="menu2"]') %>%
    html_attr("href")

department_list <- gsub("/scripts/sch.asp?donem=", "", dep_list, fixed = TRUE)

# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))

# Loop through the list of departments and within each department loop through the
# list of semesters to get the data you want
for(dep in department_list){
    for(sem in semester_list){
        url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, dep, sep = "")
        reg <- read_html(url)

        # make the html a list of tables
        regtable <- reg %>% html_table(fill = TRUE) 
        # The data we want is in the 4th portion of the created list so extract that
        regtable <- regtable[[4]]
        # Rename the column headers to the values in the first row and remove the
        # first row
        regtable <- setNames(regtable[-1, ], regtable[1, ])

        # Create semester column and select the variables we want
        regtable <- regtable %>% 
          mutate(Semester = sem) %>% 
          select(Code.Sec, Name, Days, Hours, Semester)

        # Assign the created table to a dataframe
        # Could also save the file here instead
        assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
    }
}
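
If one data frame per semester/department pair turns out to be unwieldy, an alternative is to collect the pieces in a list and stack them at the end. The sketch below is only an illustration of that idea: `collect_schedules` and `fake_fetch` are made-up names, `fetch_one` stands in for the scraping code inside the loop above, and the semester strings are assumed to look like the `2017/2018-3` value in the question.

```r
# Hypothetical sketch: `fetch_one(sem, dep)` must return a data frame for one
# semester/department pair; here it is passed in so the scraping code can be
# swapped for an offline stand-in.
collect_schedules <- function(departments, semesters, fetch_one) {
  grid <- expand.grid(dep = departments, sem = semesters,
                      stringsAsFactors = FALSE)
  pieces <- lapply(seq_len(nrow(grid)), function(i) {
    out <- fetch_one(grid$sem[i], grid$dep[i])
    # Tag every row so the source is still known after stacking
    out$Department <- rep(grid$dep[i], nrow(out))
    out
  })
  do.call(rbind, pieces)
}

# Offline stand-in so the sketch runs without touching the website
fake_fetch <- function(sem, dep) {
  data.frame(Name = paste("Course in", dep), Semester = sem,
             stringsAsFactors = FALSE)
}

all_schedules <- collect_schedules(
  c("ATA", "HIST"),
  c("2017/2018-1", "2017/2018-2"),
  fake_fetch
)
```

With everything in one data frame, filtering by `Semester` or `Department` replaces looking up dozens of separately named objects.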

Answer 1 (score: 1)

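Thanks to @Amanda I was able to achieve what I wanted. What remains is to scrape the list of short names, match them to the departments, and finish the whole thing, but I can already do what I want by creating the lists myself. Any further comments on doing this more elegantly are appreciated!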
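
For the remaining step of matching short names to departments, one option might be to read both from the same link instead of keeping two hand-made vectors in the same order. The sketch below assumes the department links have the same shape as the example link in the question (`donem`, `kisaadi` and `bolum` query parameters); `parse_department_href` is a made-up helper, not an rvest function.

```r
# Hypothetical sketch: split the query string of one department link into a
# named character vector, e.g. c(donem = ..., kisaadi = ..., bolum = ...).
parse_department_href <- function(href) {
  # Drop everything up to and including the first "?"
  query <- sub("^[^?]*\\?", "", href)
  pairs <- strsplit(strsplit(query, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- vapply(pairs, function(p) if (length(p) > 1) p[2] else "",
                   character(1))
  names(values) <- vapply(pairs, function(p) p[1], character(1))
  values
}

example <- parse_department_href(
  "/scripts/sch.asp?donem=2017/2018-3&kisaadi=ATA&bolum=ATATURK+INSTITUTE+FOR+MODERN+TURKISH+HISTORY"
)
short_name <- example[["kisaadi"]]
department <- example[["bolum"]]
```

Because the short name and the department come out of the same link, they cannot drift out of step the way two separately indexed vectors can.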

library("rvest")
library("dplyr")
library("magrittr")

# Create a Department list
dep_list <- read_html("http://registration.boun.edu.tr/scripts/schdepsel.asp")
dep_list <- dep_list %>% html_table(fill = TRUE)
# Select the table from the html that contains the data we want
department_df <- dep_list[[2]]
# Rename the columns with the value of the first row and remove row
department_df <- setNames(department_df[-1, ], department_df[1, ])
# Combine the two columns into a list
department_list <- c(department_df[, 1], department_df[, 2])
# Edit the department list
# We can choose accordingly.
department_list <- department_list[c(7,8,16,20,26,33,36,37,38,39)]


# Create a list for all of the semesters
sem_list <- read_html("http://registration.boun.edu.tr/schedule.htm")
sem_list <- sem_list %>% html_table(fill = TRUE)
# Extract the table from the list needed
semester_df <- sem_list[[2]]
# The website uses a table for the dropdown but the values are all in the second cell
# of the second column as a string
semester_list <- semester_df$X2[2]
# Separate the string into a list at the space characters
semester_list <- unlist(strsplit(semester_list, "\\s+"))
# Shortnames string
# We can add whichever we want.
shortname_list <- c("FLED", "HIST" , "PSY", "LL" , "PA" , "PHIL" , "YADYOK" , "SOC" , "TR" , "TKL" )
# Loop over the selected departments (by position, so department_list and
# shortname_list must be in the same order) and over every semester
for (i in seq_along(department_list)) {
  for(sem in semester_list){tryCatch({
    dep <- department_list[i]
    sn <- shortname_list[i]
    # Escape "&" as "%26" and spaces as "+" in the department name, then build the query string
    url_second_part <- paste0("&kisaadi=", sn, "&bolum=", gsub(" ", "+", gsub("&", "%26", dep, fixed = TRUE), fixed = TRUE))
    url <- paste("http://registration.boun.edu.tr/scripts/sch.asp?donem=", sem, url_second_part, sep = "")
    reg <- read_html(url)

    # make the html a list of tables
    regtable <- reg %>% html_table(fill = TRUE) 
    # The data we want is in the 4th portion of the created list so extract that
    regtable <- regtable[[4]]
    # Rename the column headers to the values in the first row and remove the
    # first row
    regtable <- setNames(regtable[-1, ], regtable[1, ])

    # Create semester column and select the variables we want
    regtable <- regtable %>% 
      mutate(Semester = sem) %>% 
      select(Code.Sec, Name, Days, Hours, Semester)

    # Assign the created table to a dataframe
    # Could also save the file here instead
    assign(paste("table", sem, gsub(" ", "_", dep), sep = "_"), regtable)
  }, error = function(e){cat("ERROR : No information on this" , url , "\n" )})
  }
}  

### Maybe make Errors another dataset or list too.
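
Following up on that last comment, the failures could be kept in their own data frame instead of only being printed. The sketch below shows one way to do that; `scrape_with_log` and `fake_scrape` are made-up names, and `scrape_one` stands in for the body of the loop above.

```r
# Hypothetical sketch: run `scrape_one` on every URL, keeping successful
# results in a named list and failures in a separate data frame.
scrape_with_log <- function(urls, scrape_one) {
  results <- list()
  errors <- data.frame(url = character(0), message = character(0),
                       stringsAsFactors = FALSE)
  for (u in urls) {
    # Return the condition object instead of stopping the loop
    outcome <- tryCatch(scrape_one(u), error = function(e) e)
    if (inherits(outcome, "error")) {
      errors <- rbind(errors, data.frame(url = u,
                                         message = conditionMessage(outcome),
                                         stringsAsFactors = FALSE))
    } else {
      results[[u]] <- outcome
    }
  }
  list(results = results, errors = errors)
}

# Offline stand-in that fails for one of the two inputs
fake_scrape <- function(u) {
  if (grepl("missing", u, fixed = TRUE)) stop("no schedule table found")
  data.frame(url = u, stringsAsFactors = FALSE)
}

run <- scrape_with_log(c("page-ok", "page-missing"), fake_scrape)
```

`run$errors` then lists which semester/department pages had no usable table, which can be reviewed after the whole run.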