Converting a Facebook .htm file into R

Time: 2017-02-13 02:27:34

Tags: r regex web-scraping rvest

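I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest helps by pulling the HTML nodes (user, meta, p) into vectors and then into a data frame. However, I am stuck on this part: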

<div class="thread">
    John, My Name
    <div class="message">
        <div class="message_header">
            <span class="user">My Name</span>
            <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
        </div>
    </div>
    <p>Hello, how are you today</p>


//Other <div class = "message">
//Other <div class = "thread"> 

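The "thread" class marks my conversation with one person, and the "message" class holds each message. The "user" class sometimes shows only "My Name" rather than "John" or "Jack", so I need to extract the string "John, My Name" as a separate variable and ignore all the text inside the nested "message" classes that follow it.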

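I suspect regex is what I need here. I also tried passing an XPath to html_nodes, but /html/body/div[**x**]/div[**y**]/div[**z**]/text() does not let me change the path dynamically to read every thread class (x, y and z vary, and the .htm file is 160 MB).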
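For illustration only, here is a minimal sketch of a relative XPath that matches every thread by class instead of by position, so x, y and z never have to be spelled out. The sample HTML and the names `html`, `doc` and `thread_names` are hypothetical, made up to mirror the structure shown above; the real export may differ.

```r
library(rvest)

# Hypothetical two-thread sample that mirrors the structure in the question
html <- '<html><body><div class="contents"><div>
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div>
  <p>Hello, how are you today</p>
</div>
<div class="thread">Jack, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jack</span>
    <span class="meta">Friday, April 10, 2015 at 1:00am UTC+07</span>
  </div></div>
  <p>Hi</p>
</div>
</div></body></html>'

doc <- read_html(html)

# "//div[@class = 'thread']" finds every thread wherever it sits in the tree;
# "text()[1]" keeps only the first text node directly under each thread,
# which skips the text inside the nested message divs
thread_names <- doc %>%
  html_nodes(xpath = '//div[@class = "thread"]/text()[1]') %>%
  html_text() %>%
  trimws()

print(thread_names)
```

Because the expression is relative, one call covers all threads and no loop over positions is needed.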

Any help is appreciated!

Edit: my code:

library(rvest)

# Despite its name, `url` holds the parsed document, not a URL
url <- read_html("messages.htm")

users <- html_nodes(x = url, css = ".user") %>% html_text()
date <- html_nodes(x = url, css = ".meta") %>% html_text()
# Repeat for the remaining fields

df <- cbind(users, date)  # plus the other vectors extracted the same way

# Extracting the names of the threads with XPath
threadget <- function(n) {
  html_text(html_node(url, xpath = sub("n", n, "/html/body/div[2]/div/div[n]/text()")))
}
thread <- character(553)  # preallocate the result vector
for (n in seq_len(553)) thread[n] <- threadget(n)

1 answer:

Answer 0: (score: 0)

This is my code after implementing @Jota's suggestion:

# Find the length of each thread (its number of child nodes) using html_children() and length()
threads <- html_nodes(url, css = ".thread")
threadlength <- sapply(threads, function(node) length(html_children(node)))
# Extract the name of each thread (its first text node) using XPath
threadlist <- html_nodes(url, xpath = '//div[@class = "thread"]/text()[1]') %>% html_text()

# Create the thread column
# x is the number of rows over which a thread name is repeated
# y is the running index into the thread column
# z counts the rows written for the current thread and ends the inner loop
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n + 1
  repeat {
    y <- y + 1
    z <- z + 1
    thread[y] <- threadlist[n]
    if (z == x) {
      break
    }
  }
}
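
The nested loop above repeats each thread name once per counted row, which base R's rep() can do in a single call. This is only a sketch of that equivalent form; the two values assigned to threadlist and threadlength are hypothetical stand-ins for the vectors built above.

```r
# Hypothetical stand-ins for the vectors extracted from the real file
threadlist <- c("John, My Name", "Jack, My Name")
threadlength <- c(3L, 2L)

# Repeat each thread name as many times as its thread has rows
thread <- rep(threadlist, times = threadlength)

print(thread)
```

One difference worth noting: rep() simply skips a thread whose length is 0, whereas the repeat loop would never reach z == x in that case and would not terminate.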