Question

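I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest helps by pulling the HTML nodes (user, meta, p) into vectors and then into a df. However, I am stuck on this part: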

<div class="thread">
    John, My Name
    <div class="message">
        <div class="message_header">
            <span class="user">My Name</span>
            <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
        </div>
    </div>
    <p>Hello, how are you today</p>


//Other <div class = "message">
//Other <div class = "thread">

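"thread" marks my conversation with one person, and "message" holds an individual message. The "user" class sometimes shows only "My Name" rather than "John" or "Jack", so I need to extract the string "John, My Name" as a separate variable and ignore all the text in the nested "message" classes that follow it.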

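I suspect a regular expression is what I need. I also tried using XPath with html_nodes, but /html/body/div[x]/div[y]/div[z]/text() does not let me change the xpath dynamically to read every thread class (x, y and z vary, and it is a 160 MB htm file).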

Any help is appreciated!

Edit: my code:

library(rvest)
library(XML)
url <- read_html("messages.htm")

# Pull each class into a character vector
users <- html_nodes(x = url, css = ".user") %>% html_text()
date <- html_nodes(x = url, css = ".meta") %>% html_text()
# Repeat for the other classes

df <- cbind(users, date)  # plus the remaining vectors

# Extracting the names of the threads with a positional XPath
threadget <- function(n) {
  html_text(html_node(url, xpath = sprintf("/html/body/div[2]/div/div[%d]/text()", n)))
}
thread <- character(553)  # initialise before filling by index
for (n in seq_len(553)) thread[n] <- threadget(n)

Answer 1

Here is my code after implementing @Jota's suggestion:

#Finding the length of each thread for looping using html_children() and length()
threads <- html_nodes(url, css = ".thread")
children <- lapply(threads, html_children)
threadlength <- sapply(children, length)
#Extracting the names of the thread using xpath
threadlist <- html_nodes(url, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()

#Creating the thread column
#x indicates how many rows a thread topic should be duplicated. 
#y is used to subset the thread column. 
#z is used to close the inner loop, moving to the next thread topic
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n+1
  repeat{
    y <- y+1
    z <- z+1
    thread[y] <- threadlist[n]
    if (z == x){
      break
    }
  }
}
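
For comparison, the same thread column can probably be built without an explicit loop by using rep(). The following is only a minimal sketch, not the code from the answer: the inline HTML is a made-up miniature of messages.htm, and the names doc, thread_names and n_messages are hypothetical. It assumes every message is a "message" div followed by a sibling p, as in the snippet in the question; a real export may be laid out differently. Under that layout html_children() would count both the message divs and the p elements, so the sketch counts ".message" nodes per thread instead.

```r
library(rvest)

# Hypothetical miniature of messages.htm, for illustration only
html <- '<html><body><div class="contents">
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div>
  <p>Hello, how are you today</p>
  <div class="message"><div class="message_header">
    <span class="user">John</span>
    <span class="meta">Thursday, April 9, 2015 at 12:56am UTC+07</span>
  </div></div>
  <p>Fine, thanks</p>
</div>
<div class="thread">Jack, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jack</span>
    <span class="meta">Friday, April 10, 2015 at 8:00am UTC+07</span>
  </div></div>
  <p>Hi</p>
</div>
</div></body></html>'

doc <- read_html(html)
threads <- html_nodes(doc, css = ".thread")

# First text node of each thread, i.e. the participant list,
# with the surrounding whitespace trimmed
thread_names <- trimws(html_text(
  html_nodes(doc, xpath = '//div[@class="thread"]/text()[1]')
))

# Number of messages in each thread (assumes one .message div per message)
n_messages <- vapply(
  threads,
  function(th) length(html_nodes(th, css = ".message")),
  integer(1)
)

# Repeat each thread name once per message so it lines up with the other columns
df <- data.frame(
  thread = rep(thread_names, times = n_messages),
  user   = html_text(html_nodes(doc, css = ".user")),
  date   = html_text(html_nodes(doc, css = ".meta")),
  text   = html_text(html_nodes(doc, css = ".thread > p")),
  stringsAsFactors = FALSE
)
```

rep(thread_names, times = n_messages) does the job of the nested for/repeat loop above, and it also copes with a thread that contains no messages, where the repeat loop would never reach its break condition. If a thread had no leading text node, thread_names and n_messages would no longer line up, so it is worth checking that their lengths match before building the data frame.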

Converting a Facebook htm file to R
