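I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest
has helped me pull the HTML nodes (user, meta, p) into vectors and then into a df. However, I am still stuck on this part: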
<div class="thread">
John, My Name
<div class="message">
<div class="message_header">
<span class="user">My Name</span>
<span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
</div>
</div>
<p>Hello, how are you today</p>
//Other <div class = "message">
//Other <div class = "thread">
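The "thread" class marks my conversation with one person, and the "message" class holds each message. Because the "user" class sometimes shows only "My Name" rather than "John" or "Jack", I need to extract the string "John, My Name" as a separate variable and ignore all the text in the nested "message" classes that follow.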
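I suspect a regular expression is what I need here. I also tried using XPath
with html_nodes, but /html/body/div[x]/div[y]/div[z]/text()
does not let me change the XPath dynamically to read every thread class (x, y, and z vary, and it is a 160 MB .htm file).
Any help is appreciated!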
Edit: my code:
library(rvest)
library(XML)
url <- read_html("messages.htm")
users <- html_nodes(x = url, css = ".user") %>% html_text()
date <- html_nodes(x = url, css = ".meta") %>% html_text()
# Repeat for the remaining nodes, then bind everything together
df <- cbind(users, date)  # plus the other vectors
# Extracting the names of the threads with XPath
threadget <- function(n) {
  # Substitute the thread index n into the XPath
  html_text(html_node(url, xpath = sub("n", n, "/html/body/div[2]/div/div[n]/text()")))
}
thread <- character(553)
for (n in seq_len(553)) {
  thread[n] <- threadget(n)
}
Answer 0 (score: 0)
Here is my code after implementing @Jota's suggestion:
# Finding the length of each thread for looping, using html_children() and length()
threads <- html_nodes(url, css = ".thread")
count <- sapply(threads, html_children)
threadlength <- sapply(count, length)
# Extracting the names of the threads using XPath
threadlist <- html_nodes(url, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()
# Creating the thread column
# x is how many rows a thread topic should be duplicated across.
# y is the running index used to fill the thread column.
# z counts rows within the current thread and closes the inner loop,
# moving on to the next thread topic.
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n + 1
  repeat {
    y <- y + 1
    z <- z + 1
    thread[y] <- threadlist[n]
    if (z == x) {
      break
    }
  }
}
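The repeat loop above copies each thread name once per row, which base R's rep() can also do in a single vectorized call. Below is a minimal sketch of that idea, not a tested replacement for the code above. It rests on several assumptions: the inline HTML string is a made-up miniature of the archive that follows the structure shown in the question, the names sample_html, thread_names and n_messages are hypothetical, and rows are counted as one per div of class "message" (rather than one per child of the thread) so that they line up with the .user and .meta vectors.

```r
library(rvest)

# Hypothetical miniature archive mirroring the structure in the question
sample_html <- '<html><body><div class="contents">
<div class="thread">John, My Name
<div class="message"><div class="message_header">
<span class="user">My Name</span>
<span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
</div></div>
<p>Hello, how are you today</p>
<div class="message"><div class="message_header">
<span class="user">John</span>
<span class="meta">Thursday, April 9, 2015 at 12:56am UTC+07</span>
</div></div>
<p>Fine, thanks</p>
</div>
<div class="thread">Jack, My Name
<div class="message"><div class="message_header">
<span class="user">Jack</span>
<span class="meta">Friday, April 10, 2015 at 8:00am UTC+07</span>
</div></div>
<p>Hi</p>
</div>
</div></body></html>'

doc <- read_html(sample_html)
threads <- html_nodes(doc, css = ".thread")

# First direct text node of each thread, i.e. the participant list,
# without any text from the nested message divs
thread_names <- trimws(html_text(html_node(threads, xpath = "./text()[1]")))

# Number of message divs that are direct children of each thread
n_messages <- vapply(
  threads,
  function(t) length(html_nodes(t, xpath = "./div[@class='message']")),
  integer(1)
)

# Repeat each thread name once per message; no explicit loop needed
thread <- rep(thread_names, times = n_messages)

users <- html_text(html_nodes(doc, css = ".user"))
date <- html_text(html_nodes(doc, css = ".meta"))
df <- data.frame(thread, users, date, stringsAsFactors = FALSE)
```

Unlike the repeat loop, rep() with a count of zero simply contributes no rows, so a thread without messages would not cause an endless loop. Whether counting message divs is the right row count for the real 160 MB file depends on how the other columns were built.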