R: remove punctuation between HTML tags

Time: 2014-07-30 09:25:47

Tags: html r string preprocessor tm

I am new to R, and I have to do text mining on a dataset of the following form:

</DOC>
<DOC>
<DATE>08/31/2006</DATE>
<AUTHOR>Roy</AUTHOR>
<TEXT>I recently bought an 2007 Volvo XC90 with the 3.2 6 cylinder motor</TEXT>
<FAVORITE>The seats</FAVORITE>
</DOC>

I need to preprocess the data, but only between the TEXT and FAVORITE tags, because I want to keep the HTML tags so that I can extract the date later.

How can I do this? The regular functions destroy the HTML tags:

reviews <- tm_map(reviews, removePunctuation);

Thanks

1 answer:

Answer 0 (score: 3)

Hi, you can use the XML package to parse your HTML file and then access the tags you want:

writeLines(text = "<DOC>
<DATE>08/31/2006</DATE>
<AUTHOR>Roy</AUTHOR>
<TEXT>I recently bought an 2007 Volvo XC90 with the 3.2 6 cylinder motor</TEXT>
<FAVORITE>The seats</FAVORITE>
</DOC>", con = "example.html")

# Parse the HTML file with XML
library(XML)
your_html <- xmlParse(file = "example.html")
your_html <- xmlToList(node = your_html)
your_html$TEXT
# [1] "I recently bought an 2007 Volvo XC90 with the 3.2 6 cylinder motor"
your_html$FAVORITE
# [1] "The seats"

# Do what you want on your corpus
library(tm)
corpus <- Corpus(VectorSource(c(your_html$TEXT, your_html$FAVORITE)))
corpus <- tm_map(corpus, removePunctuation)
inspect(corpus)
# <<VCorpus (documents: 2, metadata (corpus/indexed): 0/0)>>

# [[1]]
# <<PlainTextDocument (metadata: 7)>>
# I recently bought an 2007 Volvo XC90 with the 32 6 cylinder motor

# [[2]]
# <<PlainTextDocument (metadata: 7)>>
# The seats
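
If the goal is to keep the tags in place rather than pull the text out of them, a base R regex pass may also work. The following is only a minimal sketch, not part of the answer above: it assumes each element sits on a single line, exactly as in the sample, and `strip_punct_in_tags` is a hypothetical helper name. It strips `[[:punct:]]` characters inside the TEXT and FAVORITE tags only, so the DATE line stays untouched.

```r
# Hypothetical helper (not from tm or XML): remove punctuation only inside
# the given tags, leaving the tags themselves and all other lines unchanged.
# Assumes every <TAG>...</TAG> element fits on one line.
strip_punct_in_tags <- function(lines, tags = c("TEXT", "FAVORITE")) {
  for (tag in tags) {
    # Capture opening tag, inner text, and closing tag separately
    pattern <- sprintf("^(<%s>)(.*)(</%s>)$", tag, tag)
    hit <- grepl(pattern, lines)
    if (!any(hit)) next
    # Extract the inner text and drop punctuation from it
    inner <- sub(pattern, "\\2", lines[hit])
    inner <- gsub("[[:punct:]]+", "", inner)
    # Rebuild the line with the original tags around the cleaned text
    lines[hit] <- paste0("<", tag, ">", inner, "</", tag, ">")
  }
  lines
}

doc <- c(
  "<DOC>",
  "<DATE>08/31/2006</DATE>",
  "<AUTHOR>Roy</AUTHOR>",
  "<TEXT>I recently bought an 2007 Volvo XC90 with the 3.2 6 cylinder motor</TEXT>",
  "<FAVORITE>The seats</FAVORITE>",
  "</DOC>"
)

cleaned <- strip_punct_in_tags(doc)
writeLines(cleaned)
```

With real data you would read the file with `readLines()` first. If an element can span several lines, this line-based pattern will not match it, and parsing with the XML package as shown above is likely the safer route.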