Using R grepl to remove a line from an HTML file

Time: 2018-07-15 21:38:14

Tags: r grepl

I have an HTML document in an object named doc

> doc

<!DOCTYPE html>
<h1>Hello</h1>
<br>
<p>I am an html file</p>
<script myscript1 src="https://website.com/javascripts.js" type="text/javascript"></script>
<p>I am a paragraph</p>
<script myscript2 src="https://website2.com/function.js" type="text/javascript"></script>

My goal is to write an R function that removes from doc the line containing the script myscript1:

<script myscript1 src="https://website.com/javascripts.js" type="text/javascript"></script>

I tried the following code, but it does not work:

remove <- "<script myscript1 src="https://website.com/javascripts.js" type="text/javascript"></script>"
doc <- doc[!grepl(paste(remove), doc),]

Note: after removing myscript1, I still need to extract some elements from the document using XPath.

Can you help me? Thanks

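1 Answer:

Answer 0: (score: 1)

One approach is to first get the HTML file into R as a character vector and work on that. To do this, we can write the externalptr object (blob) out as a text HTML file and then read it back with the base function readLines. Consider: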

class MyTime:
    """Create some time"""

    def __init__(self, hrs=0, mins=0, sec=0):
        """Normalize the given time, via total seconds, into hours, minutes and seconds"""
        totalsecs = hrs*3600 + mins*60 + sec
        self.hours = totalsecs // 3600
        leftoversecs = totalsecs % 3600
        self.minutes = leftoversecs // 60
        self.seconds = leftoversecs % 60

    def __str__(self):
        # format as hours:minutes:seconds
        return '{0}:{1}:{2}'.format(self.hours, self.minutes, self.seconds)

    def to_seconds(self):
        # converts to only seconds
        return (self.hours * 3600) + (self.minutes * 60) + self.seconds

def between(t1,t2,x):
    t1seconds = t1.to_seconds()
    t2seconds = t2.to_seconds()
    xseconds = x.to_seconds()
    if t1seconds <= xseconds  < t2seconds:
        return True
    return False


currentTime = MyTime(0,0,0)
doneTime = MyTime(10,3,4)
x = MyTime(2,0,0)
print(between(currentTime,doneTime,x))
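
The approach the answer describes (write the document out as text, read it back with readLines, then drop the unwanted line with grepl) can be sketched in R as follows. This is a minimal sketch under stated assumptions, not the answerer's original code: the character vector html_lines stands in for the document from the question, the temporary file path is arbitrary, and the optional last step assumes the xml2 package is installed. Two details probably explain why the attempt in the question fails: the pattern contains unescaped double quotes inside a double-quoted string, and doc[..., ] uses matrix-style indexing on something that is not a matrix. Using single quotes for the string and fixed = TRUE (so the pattern is matched literally rather than as a regular expression) avoids both problems.

```r
# Stand-in for the document from the question (assumption: one element per line).
html_lines <- c(
  '<!DOCTYPE html>',
  '<h1>Hello</h1>',
  '<br>',
  '<p>I am an html file</p>',
  '<script myscript1 src="https://website.com/javascripts.js" type="text/javascript"></script>',
  '<p>I am a paragraph</p>',
  '<script myscript2 src="https://website2.com/function.js" type="text/javascript"></script>'
)

# Write the document to a text file and read it back as a character vector.
path <- tempfile(fileext = ".html")
writeLines(html_lines, path)
lines <- readLines(path)

# Single quotes let the string contain double quotes without escaping.
remove <- '<script myscript1 src="https://website.com/javascripts.js" type="text/javascript"></script>'

# fixed = TRUE matches the text literally instead of as a regular expression.
kept <- lines[!grepl(remove, lines, fixed = TRUE)]
print(kept)

# Optional: re-parse the filtered text for XPath queries (assumes xml2 is installed).
if (requireNamespace("xml2", quietly = TRUE)) {
  doc2 <- xml2::read_html(paste(kept, collapse = "\n"))
  print(xml2::xml_attr(xml2::xml_find_all(doc2, "//script"), "src"))
}
```

If doc is a parsed xml2 document rather than a character vector, it would first have to be serialized to a file (for example with xml2::write_html) before readLines can read it; the serialized line layout may differ from the original source, so matching on a shorter literal such as myscript1 may be more robust than matching the whole line.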