grep / sed / awk - extract a substring from HTML code

Date: 2015-04-04 17:39:13

Tags: regex awk sed grep raspberry-pi

I want to extract a value from HTML code like this:

<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:

As the result I only need the value: "53"

How can I do this with Linux command-line tools such as grep, awk, or sed? I want to use it on a Raspberry Pi...

I tried this, but it does not work:

root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
root@raspberrypi:/home/pi# 

1 Answer:

Answer 0: (score: 0)

Since HTML is not a flat text format, handling it with flat-text tools such as grep, sed, or awk is not recommended. Anything built this way tends to break as soon as the HTML is formatted slightly differently (for example: if the span node gets another attribute, or a line break is inserted somewhere).
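To illustrate the fragility, here is a rough example session (assuming GNU grep with -P support, as in the question): the pattern from the question matches the original markup, but silently finds nothing once the tag gains an extra attribute:

root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' <<< '<div>Luftfeuchte: <span id="wob_hm">53%</span></div>'
53
root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' <<< '<div>Luftfeuchte: <span class="a" id="wob_hm">53%</span></div>'
root@raspberrypi:/home/pi#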

Using something that is actually built to parse HTML is more robust (if more work). In this case I would consider Python, since its standard library contains a (basic) HTML parser. It would look roughly like this:

#!/usr/bin/python3

import html.parser
import re
import sys

# html.parser.HTMLParser provides the parsing functionality. It tokenizes
# the HTML into tags and what comes between them, and we handle them in the
# order they appear. With XML we would have nicer facilities, but HTML is not
# a very good format, so we're stuck with this.
class my_parser(html.parser.HTMLParser):
    def __init__(self):
        super(my_parser, self).__init__()
        self.data  = ''
        self.depth = 0

    # handle opening tags. Start counting, assembling content when a
    # span tag begins whose id is "wob_hm". A depth counter is maintained
    # largely to handle nested span tags, which is not strictly necessary
    # in your case (but will make this easier to adapt for other things and
    # is not more complicated to implement than a flag)
    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    # handle end tags. Make sure the depth counter is only positive
    # as long as we're in the span tag we want
    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    # when data comes, assemble it in a string. Note that nested tags would
    # not be recorded by this if they existed. It would be more work to
    # implement that, and you don't need it for this.
    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

# open the file whose name is the first command line argument. Do so as
# binary to get bytes from f.read() instead of a string (which requires
# the data to be UTF-8-encoded)
with open(sys.argv[1], "rb") as f:
    # instantiate our parser
    p = my_parser()

    # then feed it the file. If the file is not UTF-8, it is necessary to
    # convert the  file contents to UTF-8. I'm assuming latin1-encoded
    # data here; since the example looks German, "latin9" might also be
    # appropriate. Use the encoding in which your data is encoded.
    p.feed(f.read().decode("latin1"))

    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))
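
Assuming the script is saved as, say, extract_hm.py (the name is just a placeholder) and the HTML from the question is in test.txt, it could be run roughly like this:

root@raspberrypi:/home/pi# python3 extract_hm.py test.txt
53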

Addendum: here is a backport to Python 2, which sidesteps the encoding issue. For this case that is arguably nicer, because the encoding does not matter for the data we want to extract, and you do not have to know the encoding of the input file beforehand. The changes are minimal, and it works exactly the same way:

#!/usr/bin/python

from HTMLParser import HTMLParser
import re
import sys

class my_parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.data  = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            if ('id', 'wob_hm') in attrs:
                self.data = ''
                self.depth = 0
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span':
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

with open(sys.argv[1], "r") as f:
    p = my_parser()
    p.feed(f.read())
    print(re.compile('%$').sub('', p.data.strip()))
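
The Python 2 variant would be invoked the same way, just with the Python 2 interpreter (again, the script name is only a placeholder):

root@raspberrypi:/home/pi# python extract_hm2.py test.txt
53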