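I want to extract a value from HTML like this:

    <div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:

As a result, I only need the value: "53"

How can I do this with Linux command-line tools such as grep, awk, or sed? I want to use it on a Raspberry Pi...

I tried this, but it does not work: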

    root@raspberrypi:/home/pi# echo "<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:" >> test.txt
    root@raspberrypi:/home/pi# grep -oP '<span id="wob_hm">\K[0-9]+(?=%</span>)' test.txt
    root@raspberrypi:/home/pi#

Answer 0 (score: 0)
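Since HTML is not a flat text format, handling it with flat-text tools such as grep, sed, or awk is not advisable. If the format of the HTML changes just a little (for example, if the span node gets another attribute or a line break is inserted somewhere), anything built this way will tend to break.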
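
To make that fragility concrete, here is a minimal sketch in Python. The `extract` helper and the sample variants are made up for illustration, and the regex only approximates the grep pattern from the question (Python's `re` has no `\K`, so a capture group is used instead). The last sample is what the echo command in the question actually writes to test.txt: the shell removes the inner double quotes, so the file contains `id=wob_hm`, which is why the grep prints nothing.

```python
import re

# Approximation of the grep -oP pattern from the question (assumption:
# a capture group stands in for \K, which Python's re does not support).
PATTERN = re.compile(r'<span id="wob_hm">([0-9]+)(?=%</span>)')

def extract(html):
    """Return the digits matched by PATTERN, or None if nothing matches."""
    m = PATTERN.search(html)
    return m.group(1) if m else None

# Hypothetical variants of the markup from the question.
samples = {
    "original": '<div>Luftfeuchte: <span id="wob_hm">53%</span></div>',
    "extra attribute": '<span class="x" id="wob_hm">53%</span>',
    "line break": '<span id="wob_hm">\n53%</span>',
    "quotes removed by the shell": '<span id=wob_hm>53%</span>',
}

for name, html in samples.items():
    print(name, "->", extract(html))
```

Only the first sample yields 53; the other three yield None even though a browser would render all four the same way.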
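Using something that was built for parsing HTML is more robust (if more laborious). In this case I would consider Python, because it has a (basic) HTML parser in its standard library. It would look roughly like this: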

    #!/usr/bin/python3

    import html.parser
    import re
    import sys

    # html.parser.HTMLParser provides the parsing functionality. It tokenizes
    # the HTML into tags and what comes between them, and we handle them in the
    # order they appear. With XML we would have nicer facilities, but HTML is not
    # a very good format, so we're stuck with this.
    class my_parser(html.parser.HTMLParser):
        def __init__(self):
            super().__init__()
            self.data = ''
            self.depth = 0

        # handle opening tags. Start counting, assembling content when a
        # span tag begins whose id is "wob_hm". A depth counter is maintained
        # largely to handle nested span tags, which is not strictly necessary
        # in your case (but will make this easier to adapt for other things and
        # is not more complicated to implement than a flag)
        def handle_starttag(self, tag, attrs):
            if tag == 'span':
                if ('id', 'wob_hm') in attrs:
                    self.data = ''
                    self.depth = 1
                elif self.depth > 0:
                    # a span nested inside the one we want
                    self.depth += 1

        # handle end tags. Make sure the depth counter is only positive
        # as long as we're in the span tag we want
        def handle_endtag(self, tag):
            if tag == 'span' and self.depth > 0:
                self.depth -= 1

        # when data comes, assemble it in a string. Note that nested tags would
        # not be recorded by this if they existed. It would be more work to
        # implement that, and you don't need it for this.
        def handle_data(self, data):
            if self.depth > 0:
                self.data += data

    # open the file whose name is the first command line argument. Do so as
    # binary to get bytes from f.read() instead of a string (text mode would
    # require the data to be in the default encoding, usually UTF-8)
    with open(sys.argv[1], "rb") as f:
        # instantiate our parser
        p = my_parser()

        # then feed it the file. feed() expects a str, so the bytes have to
        # be decoded first. I'm assuming latin1-encoded data here; since the
        # example looks German, "latin9" might also be appropriate. Use the
        # encoding in which your data is encoded.
        p.feed(f.read().decode("latin1"))

    # trim (in case of newlines/spaces around the data), remove % at the end,
    # then print
    print(re.compile('%$').sub('', p.data.strip()))

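
As a quick check of the robustness claim, here is a self-contained sketch that feeds a few variants of the markup to a parser of the same shape. The names `SpanTextParser` and `extract_span_text` and the sample strings are hypothetical, and the parser takes its input from a string rather than a file so that it can be tried without test.txt.

```python
from html.parser import HTMLParser

class SpanTextParser(HTMLParser):
    """Collects the text inside the span with the given id (hypothetical helper)."""

    def __init__(self, span_id):
        super().__init__()
        self.span_id = span_id
        self.data = ''
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'span':
            return
        if ('id', self.span_id) in attrs:
            # the span we want: start collecting from scratch
            self.data = ''
            self.depth = 1
        elif self.depth > 0:
            # a span nested inside the one we want
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'span' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.data += data

def extract_span_text(html, span_id='wob_hm'):
    """Return the span's text, trimmed and without a trailing percent sign."""
    p = SpanTextParser(span_id)
    p.feed(html)
    p.close()
    return p.data.strip().rstrip('%')

# Hypothetical variants: the markup from the question, an extra attribute,
# a line break, an unquoted attribute value, and unrelated spans around it.
variants = [
    '<div>Luftfeuchte: <span id="wob_hm">53%</span></div><div>Wind:',
    '<span class="x" id="wob_hm">53%</span>',
    '<span id="wob_hm">\n53%</span>',
    '<span id=wob_hm>53%</span>',
    '<span>1%</span><span id="wob_hm">53%</span><span>9%</span>',
]

for html in variants:
    print(repr(extract_span_text(html)))
```

All five variants give the same result, because the parser sees tags and attributes rather than a fixed sequence of characters.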
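Addendum: here is a backport to Python 2 that bulldozes right over the encoding problem. For this case that is arguably better, since the encoding does not matter for the data we want to extract, and you do not have to know the encoding of the input file in advance. The changes are minimal, and it works exactly the same way: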

    #!/usr/bin/python

    from HTMLParser import HTMLParser
    import re
    import sys

    class my_parser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.data = ''
            self.depth = 0

        def handle_starttag(self, tag, attrs):
            if tag == 'span':
                if ('id', 'wob_hm') in attrs:
                    self.data = ''
                    self.depth = 1
                elif self.depth > 0:
                    # a span nested inside the one we want
                    self.depth += 1

        def handle_endtag(self, tag):
            if tag == 'span' and self.depth > 0:
                self.depth -= 1

        def handle_data(self, data):
            if self.depth > 0:
                self.data += data

    with open(sys.argv[1], "r") as f:
        p = my_parser()
        p.feed(f.read())

    print(re.compile('%$').sub('', p.data.strip()))

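
The claim that the encoding does not matter for the extracted value can be checked with a small Python 3 sketch. The sample string and the `decode_any` helper are made up for illustration, and the sketch assumes an ASCII-compatible encoding (it would not hold for UTF-16, for example). Decoding as latin1 never raises, because every byte value maps to exactly one code point, and bytes in the ASCII range come through unchanged.

```python
# Hypothetical sample; "Böen" contains one non-ASCII character.
SAMPLE = 'Böen: <span id="wob_hm">53%</span>'

def decode_any(raw):
    """Decode bytes as latin1, which accepts every possible byte value."""
    return raw.decode('latin1')

for enc in ('utf-8', 'latin1', 'cp1252'):
    text = decode_any(SAMPLE.encode(enc))
    # First flag: did the ASCII part we care about survive?
    # Second flag: did the whole string survive, umlaut included?
    print(enc, '53%' in text, text == SAMPLE)
```

The tags and digits survive in every case; only the non-ASCII character may come out garbled, and it is not part of the value being extracted.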