Question

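I'm having trouble with text parsing.

Summary: I fetch an HTML page with the Grab library, convert it to text with the NLTK library, and store that text in a variable. After that, I want to search for all lines that contain "word" and print each of those lines.

For example, suppose we have the following text: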

test1: olololo
test2: print something
FAQ Frequently Asked Question(s)
I want to search for test1 and get this printed as the result: test1: olololo

import logging, nltk
from grab import Grab
from urllib import urlopen

logging.basicConfig(level=logging.DEBUG)
parsing_url = raw_input("Enter URL:")
if parsing_url.startswith('http://') or parsing_url.startswith('https://'):
    parsing_url = parsing_url.replace('http://','').replace('https://','')
print parsing_url
g = Grab()
g.go('http://user:pass@' + parsing_url, log_file='out.html')
url = "out.html"
html = urlopen(url).read()
raw = nltk.clean_html(html)

In bash, I do it like this:

root@srv:~$ cat 123 | grep "test1"

As a result I get:

test1: olololo

But in Python I don't want to execute bash commands :)

Answer 1

Try this:

# splitlines() yields whole lines; a bare split() would break on every space.
for line in html.splitlines():
    if "test1" in line:
        print(line)

Answer 2

Assuming raw is a list of strings (i.e. a list of lines):

good_lines = [l for l in raw if 'test1' in l]
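In the question, raw appears to be a single string rather than a list. Iterating over a string yields one character at a time, so the comprehension above would match nothing for a multi-character word. A minimal sketch of one way around that, assuming raw is a plain string; the helper name matching_lines and the sample text are made up for illustration:

```python
def matching_lines(text, word):
    """Return the lines of text that contain word (hypothetical helper)."""
    # splitlines() breaks the string at line boundaries, giving a list of lines.
    return [line for line in text.splitlines() if word in line]


# Sample text modelled on the example in the question.
raw = "test1: olololo\ntest2: print something\nFAQ Frequently Asked Question(s)"

# Iterating over the string itself yields single characters, so this stays empty.
chars = [c for c in raw if "test1" in c]

good_lines = matching_lines(raw, "test1")
for line in good_lines:
    print(line)
```

If raw already is a list of lines, the original comprehension works unchanged.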

Answer 3

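Maybe someone will find this useful. I solved the problem this way:
1. Use the NLTK lib to decode the HTML into text.
2. Write this text to a file.
3. Parse the file with bash commands.
For example: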

import commands  # Python 2 only

status, host = commands.getstatusoutput("cat raw.log | sed 's/^[ \t]*//' | grep -A 2 \"On Host\" | sed -n 2p")
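For comparison, roughly the same result can be had without a shell. This is only a sketch of one reading of that pipeline: strip leading spaces and tabs, find the first line containing "On Host", and take the line right after it, which is what grep -A 2 followed by sed -n 2p ends up printing. The function name line_after_match and the sample log are hypothetical, since the real contents of raw.log are not shown:

```python
def line_after_match(text, marker):
    """Return the line following the first line that contains marker, or None."""
    # Equivalent of sed 's/^[ \t]*//': drop leading spaces and tabs.
    lines = [line.lstrip(" \t") for line in text.splitlines()]
    for i, line in enumerate(lines):
        if marker in line:
            # grep -A 2 prints the match plus the two lines after it;
            # sed -n 2p then keeps only the second printed line.
            return lines[i + 1] if i + 1 < len(lines) else None
    return None


# Made-up sample standing in for the contents of raw.log.
sample = "  Status report\n\tOn Host\n\t  srv01.example.com\n  uptime 3 days\n"
host = line_after_match(sample, "On Host")
print(host)
```

To run it against the real file, read the text first, for example with open("raw.log") and its read() method, and pass the result as text.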

Also, I'm still trying to parse this text with Python tools.

Python: search for all lines that contain "word"
