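How to extract a word from text in Python

Time: 2011-08-18 19:59:02

Tags: python

I have this string in a log file: "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate." What I need to do is look for this message and extract the IP address (1.2.3.4) from the log file.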

import os
import shutil
import optparse
import sys

def main():
    file = open("messages", "r")
    log_data = file.read()
    file.close()

    search_str = "is currently trusted in the white list, but it is now using a new trusted certificate."

    index = log_data.find(search_str)
    print(index)

    return

if __name__ == '__main__':
    main()

How can I extract the IP address? Thanks for your replies.

4 answers:

Answer 0: (score: 5)

The answer is quite simple:

msg = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."

parts = msg.split(' ', 2)

print(parts[1])

Result:

1.2.3.4

You could also use a regular expression if you like, but for something this simple...
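Applied to the whole log rather than to a single message, the split approach might look like the sketch below. It assumes that each log entry sits on its own line and that the IP is always the second whitespace-separated token of a matching line; the function name `extract_ips` and the sample log text are made up for illustration.

```python
SEARCH_STR = ("is currently trusted in the white list, "
              "but it is now using a new trusted certificate.")


def extract_ips(log_data):
    """Return the second token of every line that contains SEARCH_STR."""
    ips = []
    for line in log_data.splitlines():
        if SEARCH_STR in line:
            parts = line.split()
            if len(parts) > 1:
                ips.append(parts[1])
    return ips


# Hypothetical log contents, for illustration only.
sample_log = (
    "IP 1.2.3.4 is currently trusted in the white list, "
    "but it is now using a new trusted certificate.\n"
    "Some unrelated log line\n"
    "IP 10.11.6.224 is currently trusted in the white list, "
    "but it is now using a new trusted certificate.\n"
)
print(extract_ips(sample_log))
```

If real log lines carry a prefix such as a timestamp, the token index would no longer be 1, and a regular expression is probably the safer choice.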

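Answer 1: (score: 2)

There are many possible ways to do this, and their pros and cons depend on the details of your log file. One example, using the re module: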

import re
x = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
pattern = r"IP ([0-9.]+) is currently trusted in the white list"
m = re.match(pattern, x)
for ip in m.groups():
    print(ip)

If you want to print every instance of that string in the log file, you could do something like this:

import re
# re.findall scans the whole log; re.match would only test the start of log_data
pattern = r"(IP [0-9.]+ is currently trusted in the white list, but it is now using a new trusted certificate\.)"
for match in re.findall(pattern, log_data):
    print(match)
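Note that a character class like `[0-9.]+` also accepts strings that are not valid addresses, such as `1..2` or `999.1.1.1`. One possible refinement, sketched below, captures only the address for every occurrence and checks it with the standard `ipaddress` module (Python 3). The function name `find_trusted_ips` and the sample text are assumptions for illustration.

```python
import ipaddress
import re

PATTERN = re.compile(r"IP ([0-9.]+) is currently trusted in the white list")


def find_trusted_ips(log_data):
    """Return every captured address that parses as a valid IPv4 address."""
    valid = []
    for candidate in PATTERN.findall(log_data):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # skip things like "999.1.1.1" or "1..2"
        valid.append(candidate)
    return valid


# Hypothetical log contents, for illustration only.
MESSAGE = ("IP {0} is currently trusted in the white list, "
           "but it is now using a new trusted certificate.\n")
sample = MESSAGE.format("1.2.3.4") + MESSAGE.format("999.1.1.1")
print(find_trusted_ips(sample))
```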

Answer 2: (score: 1)

Use a regular expression.

Code like this:

import re

compiled = re.compile(r"""
    .*?                                # Leading junk
    (?P<ipaddress>\d+\.\d+\.\d+\.\d+)  # IP address
    .*?                                # Trailing junk
    """, re.VERBOSE)
text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
m = compiled.match(text)
print(m.group("ipaddress"))

gives you:

>>> import re
>>> 
>>> compiled = re.compile(r"""
...     .*?                                # Leading junk
...     (?P<ipaddress>\d+\.\d+\.\d+\.\d+)  # IP address
...     .*?                                # Trailing junk
...     """, re.VERBOSE)
>>> text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
>>> m = compiled.match(text)
>>> print(m.group("ipaddress"))
1.2.3.4

Also, I learned something new along the way: the match dictionary, groupdict():

>>> text = "Peer 10.11.6.224 is currently trusted in the white list, but it is now using a new trusted certificate. Consider removing its likely outdated white list entry."
>>> m = compiled.match(text)
>>> print(m.groupdict())
{'ipaddress': '10.11.6.224'}

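Later: fixed. The initial '.*' was eating the first character of your match. I changed it to be non-greedy. For consistency (though not out of necessity), I changed the trailing match as well.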
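A small sketch of the difference, using patterns reduced to the essentials (they are illustrative, not the exact pattern from before the fix). A greedy `.*` first consumes the whole string and then backtracks only as far as needed, so it leaves the shortest possible tail for the IP group:

```python
import re

text = "Peer 10.11.6.224 is currently trusted in the white list"

# Greedy: ".*" grabs as much as it can, including the leading "1" of the IP.
greedy = re.match(r".*(\d+\.\d+\.\d+\.\d+)", text)

# Non-greedy: ".*?" grabs as little as it can, so the group starts at "10".
lazy = re.match(r".*?(\d+\.\d+\.\d+\.\d+)", text)

print(greedy.group(1))  # 0.11.6.224
print(lazy.group(1))    # 10.11.6.224
```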

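Answer 3: (score: 1)

Regular expressions are the way to go. But if you are not comfortable writing them, you can try a small parser I wrote (https://github.com/hgrecco/stringparser). It converts a string format into a regular expression. In your case, you would do the following: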

from stringparser import Parser

parser = Parser("IP {} is currently trusted in the white list, but it is now using a new trusted certificate.")

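text = "IP 1.2.3.4 is currently trusted in the white list, but it is now using a new trusted certificate."
ip = parser(text)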

If you have a file containing multiple lines, you can replace the last line with:

with open("log.txt", "r") as fp:
    ips = [parser(line) for line in fp]
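To give a rough idea of what "converting a string format into a regular expression" can mean, here is a minimal standard-library sketch. It is not how stringparser is actually implemented; the helper name `template_to_regex` is hypothetical, and it only handles bare `{}` placeholders.

```python
import re


def template_to_regex(template):
    """Turn a template with '{}' placeholders into a compiled regex.

    Literal text is escaped; each '{}' becomes a non-greedy capture group.
    """
    literal_parts = template.split("{}")
    pattern = "(.+?)".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern)


TEMPLATE = ("IP {} is currently trusted in the white list, "
            "but it is now using a new trusted certificate.")
regex = template_to_regex(TEMPLATE)

# Hypothetical log line, for illustration only.
line = ("IP 1.2.3.4 is currently trusted in the white list, "
        "but it is now using a new trusted certificate.\n")
m = regex.search(line)
print(m.group(1) if m else None)
```

Using `search` rather than `match` lets the template appear anywhere in the line, which helps if the log adds a prefix or a trailing newline.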

Good luck.