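How do I extract the information on the lines between two headers?

Time: 2017-06-03 04:53:37

Tags: python-2.7

I am new to Python, and I am trying to extract the information between two headers in a text file using the code below, which currently does not work.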

with open('toysystem.txt','r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    i = 0
    lines = f.readlines()
    for line in lines:
        if line == start:
            keywords = lines[i+1]
        i += 1

For reference, the text file looks like this:

<Keywords>
GTO
</Keywords>

Any ideas about what might be wrong with the code? Or is there a different way to approach this problem?

Thanks!

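2 answers:

Answer 0 (score: 1)

  • Lines read from a file keep their trailing newline character, so we should probably strip it before comparing.

  • The file object f is an iterator over its lines, so we do not need the readlines method here.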
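To make the first point concrete, here is a small sketch that uses the list of lines the sample file would produce instead of reading from disk. The variable names raw_match and stripped_match are only illustrative.

```python
# Lines as readlines() would return them for the sample file.
lines = ['<Keywords>\n', 'GTO\n', '</Keywords>\n']
start = '<Keywords>'

# The raw line still carries its trailing newline, so the comparison fails.
raw_match = lines[0] == start

# After stripping trailing whitespace, the comparison succeeds.
stripped_match = lines[0].rstrip() == start

print(raw_match)
print(stripped_match)
```

This is why the condition in the question's loop is never true and keywords is never assigned.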

So we can write something like

with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line)

which gives us

>>> keywords
['GTO\n']
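The two consecutive loops work because an iterator remembers its position: the second loop resumes where the first one stopped. A minimal sketch of that behaviour, using iter() over a list in place of the file object (the 'header' and 'footer' lines are made up for illustration):

```python
# iter() over a list stands in for the open file object here.
lines = iter(['header\n', '<Keywords>\n', 'GTO\n', '</Keywords>\n', 'footer\n'])

# Advance the iterator up to and including the start marker.
for line in lines:
    if line.rstrip() == '<Keywords>':
        break

# The iterator continues from where the loop left off.
following = next(lines)
print(repr(following))
```

Looping twice over a plain list would not behave this way, since each for loop over a list starts again from the first element.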

If you do not need the trailing newlines on the keywords, you can strip them as well:

with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    for line in f:
        if line.rstrip() == start:
            break
    for line in f:
        if line.rstrip() == end:
            break
        keywords.append(line.rstrip())

which gives

>>> keywords
['GTO']

But in this case it is better to create a generator of stripped lines, like

with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    stripped_lines = (line.rstrip() for line in f)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)

which does the same thing.

Finally, if you need the lines themselves later in your script, we can combine the readlines method with the generator of stripped lines:

with open('toysystem.txt', 'r') as f:
    start = '<Keywords>'
    end = '</Keywords>'
    keywords = []
    # keep the raw lines around for later use
    lines = f.readlines()
    # the generator already strips each line, so no further rstrip is needed
    stripped_lines = (line.rstrip() for line in lines)
    for line in stripped_lines:
        if line == start:
            break
    for line in stripped_lines:
        if line == end:
            break
        keywords.append(line)

which gives us

>>> lines
['<Keywords>\n', 'GTO\n', '</Keywords>\n']
>>> keywords
['GTO']
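If the same extraction is needed in several places, the pattern above could be wrapped in a small helper. This is only a sketch: lines_between is a hypothetical name, not something from the answer, and it accepts any iterable of lines (a list or an open file object).

```python
def lines_between(lines, start, end):
    """Return the stripped lines found between the start and end markers."""
    stripped_lines = (line.rstrip() for line in lines)
    # Consume lines up to and including the start marker.
    for line in stripped_lines:
        if line == start:
            break
    result = []
    # Collect lines until the end marker, or until the input runs out.
    for line in stripped_lines:
        if line == end:
            break
        result.append(line)
    return result


# Sample input standing in for the contents of the text file.
sample = ['<Keywords>\n', 'GTO\n', '</Keywords>\n']
keywords = lines_between(sample, '<Keywords>', '</Keywords>')
print(keywords)
```

If the start marker never appears, the first loop exhausts the input and the function returns an empty list.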

Answer 1 (score: 0)

How about using the Python re module and parsing the file with a regular expression?

import re

with open('toysystem.txt', 'r') as f:
    contents = f.read()

# findall returns a list of the values captured by the (...) group for every
# match in the file. You can extend the expression according to your needs.
# re.IGNORECASE is needed because the file uses <Keywords>, not <keywords>.
keywords = re.findall(r'<keywords>\s*(.*?)\s*</keywords>', contents, re.IGNORECASE)
print(keywords)

For your file, this will print

['GTO']
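Note that '.' does not match a newline by default, so a pattern like this only captures a single line between the tags. If several lines can appear there, one possible variation is to pass re.DOTALL and split the captured text. This is a sketch with made-up file contents ('LEO' is not in the original file):

```python
import re

# Hypothetical file contents with two lines between the tags.
contents = '<Keywords>\nGTO\nLEO\n</Keywords>\n'

# re.DOTALL lets '.' match newlines, so the group can span several lines.
match = re.search(r'<Keywords>\s*(.*?)\s*</Keywords>', contents, re.DOTALL)

# Split the captured text into one entry per line.
keywords = match.group(1).splitlines() if match else []
print(keywords)
```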

For more information on regular expressions in Python, see Tutorialspoint and the re module documentation for Python 3 and Python 2.