Question

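All,

I am trying to write a Python script that goes through a crime report file and splits it into the following sections: UPDATES, INCIDENTS, and ARRESTS. The reports I receive mark these sections either with the plain words listed above or with **UPDATES**, **INCIDENTS**, and **ARRESTS**. I have started writing the script below, which splits the file based on the format that uses **. However, is there a better way to check a file for both formats at the same time? Also, sometimes there is no UPDATES or ARRESTS section, which causes my code to break. Is there a check I can do for that case, and if so, how can I still get the INCIDENTS section without the other two?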

with open('CrimeReport20150518.txt', 'r') as f:
  content = f.read()
  print(content.index('**UPDATES**'))
  print(content.index('**INCIDENTS**'))
  print(content.index('**ARRESTS**'))
  updatesLine = content.index('**UPDATES**')
  incidentsLine = content.index('**INCIDENTS**')
  arrestsLine = content.index('**ARRESTS**')
  #print(content[updatesLine:incidentsLine])
  updates = content[updatesLine:incidentsLine]
  #print(updates)
  incidents = content[incidentsLine:arrestsLine]
  #print(incidents)
  arrests = content[arrestsLine:]
  print(arrests)

Answer 1

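You are currently using .index() to locate the headings in the text. The documentation states:

Like find(), but raise ValueError when the substring is not found.

This means you need to catch the exception in order to handle it. For example: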

try:
    updatesLine = content.index('**UPDATES**')
    print("Found updates heading at %d" % updatesLine)
except ValueError:
    # .index() raises ValueError when the heading is missing
    print("Note: no updates")
    updatesLine = -1

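From here, you can work out the correct indices for slicing the string based on which sections are present.

Alternatively, you could use the .find() method referenced in the .index() documentation:

Return -1 if sub is not found.

With find, you only need to test the value it returns.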

updatesLine = content.find('**UPDATES**')
incidentsLine = content.find('**INCIDENTS**')
arrestsLine = content.find('**ARRESTS**')
# the following is straightforward, but unwieldy
if updatesLine != -1:
    if incidentsLine != -1:
        updates = content[updatesLine:incidentsLine]
    elif arrestsLine != -1:
        updates = content[updatesLine:arrestsLine]
    else:
        updates = content[updatesLine:]

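Either way, you have to account for which sections are present and which are missing in order to determine the correct slice boundaries.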
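One way to avoid the nested branching is to collect the offsets of the headings that are actually present, sort them, and slice from each heading to the next one. The following is only a minimal sketch of that idea, not the asker's code: the names `HEADINGS` and `split_sections` are made up for illustration, and it assumes each heading appears at most once and only in the `**...**` format.

```python
HEADINGS = ('**UPDATES**', '**INCIDENTS**', '**ARRESTS**')  # assumed heading spellings

def split_sections(content, headings=HEADINGS):
    """Return {heading: text} for every heading that occurs in content."""
    # Look up every heading, then keep only those that were found (find() != -1).
    found = [(content.find(h), h) for h in headings]
    found = sorted((pos, h) for pos, h in found if pos != -1)

    sections = {}
    for i, (pos, heading) in enumerate(found):
        # A section ends where the next present heading starts,
        # or at the end of the text for the last one.
        end = found[i + 1][0] if i + 1 < len(found) else len(content)
        sections[heading] = content[pos:end]
    return sections

# Hypothetical report that only has an INCIDENTS section.
sample = "**INCIDENTS**\nburglary on Main St\n"
sections = split_sections(sample)
print(sections.get('**INCIDENTS**', ''))
```

Because missing headings are filtered out before slicing, a report with only an INCIDENTS section yields a dictionary with a single key, and looking up an absent section with `.get()` gives an empty default instead of an exception. As in the original slices, each section's text still starts with its heading.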

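I would prefer to approach this with a state machine. Read the file line by line and append each line to the appropriate list. When a heading is found, update the state. Here is an untested demonstration of the principle: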

data = {
    'updates': [],
    'incidents': [],
    'arrests': [],
    }

state = None
with open('CrimeReport20150518.txt', 'r') as f:
    for line in f:
        # lines read from a file keep their trailing newline, so strip before comparing
        heading = line.strip()
        if heading == '**UPDATES**':
            state = 'updates'
        elif heading == '**INCIDENTS**':
            state = 'incidents'
        elif heading == '**ARRESTS**':
            state = 'arrests'
        else:
            if state is None:
                print("Warn: no header seen; skipping line")
            else:
                data[state].append(line)

# join() is a method of the separator string, not of the list
print(''.join(data['arrests']))
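The same state machine can also cover the question about checking both heading formats at once, by normalizing each line before comparing it. The sketch below is an illustration under assumptions rather than a tested solution for real reports: `SECTION_NAMES` and `parse_report` are hypothetical names, and it assumes a heading sits alone on its line, so a body line consisting only of the word UPDATES, INCIDENTS, or ARRESTS would be mistaken for a heading.

```python
SECTION_NAMES = ('UPDATES', 'INCIDENTS', 'ARRESTS')  # assumed section names

def parse_report(lines):
    """Group lines by section; accepts any iterable of lines, e.g. an open file."""
    data = {name.lower(): [] for name in SECTION_NAMES}
    state = None
    for line in lines:
        # Drop surrounding whitespace and any '*' decoration, so that
        # 'ARRESTS', '**ARRESTS**' and '** ARRESTS **' all compare equal.
        candidate = line.strip().strip('*').strip()
        if candidate in SECTION_NAMES:
            state = candidate.lower()
        elif state is not None:
            data[state].append(line)
        # Lines seen before the first heading are silently skipped.
    return data

# Hypothetical report mixing both heading styles, with no UPDATES section.
report = ["INCIDENTS\n", "burglary on Main St\n", "** ARRESTS **\n", "one arrest\n"]
data = parse_report(report)
print(''.join(data['incidents']))
```

A missing section simply leaves its list empty, so no special case is needed for reports without UPDATES or ARRESTS. To read from disk, pass the open file object to `parse_report` in place of the list.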

Answer 2

Try using content.find() instead of content.index(). Rather than raising an error when the string is not there, it returns -1. Then you can do something like this:

updatesLine = content.find('**UPDATES**')
incidentsLine = content.find('**INCIDENTS**')
arrestsLine = content.find('**ARRESTS**')

if incidentsLine != -1 and arrestsLine != -1:
    # Do what you normally do (this branch assumes UPDATES is present too;
    # check updatesLine != -1 first if it may be missing)
    updates = content[updatesLine:incidentsLine]
    incidents = content[incidentsLine:arrestsLine]
    arrests = content[arrestsLine:]

elif incidentsLine != -1:
    # Do whatever you need to do to files that don't have an arrests section here
    pass

elif arrestsLine != -1:
    # Handle files that don't have an incidents section here
    pass

else:
    # Handle files that are missing both
    pass

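You will probably need to handle each of the four possible combinations slightly differently.

Your solution generally looks fine, as long as the sections always appear in the same order and the files do not get too large. You can get detailed feedback on Stack Exchange's Code Review site: https://codereview.stackexchange.com/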

Python - Check a file against the format of three values, then perform a task
