Question

I have a text file that looks like this:

$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ ....

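I want to extract the text between two $ symbols and put that text into a list or a dictionary. How can I do this in Python by reading the file?

I tried a regular expression, but it gives me alternate values from the text file: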

f1 = open('some.txt','r')
lines = f1.read()
x = re.findall(r'$(.*?)$', lines, re.DOTALL)

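I want the output to look like this - ['abc', 'defghjik', 'am here', 'not now'] ['you', 'are not', 'here but go there']

Sorry, I am new to Python and trying to learn; any help is appreciated. Thanks!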

Answer 1

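In a regular expression, $ has a special meaning, so to match it literally you need to escape it first. Note that inside a character class ([]), $ and other metacharacters lose their special meaning, so no escaping is needed there. The following regular expression should do it: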

\$\s*([^$]+)(?=\$)

Demo:

>>> import re
>>> lines = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$'''
>>> it = re.finditer(r'\$\s*([^$]+)(?=\$)', lines, re.DOTALL)
>>> [x.group(1).splitlines() for x in it]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
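
To see why the escaping matters, here is a small sketch that runs the unescaped pattern from the question next to the escaped one. The shortened `text` value is made up for illustration and is not the asker's file.

```python
import re

# Shortened, made-up sample in the same shape as the question's file.
text = "$ abc\ndefghjik\n$ you\nare not\n$"

# Unescaped: '$' is an end-of-string anchor, so no section text is captured.
unescaped = re.findall(r'$(.*?)$', text, re.DOTALL)

# Escaped: '\$' matches a literal dollar sign; inside '[^$]' no escape is needed.
escaped = re.findall(r'\$\s*([^$]+)(?=\$)', text)

print(unescaped)
print([m.splitlines() for m in escaped])
```

Because the lookahead requires a following `$`, a last section that is not closed by another `$` would not be captured by the escaped pattern.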

Answer 2

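In a regular expression, $ is a character with special meaning and must be escaped to match the literal character. To match multiple sections, I would use a lookahead assertion (?=...) to assert that a literal $ character follows.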

>>> x = re.findall(r'(?s)\$\s*(.*?)(?=\$)', lines)
>>> [i.splitlines() for i in x]
[['abc', 'defghjik', 'am here', 'not now'], ['you', 'are not', 'here but go there']]
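
Since the question also mentions a dictionary, the same pattern could feed one. The sketch below is only one possible shape: the helper name `sections_to_dict`, the choice of the first line as the key, and the extra `\Z` alternative (so that a final section without a closing `$` is kept) are all assumptions, not part of the answer above.

```python
import re

def sections_to_dict(text):
    """Map each section's first line to the remaining lines (hypothetical helper)."""
    result = {}
    # (?s) lets '.' span newlines; the lookahead stops at the next '$' or at end of text.
    for body in re.findall(r'(?s)\$\s*(.*?)(?=\$|\Z)', text):
        lines = body.splitlines()
        if lines:
            result[lines[0]] = lines[1:]
    return result

sample = "$ abc\ndefghjik\nam here\nnot now\n$ you\nare not\nhere but go there\n$ ....\n"
print(sections_to_dict(sample))
```

Keying on the first line silently overwrites earlier sections that start with the same text, so a list of lists is the safer container if headers can repeat.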

Answer 3

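A regular expression may not actually be what you want: your desired output has each line as a separate entry in the list. I'd suggest just using lines.split('\n') and then iterating over the resulting list.

I'll write this as though you only need to print the text you want as output. Adjust as needed.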

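with open('some.txt', 'r') as f1:
    lines = f1.read()

lists = []
for s in lines.split('\n'):
    if s.startswith('$'):
        # A '$' line starts a new section: print the previous one first.
        if lists:
            print(lists)
        # Keep any text that follows the '$' marker on the same line.
        first = s[1:].strip()
        lists = [first] if first else []
    elif s:
        lists.append(s)
if lists:
    print(lists)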
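
If you would rather collect the sections than print them, the same loop can be wrapped in a function. This is a minimal sketch; the name `collect_sections` is made up here, and it assumes every section starts with a line beginning with `$`.

```python
def collect_sections(text):
    """Return one list of lines per '$'-delimited section (hypothetical helper)."""
    sections = []
    for line in text.split('\n'):
        if line.startswith('$'):
            # Start a new section; keep any text after the marker.
            first = line[1:].strip()
            sections.append([first] if first else [])
        elif line and sections:
            sections[-1].append(line)
    # Drop sections that ended up empty (for example a bare trailing '$').
    return [s for s in sections if s]

sample = "$ abc\ndefghjik\nam here\nnot now\n$ you\nare not\nhere but go there\n$"
print(collect_sections(sample))
```

Lines that appear before the first `$` are ignored in this version.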

Happy Pythoning! Welcome to the club. :)

Answer 4

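$ has a special meaning in regular expressions. It is an anchor: it matches at the end of the string, or just before the newline at the end of the string. See here:
Regular Expression Operations
You can escape the $ sign by putting a '\' character in front of it, so that it is not treated as an anchor. Better still, you don't need a regular expression at all. You can use the split method of Python strings.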

>>> string = '''$ abc
defghjik
am here
not now
$ you
are not
here but go there
$ '''
>>> string.split('$')
['', ' abc\ndefghjik\nam here\nnot now\n', ' you\nare not\nhere but go there\n', ' ']

You get a list. To remove the empty-string entries, you can do the following:

a=string.split('$')
while a.count('') > 0:
    a.remove('')
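
To get from the split result to the nested lists the question asks for, one option is to strip each chunk and break it into lines. This is a sketch of that idea, not part of the original answer; it also drops chunks that contain only whitespace, such as the trailing ' ' above.

```python
string = "$ abc\ndefghjik\nam here\nnot now\n$ you\nare not\nhere but go there\n$ "

sections = [
    chunk.strip().splitlines()   # one list of lines per section
    for chunk in string.split('$')
    if chunk.strip()             # skip empty or whitespace-only chunks
]
print(sections)
```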

Answer 5

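Reading sections of a file usually comes down to an "iteration pattern". The itertools package has many generators that can help. Or you can write your own generator. For example: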

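def take_sections(predicate, iterable, firstpost=lambda x: x):
    i = iter(iterable)
    batch = None
    try:
        nextone = next(i)
        while True:
            # The first line of each section is the marker; clean it up.
            batch = [firstpost(nextone)]
            nextone = next(i)
            while not predicate(nextone):
                batch.append(nextone)
                nextone = next(i)
            yield batch
    except StopIteration:
        # Input exhausted: emit the final section, if there was one.
        if batch is not None:
            yield batch
        return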

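This is similar to itertools.takewhile, except that it behaves more like an until loop (i.e. it tests at the bottom rather than at the top). It also has a built-in cleanup/post-processing function for the first line of a section (the "section marker"). Once you have abstracted out this iteration pattern, you just need to read the lines of the file, define how to recognize and clean up a section marker, and run the generator: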

with open('some.txt','r') as f1:
    lines = [ l.strip() for l in f1.readlines() ]

dollar_line = lambda x: x.startswith('$')
clean_dollar_line = lambda x: x[1:].lstrip()

print(list(take_sections(dollar_line, lines, clean_dollar_line)))

Output:

[['abc', 'defghjik', 'am here', 'not now'], 
 ['you', 'are not', 'here but go there'], 
 ['....']]
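
For comparison, here is a sketch of how a similar split could be built from itertools pieces instead of a hand-written generator. The function name `split_sections` is hypothetical, and unlike take_sections it leaves the marker line untouched as the first entry of each section.

```python
from itertools import accumulate, groupby

def split_sections(lines, is_marker):
    """Group lines into sections, each starting at a marker line (hypothetical helper)."""
    lines = list(lines)
    # Running count of markers seen so far: all lines of one section share an id.
    ids = accumulate(1 if is_marker(line) else 0 for line in lines)
    return [
        [line for _, line in group]
        for _, group in groupby(zip(ids, lines), key=lambda pair: pair[0])
    ]

sample_lines = ['$ abc', 'defghjik', 'am here', 'not now',
                '$ you', 'are not', 'here but go there', '$ ....']
print(split_sections(sample_lines, lambda line: line.startswith('$')))
```

Any lines that come before the first marker end up in a leading group of their own.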

python - Find the text between two $ and put it into a list

5 answers: