Question

My file contains the following lines:

lines.txt

1. robert
   smith
2. harry
3. john

I would like to get the following array:

["robert\nsmith","harry","john"]

I tried something like this:

with open('lines.txt') as fh:
    m = [re.match(r"^\d+\.(.*)",line) for line in fh.readlines()]
    print(m)
    for i in m:
        print(i.groups())

It outputs the following:

[<_sre.SRE_Match object; span=(0, 9), match='1. robert'>, None, <_sre.SRE_Match object; span=(0, 8), match='2. harry'>, <_sre.SRE_Match object; span=(0, 7), match='3. john'>]
(' robert',)
Traceback (most recent call last):
  File "D:\workspaces\workspace6\PdfGenerator\PdfGenerator.py", line 5, in <module>
    print(i.groups())
AttributeError: 'NoneType' object has no attribute 'groups'

It seems I am going about this in a very wrong way. How would you solve it?

Answer 1

You can read the whole file into memory and use

r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)'

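See the regex demo.

Details

(?ms) - enables the re.MULTILINE and re.DOTALL modes
^ - start of a line
\d+ - one or more digits
\. - a dot
\s* - zero or more whitespace characters
(.*?) - Group 1 (this is what re.findall returns here): any zero or more characters, as few as possible
(?=^\d+\.|\Z) - up to (but not including) the first occurrence of
- ^\d+\. - start of a line, one or more digits and a .
- | - or
- \Z - end of the string.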

Python：

import re

with open('lines.txt') as fh:
    # Read the whole file at once so an item can span several lines
    print(re.findall(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)', fh.read()))
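To try the pattern without a file, here is a minimal sketch that runs it on an in-memory string. The sample text and the names PATTERN, raw and items are assumptions made for illustration. Because the lazy group runs up to the start of the next numbered line, each captured item keeps its trailing newline, so the sketch strips it afterwards.

```python
import re

# Same pattern as above, compiled once for reuse
PATTERN = re.compile(r'(?ms)^\d+\.\s*(.*?)(?=^\d+\.|\Z)')

# Assumed sample: the content of lines.txt, including a final newline
text = "1. robert\n   smith\n2. harry\n3. john\n"

raw = PATTERN.findall(text)
print(raw)    # ['robert\n   smith\n', 'harry\n', 'john\n']

# Drop the trailing whitespace that each captured item carries
items = [item.rstrip() for item in raw]
print(items)  # ['robert\n   smith', 'harry', 'john']
```

The indentation before "smith" is still there; removing it needs an extra substitution step.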

Answer 2

Use re.findall to find everything from the number-and-dot pattern up to the next '\n\d' pattern or the end of the string.
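One way this could look is sketched below. The exact regex is an assumption, since this answer only describes the approach in words, and the sample text stands in for the file content.

```python
import re

# Assumed sample: the content of lines.txt without a trailing newline
text = "1. robert\n   smith\n2. harry\n3. john"

# Capture lazily from after "N. " up to the next newline followed by a digit,
# or up to the end of the string; DOTALL lets "." cross line breaks.
names = re.findall(r'\d+\.\s+(.*?)(?=\n\d|\Z)', text, flags=re.DOTALL)
print(names)  # ['robert\n   smith', 'harry', 'john']
```

If the input ends with a newline, the last item keeps it, so calling text.strip() first is a reasonable precaution.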

Answer 3

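You can use re.split.

Regex: \n?\d+\.\s*

Details:

\n - a newline character
? - matches 0 or 1 times, so the newline is matched if it is present
\d+ - matches a digit one or more times (+)
\. - a dot
\s* - matches any whitespace character (equivalent to [\r\n\t\f\v ]) zero or more times (*)

Python code (lines holds the file content as a single string):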

re.split(r'\n?\d+\.\s*', lines)[1:]

[1:] drops the first item because it is an empty string.

Output:

['robert\n   smith', 'harry', 'john']
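The output above still carries the indentation before "smith". As a hedged sketch, a second substitution can collapse the whitespace around inner line breaks to give exactly the array asked for in the question. The sample text and the names parts and names are assumptions made for illustration.

```python
import re

# Assumed sample: the content of lines.txt
text = "1. robert\n   smith\n2. harry\n3. john\n"

# Split on the "N. " markers; strip() avoids a trailing newline on the last item
parts = re.split(r'\n?\d+\.\s*', text.strip())[1:]

# Collapse whitespace around each inner line break to a single "\n"
names = [re.sub(r'\s*\n\s*', '\n', part) for part in parts]
print(names)  # ['robert\nsmith', 'harry', 'john']
```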

Answer 4

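I came up with a solution that collects only the names, without extra whitespace inside a name, unlike the other solutions.

The idea is:

Save a list of tuples (number, name segment), "copying" the group number from the previous line when the current line has none. Each pair is prepared by the getPair function.
Group these tuples by the number (the first element).
Join the name segments of each group, using \n as the separator.
Save these joined names in the result list.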
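The steps above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the answer's own code: get_pair mirrors the getPair function mentioned above, but its signature and body, the collect_names helper and the sample data are all hypothetical.

```python
import re
from itertools import groupby

def get_pair(line, last_number):
    """Return (number, name segment) for one line (hypothetical helper).

    A line without its own "N." prefix reuses the previous line's number.
    """
    m = re.match(r'(\d+)\.\s*(.*)', line)
    if m:
        return int(m.group(1)), m.group(2).strip()
    return last_number, line.strip()

def collect_names(lines):
    pairs = []
    last_number = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        pair = get_pair(line, last_number)
        last_number = pair[0]
        pairs.append(pair)
    # Group consecutive pairs by number and join their segments with "\n"
    return ['\n'.join(segment for _, segment in group)
            for _, group in groupby(pairs, key=lambda pair: pair[0])]

# Assumed sample: the lines of lines.txt as returned by readlines()
sample = ["1. robert\n", "   smith\n", "2. harry\n", "3. john\n"]
print(collect_names(sample))  # ['robert\nsmith', 'harry', 'john']
```

Because every segment is stripped before joining, the indentation of continuation lines never reaches the result.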

Using list comprehensions, the program can be written quite concisely.

All in all, the program is a bit longer than the others, but it yields only the names, with no extra whitespace.

Using a Python regex to get the content of a numbered list in a file