Question

I have a list of tags, for example:

 <a class="title" href="/forum-replies.cfm?t=2709069">Roaming to other network when yours is unavailable</a>,
 <a class="title" href="/forum-replies.cfm?t=2747612">Wifi Calling problem</a>,
 <a class="title" href="/forum-replies.cfm?t=2705042">Kogan Mobile non-compliance with Tax Invoices</a>,
 <a class="title" href="/forum-replies.cfm?t=2715307">please help internet god</a>,
 <a class="title" href="/forum-replies.cfm?t=2715014">Apple deals returning soon?</a>

The following regex search extracts all 101 matches when run:

import re
regex = re.compile('(?<=>)(\w|\d|\s)+')
ThreadNames = []
for string in ThreadNameFlat:
    ThreadNames.append(re.search(regex,str(string)))

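When I run the next piece of code to pull out just the matched text, ThreadNames contains only 10 of the 101 matches that the code above returns, and ThreadNamesTest contains single characters such as 'm', '', 'n', etc.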

import re
regex = re.compile('(?<=>)(\w|\d|\s)+')
ThreadNames = []
ThreadNamesTest = []
for string in ThreadNameFlat:
    ThreadNames.append(re.search(regex,str(string)))
    match = re.search(regex,str(string))
    ThreadNamesTest.append(match.groups())

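I thought groups() would return all 101 matches found by the first script, but it looks like the groups() call is what is causing the problem.

Edit: I changed it to .group() instead of .groups(), and now it returns the full match, but only for 10 of the 101 tags.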

How can I get all 101 tags instead of only 10?

The new result is here

Answer 1

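re.search returns a Match object. See the docs. As for Match.groups:

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.

But you don't care about subgroups here; you just need the full match. For that, you should probably use .group() or .group(0):

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned).

Like so: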

import re
ThreadNameFlat = ['<a class="title" href="/forum-replies.cfm?t=2709069">Roaming to other network when yours is unavailable</a>',
' <a class="title" href="/forum-replies.cfm?t=2747612">Wifi Calling problem</a>',
' <a class="title" href="/forum-replies.cfm?t=2705042">Kogan Mobile non-compliance with Tax Invoices</a>',
' <a class="title" href="/forum-replies.cfm?t=2715307">please help internet god</a>',
' <a class="title" href="/forum-replies.cfm?t=2715014">Apple deals returning soon?</a>']

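# Raw string avoids invalid-escape warnings; (?:...) is non-capturing,
# so the repeated alternation no longer creates a subgroup.
regex = re.compile(r'(?<=>)(?:\w|\d|\s)+')
ThreadNamesTest = []
for string in ThreadNameFlat:
    match = re.search(regex,str(string))
    # group() with no arguments returns the whole match as a string
    ThreadNamesTest.append(match.group())
print(ThreadNamesTest)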

Output:

['Roaming to other network when yours is unavailable', 'Wifi Calling problem', 'Kogan Mobile non', 'please help internet god', 'Apple deals returning soon']
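
As a side note on why .groups() produced single characters such as 'm' and 'n' in the question: when a capturing group is repeated with +, the group keeps only its last repetition. A minimal sketch of that behaviour, using one tag from the question (the variable names here are illustrative and not from the original post):

```python
import re

tag = '<a class="title" href="/forum-replies.cfm?t=2747612">Wifi Calling problem</a>'

# Capturing group repeated with +: the group holds only its last repetition.
capturing = re.compile(r'(?<=>)(\w|\d|\s)+')
match = capturing.search(tag)

whole_match = match.group()   # the entire matched text
subgroups = match.groups()    # one entry per capturing group, last repetition only

print(whole_match)  # Wifi Calling problem
print(subgroups)    # ('m',)
```

The pattern matched the whole title both times; only the accessor differs.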

Answer 2

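Figured it out.

The regex did not match the 11th value in the tag list. Once .group() failed on that 11th value, none of the following items were processed.

I changed the regex from: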

regex = re.compile('(?<=>)(\w|\d|\s)+')

to:

regex = re.compile('(?<=>)(.\w.|.\s)+')

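The . matches any character except a line break, so the new pattern picks up the stray encoded characters in the 11th tag that the original one could not match.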
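
The failure described in this answer (re.search finds nothing for one item, so the .group() call breaks the loop) can also be guarded against explicitly. The sketch below is one possible approach, not the fix used in this answer: it assumes the link text never contains '<', uses the character class [^<]+ instead of the alternation, and skips items that do not match. The sample list and the helper name extract_titles are hypothetical.

```python
import re

# Hypothetical sample data. The second title starts with a non-word character,
# which the original (\w|\d|\s)+ pattern cannot match at all.
tags = [
    '<a class="title" href="/forum-replies.cfm?t=2747612">Wifi Calling problem</a>',
    '<a class="title" href="#">$10 plan: worth it?</a>',
    'no tag here',
]

# Assumption: the link text never contains '<', so [^<]+ captures all of it.
TITLE_RE = re.compile(r'(?<=>)[^<]+')

def extract_titles(items):
    """Return the text after the first '>' of each item, skipping non-matches."""
    titles = []
    for item in items:
        match = TITLE_RE.search(str(item))
        if match is None:
            # re.search returns None when nothing matches; calling .group()
            # on None would raise AttributeError and stop the loop.
            continue
        titles.append(match.group())
    return titles

titles = extract_titles(tags)
print(titles)  # ['Wifi Calling problem', '$10 plan: worth it?']
```

Silently skipping non-matching items can hide bad input, so collecting or logging them instead may be preferable depending on the use case.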
