Question

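I need some help printing the duplicated last names in a text file (lowercase and uppercase should be treated the same). The program must not print words that contain digits (i.e. if a digit appears in the last name or the first name, the whole name is ignored).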

For example, my text file is:

Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu

The output should be:

Assaf

Assaf

David

Bibi

Amnon

Ehud

========

Spanier

Levi

Answer 1

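OK, first let's start by opening the file in an idiomatic way. Use the with statement, which guarantees that your file will be closed. For small scripts this isn't a big deal, but if you start writing longer-lived programs, resource leaks from improperly closed files (leaked file handles and the memory behind them) can come back to haunt you. Since your file has everything on a single line: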

with open(fname) as f:
    data = f.read()

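The file is now closed. This also encourages you to deal with your file immediately instead of leaving it open and consuming resources unnecessarily. As an aside, let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct: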

with open(fname) as f:
    for line in f:
        do_stuff(line)

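Since you don't actually need to keep the whole file and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, e.g. lines = f.readlines().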
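A minimal sketch of the difference, written against a throwaway temporary file so it runs as-is. The file contents and the helper name count_nonempty_lines are made up for illustration and are not part of the original problem:

```python
import os
import tempfile


def count_nonempty_lines(path):
    """Count non-blank lines by streaming the file one line at a time."""
    count = 0
    with open(path) as f:      # the file is closed when this block exits
        for line in f:         # lazy iteration: one line in memory at a time
            if line.strip():
                count += 1
    return count


# Write a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Assaf Spanier\n\nDavid levi\n")
    path = tmp.name

try:
    streamed = count_nonempty_lines(path)
    with open(path) as f:
        lines = f.readlines()  # builds a list holding every line at once
    print(streamed, len(lines))
finally:
    os.remove(path)            # clean up the temporary file
```

Both loops visit the same lines; the only difference is that readlines() materialises the whole list first, which matters once files get large.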

OK, finally, data looks like this:

>>> print(data)
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu

OK, so if you want to use regex here, I suggest the following approach:

>>> import re
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
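As a quick sanity check of what this pattern accepts and rejects, here is a small sketch run on a few names. The sample list is an assumption for illustration; "Mary Ann Smith" in particular is a made-up name that is not in the original data:

```python
import re

# Two runs of non-digit characters separated by whitespace, anchored at both ends.
names_regex = re.compile(r"^(\D+)\s(\D+)$")

samples = ["Assaf Spanier", "Yo9ssi Levi", "Yoram bibe9rman", "Mary Ann Smith"]
results = {}
for name in samples:
    match = names_regex.search(name)
    # match is None whenever a digit appears anywhere in the name
    results[name] = match.groups() if match else None
    print(name, "->", results[name])
```

Note that \D also matches spaces, so a three-word name is not rejected: the greedy first group absorbs everything up to the last whitespace. That is harmless for two-word names like the ones in the question, but worth knowing if the input is less regular.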

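The pattern here, ^(\D+)\s(\D+)$, uses the non-digit class \D (the opposite of \d, the digit class) and the whitespace class \s. It also uses anchors, ^ and $, to pin the pattern to the start and end of the text respectively. The parentheses create capturing groups, which we will make use of. If you still don't understand it, try copy-pasting it into http://regexr.com/ and playing with it. One important note: use raw strings, i.e. r"this is a raw string", rather than normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns, and raw strings help you keep your sanity. OK, finally, I suggest using the grouping idiom with a dict: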

>>> grouper = {}

Now, our loop:

>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...

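Note that I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument; if the key doesn't exist, it sets the second argument as the value, and in either case it returns the value stored under that key. So I am checking whether the last name is already in the grouper dict, setting it to an empty list [] if it is not, and then appending to whatever list is there!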
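To make the setdefault idiom concrete, here is a small sketch on a hand-written list of (first, last) pairs taken from the sample data. It also shows collections.defaultdict as an alternative; that is a substitution on my part, not what the loop above uses:

```python
from collections import defaultdict

pairs = [("Assaf", "Spanier"), ("David", "levi"), ("Amnon", "Levi"), ("Ehud", "sPanier")]

# dict.setdefault: insert an empty list the first time a key is seen,
# then append to whichever list is stored under that key.
grouper = {}
for first, last in pairs:
    grouper.setdefault(last.title(), []).append(first.title())

# collections.defaultdict(list) creates the empty list automatically on first access.
grouper_dd = defaultdict(list)
for first, last in pairs:
    grouper_dd[last.title()].append(first.title())

print(grouper)
print(dict(grouper_dd) == grouper)
```

Both versions build the same mapping. setdefault keeps a plain dict, while defaultdict saves the repeated default argument at the cost of silently creating keys on lookup.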

Now, pretty-printing for clarity:

>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}

This is a very useful data structure. We can, for example, get all the last names that have more than one first name:

>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi

So, putting it all together:

>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi

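Note that I've assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because in Python 3.6 dicts are ordered! In 3.6 that was an implementation detail rather than a guarantee; insertion order has only been part of the language specification since Python 3.7. If you need guaranteed order on older versions, or want to make the intent explicit, use collections.OrderedDict.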
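A minimal sketch of the OrderedDict variant, assuming the names have already been filtered and title-cased. The names list below is a hand-picked subset of the sample data, not the output of the regex:

```python
from collections import OrderedDict

names = [("Assaf", "Spanier"), ("Assaf", "Din"), ("David", "Levi"), ("Ehud", "Spanier")]

grouper = OrderedDict()  # remembers the order in which keys were first inserted
for first, last in names:
    grouper.setdefault(last, []).append(first)

# Last names with more than one first name, in first-seen order.
repeated = [last for last, firsts in grouper.items() if len(firsts) > 1]
print(list(grouper))
print(repeated)
```

The grouping code is unchanged; only the container differs, so switching between dict and OrderedDict is a one-line decision.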

Answer 2

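Fine, since you insist on doing it with regex, you should strive to do it in a single call so you don't suffer the context-switching overhead. The best approach is to write a pattern that captures all comma-separated first name / last name pairs that contain no digits, let the regex engine capture them all, then iterate over the matches and finally map them into a dictionary so you end up with a last name => first names map: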

import collections
import re

text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern

matches = collections.OrderedDict()  # store for the last=>first name map preserving order
for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name
print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)

This will give you:

Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi

Additionally, you have the complete last name => first names map in matches.

As for the pattern, let's break it down piece by piece:

(?:^|\s|,) - match the beginning of the string, whitespace or a comma (non-capturing)
  ([^\d\s]+) - followed by one or more characters that are neither digits nor whitespace
               (capturing)
    \s+  - followed by one or more whitespace characters (non-capturing)
      ([^\d\s]+) - followed by the same pattern as for the first name (capturing)
         (?=$|,) - followed by a comma or end of the string  (look-ahead, non-capturing)

As we iterate over the matches, we reference the two captured groups (first and last name) on each match object. Easy peasy.
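The sketch below exercises the pattern with the look-ahead written as (?=$|,) on a shortened, made-up input whose final name is digit-free, so the end-of-string branch of the look-ahead is actually used. The text value is an assumption for illustration, not the original file:

```python
import re

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")

# The middle name contains a digit and must be skipped; the last name ends the string.
text = "Assaf Spanier, Yo9ssi Levi, David levi"

pairs = [(m.group(1), m.group(2)) for m in full_name.finditer(text)]
print(pairs)
```

Because the look-ahead does not consume the comma, the next match can still start from it, which is what lets finditer walk the whole comma-separated list in one pass.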

Find repeated words in a string and print them using re
