Question

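First, the goal is to pick out only the names from a string made up of Korean names, English names, special characters (-, *, commas), spaces, and so on, and to keep just one copy of any name that is repeated.

So far, I have read a text file, converted it to a single string, and removed the unnecessary special characters.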

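import re

# Raw string so the backslashes in the Windows path are taken literally
path = r'E:\Data Science\Personal_Project\Church\Data\original.txt'

def open_text(path):
    with open(path, "r", encoding='euc-kr') as f:
        text = f.readlines()
    string = ''.join(text)
    # Drop '.', ';', '*' and newline characters
    unicode_line = string.translate({ord(c): None for c in '.;*\n'})
    # Split on runs of dashes; the original pattern '-|' also matches the empty string
    cleaned = re.split('-+', unicode_line)
    print(unicode_line, type(cleaned))
    return cleaned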

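Here is the problem. This is what I would like to add to the function above:

1) If there is text before a dashed line (for example "Attendance ---"), I want to remove that preceding text (i.e. "Attendance") first and then split on the dashes.

2) Alternatively, I want to define a list, e.g. [Attendance, Absent, holiday], and remove the words contained in that list.

I would appreciate it if you could show me a better or more Pythonic way!

For convenience, I am adding a sample text.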

Status of January 20th




** Attendance
-----------

John Smith, John Smith, Bob Smith, Mike Smith, Jane Jones, Daniel Lee, Dong Jones, Jeannie Jones, Jessica Yi, McAleer Chung, Shu K Smith, Song Kim, Steve Carlos, Bob Smith





** Absent
---------

holiday, unauthorized, unpaid leave, emergency
------------------------------------------------------------------------------------------- 
Brown Williams, Paul Garcia

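Also, this is the output I want, containing only the names with no duplicates. As you can see above, there are two John Smiths and two Bob Smiths. Finally, it would be great if I could get the names in alphabetical order.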

Output:


John Smith, Bob Smith, Mike Smith, Jane Jones, Daniel Lee, Dong Jones, Jeannie Jones, Jessica Yi, McAleer Chung, Shu K Smith, Song Kim, Steve Carlos, Brown Williams, Paul Garcia

Answer 1

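If I understand you correctly, you want a set of all the names in the document, excluding the words in the heading lines and the words in a predefined list of non-name words such as "holiday".

First, I would suggest not joining all the lines. Instead, check whether each line starts with - or * and skip it if it does. This also makes it easier to skip the first line with the title. Then you just define the list of non-name words, loop over the lines in the file, and split each one on ", ".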

non_names = set("holiday, unauthorized, unpaid leave, emergency".split(", "))
with open("text.txt") as f:
    next(f) # skip first line
    names = set()
    for line in f:
        if not line.startswith(("*", "-")):
            for name in line.strip().split(", "):
                if name and name not in non_names:
                    names.add(name)

Or use set directly on a (somewhat complex) generator expression:

    names = set(name for line in f
                     if not line.startswith(("*", "-"))
                     for name in line.strip().split(", ")
                     if name and name not in non_names)

Either way, the result is {'John Smith', 'Jeannie Jones', 'Mike Smith', 'Bob Smith', 'McAleer Chung', 'Steve Carlos', 'Brown Williams', 'Jessica Yi', 'Paul Garcia', 'Jane Jones', 'Shu K Smith', 'Song Kim', 'Daniel Lee', 'Dong Jones'}. To get the names sorted, just sort the set, or, if you want to sort by last name, use a special key function:

names = sorted(names, key=lambda s: s.split()[-1])
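A set does not remember the order in which names were added, while the expected output in the question lists them in order of first appearance. Here is a minimal sketch of the same filtering that keeps that order by using a dict instead of a set. The function name extract_names and the shortened sample text are made up for illustration, and it assumes headings start with * and rules start with -, as in the question's sample.

```python
def extract_names(lines, non_names=()):
    """Return unique names in first-seen order, skipping heading and rule lines."""
    excluded = set(non_names)
    seen = {}  # dicts preserve insertion order (Python 3.7+)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("*", "-")):
            continue  # blank line, "** Heading" line, or dashed rule
        for name in line.split(","):
            name = name.strip()
            if name and name not in excluded:
                seen.setdefault(name, None)
    return list(seen)


# Shortened, hypothetical version of the sample text from the question
sample = """Status of January 20th

** Attendance
-----------

John Smith, John Smith, Bob Smith, Mike Smith

** Absent
---------

holiday, unauthorized, unpaid leave, emergency
-------------------
Brown Williams, Paul Garcia
"""

lines = sample.splitlines()[1:]  # drop the title line, like next(f) above
names = extract_names(lines, ["holiday", "unauthorized", "unpaid leave", "emergency"])
print(names)          # first-seen order, duplicates removed
print(sorted(names))  # plain alphabetical order
```

For a real file you would pass the open file object instead of the list of lines, after skipping the first line with next(f).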

Answer 2

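A possible solution:

1) Assume the file format is the same as the one you gave
2) Go through the file line by line
3) Ignore every line in which the first and second entries are not both capitalized
4) Then treat the line as a list of names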

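list_to_be_returned = []

# `file` is assumed to be an already opened text file
for line in file:
  words = [w.strip() for w in line.split(",")]

  # No one has just one name like Tupac
  if len(words) > 1:
    # Check to see if the first letters of the first two entries are uppercase
    if words[0][:1].isupper() and words[1][:1].isupper():
      # name line
      list_to_be_returned += words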

Something like that.
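The capitalization check can also be wrapped in a small helper. This is only a sketch of a stricter variant of the idea: it requires every word of every entry to start with an uppercase letter, not just the first two entries. The name looks_like_name_line is made up for illustration. Note that Hangul has no letter case, so str.isupper() returns False for it and this heuristic would reject lines of Korean names.

```python
def looks_like_name_line(line):
    """Heuristic: at least two comma-separated entries, every word capitalized."""
    entries = [e.strip() for e in line.split(",") if e.strip()]
    if len(entries) < 2:
        return False  # titles, headings and dashed rules have no commas
    return all(word[0].isupper() for e in entries for word in e.split())


print(looks_like_name_line("Brown Williams, Paul Garcia"))
print(looks_like_name_line("holiday, unauthorized, unpaid leave, emergency"))
```

A line holding a single name would be skipped by this check, which is the same trade-off as the len(words) > 1 test above.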

Answer 3

with open(filename) as file:
    words = file.read().split()

You can also use a regular expression:

import re

with open(filename) as file:
    words = re.findall(r'\w+', file.read())
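Keep in mind that both snippets return single words ("John", "Smith"), so multi-word names get split apart. Here is a rough sketch of a regex variant that keeps full names together: it first blanks out the "** Heading" lines and dashed rules, then splits on commas and newlines. The sample text is a shortened, hypothetical version of the one in the question, and the patterns assume headings start with ** and rules consist only of dashes.

```python
import re

# Shortened, hypothetical sample in the same layout as the question's text
text = """** Attendance
-----------

John Smith, John Smith, Bob Smith

** Absent
---------
Paul Garcia
"""

# Blank out "** Heading" lines and dashed rules, one line at a time
body = re.sub(r"^(?:\*\*.*|-+[ \t]*)$", "", text, flags=re.MULTILINE)

# Split on commas and newlines so that multi-word names stay intact
names = [part.strip() for part in re.split(r"[,\n]", body) if part.strip()]

unique_sorted = sorted(set(names))
print(unique_sorted)
```

Non-name words such as "holiday" would still need to be filtered out against a list, as in the first answer.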

Extracting names from a string in Python
