Extracting names from lines

Time: 2013-06-27 20:13:57

Tags: python text-parsing

My data is formatted as follows:

Bxxxx, Mxxxx F  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Axxxx Brown  Birmingham   AL (123) 555-2281  NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029    -99.8115
Axxxx, Bxxxx    Mobile   AL (123) 555-8011  NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639    -99.053238
Axxxx, Rxxxx Lunsford   Athens   AL (123) 555-8119  NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision   English 99.804501   -99.971283
Axxxx, Mxxxx    Mobile   AL (123) 555-5963  NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision   English 99.68639    -99.053238
Axxxx, Txxxx    Mountain Brook   AL (123) 555-3099  NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery    English 99.50214    -99.75557
Axxxx, Lxxxx    Birmingham   AL (123) 555-4550  NCC Addictions and Dependency, Eating Disorders English 99.52029    -99.8115
Axxxx, Wxxxx    Birmingham   AL (123) 555-2328  NCC     English 99.52029    -99.8115
Axxxx, Rxxxx    Mobile   AL (123) 555-9411  NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639    -99.053238

I only need to extract the people's names. Ideally, I could use humanName to get a set of name objects with the fields name.firstname.middlename.lastname.title ...

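I have tried iterating until I hit the first two consecutive capital letters, which represent the state, then storing everything before them in a list and calling humanName on it, but that was a disaster. I don't want to keep trying that approach.

Is there a way to detect where words begin and end? That might help...

Suggestions?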

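2 answers:

Answer 0: (score: 1)

Not a code answer, but it looks like you can get most or all of this data from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. The names are easy to get at there.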

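Answer 1: (score: 1)

Your best bet is to find a different data source. Seriously. This one is a mess.

If you can't do that, then I would do something like this: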

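  1. Replace all double spaces with single spaces.
  2. Split on spaces.
  3. Take the last two items in the list. Those are lat and lng.
  4. Loop backwards through the list, looking each item up in a list of potential languages. Once a lookup fails, you are done with languages.
  5. Join the remaining list items back together with spaces.
  6. In the line, find the first opening parenthesis. Read about 13 or 14 characters, replace all punctuation with empty strings, and reformat the result as a normal phone number.
  7. Split the rest of the line after the phone number on commas.
  8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to certifications. Otherwise, add it to areas of practice.
  9. Go back to the index found in step #6 and take the line up to that point. Split it on spaces and take the last item. That is the state. All that's left is the name and the city!
  10. Take the first 2 items in the space-split line. So far, that is your best guess at the name.
  11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
  12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
  13. In the US data file, split every row on tabs. Take the data at indexes 2 and 4, which are the city name and the state abbreviation. Loop through all the rows and, for each one, join abbreviation + ":" + city name (e.g. AK:Sand Point) and add it to a new list.
  14. Build every possible combination of the remaining items in the line, in the same format as step #13. So for the second line you end up with AL:Brown Birmingham and AL:Birmingham.
  15. Loop through each combination and search for it in the list created in step #13. If you find it, remove it from the split list.
  16. Add all remaining items in the split list to the person's name.
  17. If desired, split the name on the comma. index[0] is the last name and index[1] is all the remaining names. Don't make any assumptions about middle names.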
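
Step 6 uses the phone number as the anchor that separates the name, city and state from everything else, so it may help to see it in isolation. The following is only a minimal sketch: the function name `split_on_phone` is made up for illustration, the pattern assumes every phone number looks like the ones in the sample rows, "(123) 555-2281", and the dashed output format is just one guess at what a "normal phone number" means.

```python
import re

# Assumed layout, taken from the sample rows: "(123) 555-2281"
PHONE_PATTERN = re.compile(r"\((\d{3})\)\s*(\d{3})-(\d{4})")


def split_on_phone(line):
    """Return (text_before, normalized_phone, text_after), or None if no phone is found."""
    match = PHONE_PATTERN.search(line)
    if match is None:
        return None
    # Keep only the digit groups and rejoin them with dashes (an assumed format)
    phone = "-".join(match.groups())
    before = line[:match.start()].strip()
    after = line[match.end():].strip()
    return before, phone, after
```

Matching the whole pattern instead of counting 13 or 14 characters means a stray "(" earlier in the line is less likely to be mistaken for the phone number, and rows without a phone can be skipped or logged instead of being sliced at the wrong position.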

Just for giggles, I implemented this. Enjoy.

    import itertools
    
    # this list of languages could be longer and should read from a file
    languages = ["English", "Spanish", "Italian", "Japanese", "French",
                 "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]
    
    languages = [language.lower() for language in languages]
    
    # Loop through US.txt and format it. Download from geonames.org.
    # A set keeps the "STATE:City" membership test further down fast.
    cities = set()
    with open('US.txt', 'r') as us_data:
        for line in us_data:
            line_split = line.split("\t")
            cities.add("{}:{}".format(line_split[4], line_split[2]))
    
    # This is the dataset
    with open('state-teachers.txt', 'r') as teachers:
        next(teachers)  # skip header
    
        for line in teachers:
            # Split on any run of whitespace, so repeated spaces (or tabs)
            # between columns never produce empty items
            line_split = line.split()
    
            # Lat/Lon are the last 2 items
            longitude = line_split.pop().strip()
            latitude = line_split.pop().strip()
    
            # Search for potential languages and trim off the line as we find them
            teacher_languages = []
    
            # Stop when the list runs out, so a short line cannot raise IndexError
            while line_split:
                language_check = line_split[-1]
                if language_check.lower().replace(",", "").strip() in languages:
                    teacher_languages.append(language_check)
                    del line_split[-1]
                else:
                    break
    
            # Rejoin everything and then use phone number as the special key to split on
            line = " ".join(line_split)
    
            phone_start = line.find("(")
            phone = line[phone_start:phone_start+14].strip()
    
            after_phone = line[phone_start+15:]
    
            # Certifications can be recognized as acronyms
            # Anything else is assumed to be an area of practice
            certifications = []
            areas_of_practice = []
    
            specialties = after_phone.split(",")
            for specialty in specialties:
                specialty = specialty.strip()
                if not specialty:
                    continue  # nothing listed after the phone number
                # Two leading capitals (e.g. "NCC") mark an acronym
                if specialty[0:2].isupper():
                    certifications.append(specialty)
                else:
                    areas_of_practice.append(specialty)
    
            before_phone = line[0:phone_start-1]
            line_split = before_phone.split(" ")
    
            # State is the last column before phone
            state = line_split.pop()
    
            # Name should be the first 2 columns, at least. This is a basic guess.
            name = line_split[0] + " " + line_split[1]
    
            line_split = line_split[2:]
    
            # Add initials (only if something is left after the first two items)
            if line_split and len(line_split[0].strip()) == 1:
                name += " " + line_split[0].strip()
                line_split = line_split[1:]
    
            # Combo of all potential word combinations to see if we're dealing with a city or a name
            combos = [" ".join(combo) for combo in set(itertools.permutations(line_split))] + line_split
    
            line = " ".join(line_split)
            city = ""
    
            # See if the state:city combo is valid. If so, set it and let everything else be the name
            for combo in combos:
                if "{}:{}".format(state, combo) in cities:
                    city = combo
                    line = line.replace(combo, "")
                    break
    
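            # Remaining data must be a name
            if line.strip() != "":
                name += " " + line.strip()

            # Clean up names: split on the first comma only, so an extra comma
            # in the name cannot break the unpacking
            last_name, _, first_name = [piece.strip() for piece in name.partition(",")]

            print("{} {}".format(first_name, last_name))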
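
Since the name is the only thing the question asks for, steps 13 to 16 can also be tried on their own, without the full US.txt download. The sketch below is an assumption-laden variant, not the script above: `load_cities` and `split_name_and_city` are hypothetical helper names, the two sample rows are made up (only columns 2 and 4 matter), and instead of testing every permutation it only tests trailing runs of words, on the assumption that the city always sits directly before the state.

```python
def load_cities(rows):
    """Build a set of "STATE:City" keys from tab-separated rows (step 13)."""
    cities = set()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) > 4:
            cities.add("{}:{}".format(fields[4], fields[2]))
    return cities


def split_name_and_city(before_phone, cities):
    """Split the text before the phone number into (name, city, state)."""
    tokens = before_phone.split()
    if len(tokens) < 2:
        return before_phone.strip(), "", ""
    state = tokens.pop()
    # Try the longest trailing run first, so "Mountain Brook" is found as a whole.
    # Starting at 1 guarantees at least one word is left for the name.
    for start in range(1, len(tokens)):
        candidate = " ".join(tokens[start:])
        if "{}:{}".format(state, candidate) in cities:
            return " ".join(tokens[:start]), candidate, state
    # No known city: treat everything as the name, as step 16 would
    return " ".join(tokens), "", state


# Made-up rows in the layout described in step 13; the postal codes are placeholders
sample_rows = [
    "US\t00000\tBirmingham\tAlabama\tAL",
    "US\t00000\tMountain Brook\tAlabama\tAL",
]
cities = load_cities(sample_rows)
print(split_name_and_city("Axxxx, Axxxx Brown  Birmingham   AL", cities))
print(split_name_and_city("Axxxx, Txxxx    Mountain Brook   AL", cities))
```

The returned name still has the "Last, First Middle" shape, so it can be split on the comma as in step 17 or handed to a name parser. A surname that happens to match a city in the same state would still be misread, which is one more reason to prefer a cleaner data source.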