My data is formatted as follows:
Bxxxx, Mxxxx F Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Axxxx Brown Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Bxxxx Mobile AL (123) 555-8011 NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639 -99.053238
Axxxx, Rxxxx Lunsford Athens AL (123) 555-8119 NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision English 99.804501 -99.971283
Axxxx, Mxxxx Mobile AL (123) 555-5963 NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision English 99.68639 -99.053238
Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery English 99.50214 -99.75557
Axxxx, Lxxxx Birmingham AL (123) 555-4550 NCC Addictions and Dependency, Eating Disorders English 99.52029 -99.8115
Axxxx, Wxxxx Birmingham AL (123) 555-2328 NCC English 99.52029 -99.8115
Axxxx, Rxxxx Mobile AL (123) 555-9411 NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639 -99.053238
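and I only need to extract the person's name. Ideally, I would be able to use humanName to get name objects with the fields name.first, name.middle, name.last, name.title, ...
I have tried iterating until I hit the first two consecutive capital letters, which represent the state, then storing everything before them in a list and calling humanName on it, but that was a disaster. I don't want to keep going down that road.
Is there a way to detect where words start and end? That might help...
Suggestions?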
Answer 0 (score: 1)
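Not a code answer, but it looks like you could get most or all of this data from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. The names are easy to get at there.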
Answer 1 (score: 1)
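Your best bet is to find a different data source. Seriously. This one is a mess.
If you can't do that, then this is roughly how I would go about it.
Just for giggles, I implemented it. Enjoy.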
import itertools

# This list of languages could be longer and should be read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]
languages = [language.lower() for language in languages]

# Loop through US.txt and format it as "STATE:City". Download from geonames.org.
# A set keeps the membership checks below fast.
cities = set()
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        cities.add("{}:{}".format(line_split[4], line_split[2]))

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header
    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")
        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []
        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use phone number as the special key to split on
        line = " ".join(line_split)
        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()
        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []
        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]
        line_split = line_split[2:]

        # Add initials (guard against nothing being left after the name)
        if line_split and len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

        # Combo of all potential word combinations to see if we're dealing with a city or a name
        combos = [" ".join(combo) for combo in set(itertools.permutations(line_split))] + line_split

        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names (split on the first comma only)
        last_name, first_name = [piece.strip() for piece in name.split(",", 1)]
        print(first_name, last_name)
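If you only care about the names, the name/city step can be pulled out on its own. Below is a minimal sketch of that idea, not a drop-in replacement for the script above. It assumes every row has a phone number shaped like "(123) 555-2281" and that you pass in a set of "STATE:City" strings built the same way as the cities lookup. The names PHONE_RE, split_name_and_city and known_cities are made up for this illustration. Instead of generating permutations, it tries the longest run of trailing words as the city first, which is why "Mountain Brook" wins over "Brook".

```python
import re

# Assumed phone shape, e.g. "(123) 555-2281"
PHONE_RE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")


def split_name_and_city(line, known_cities):
    """Return (name, city, state) taken from the text before the phone number.

    known_cities is a set of "STATE:City" strings (hypothetical helper input).
    Returns None if the line has no phone number or too few words.
    """
    match = PHONE_RE.search(line)
    if match is None:
        return None
    words = line[:match.start()].split()
    if len(words) < 3:
        return None
    # State is the last word before the phone number
    state = words.pop()
    # Keep at least two words for the name; try the longest city candidate first
    for start in range(2, len(words)):
        candidate = " ".join(words[start:])
        if "{}:{}".format(state, candidate) in known_cities:
            return " ".join(words[:start]), candidate, state
    # No known city matched: treat everything as the name
    return " ".join(words), "", state


known = {"AL:Birmingham", "AL:Mountain Brook", "AL:Brook"}
row = "Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency"
print(split_name_and_city(row, known))
```

The name part comes back still in "Last, First Middle" order, so it can go through the same comma split as in the script above. A surname that happens to be a valid city in the same state will still be misread, so treat the output as a best guess.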