Matching states and cities that may contain multiple words

Time: 2018-01-01 10:46:57

Tags: python regex python-3.x

I have a Python list with elements like the following:

['Alabama[edit]',
 'Auburn (Auburn University)[1]',
 'Florence (University of North Alabama)',
 'Jacksonville (Jacksonville State University)[2]',
 'Livingston (University of West Alabama)[2]',
 'Montevallo (University of Montevallo)[2]',
 'Troy (Troy University)[2]',
 'Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]',
 'Tuskegee (Tuskegee University)[5]',
 'Alaska[edit]',
 'Fairbanks (University of Alaska Fairbanks)[2]',
 'Arizona[edit]',
 'Flagstaff (Northern Arizona University)[6]',
 'Tempe (Arizona State University)',
 'Tucson (University of Arizona)',
 'Arkansas[edit]',
 'Arkadelphia (Henderson State University, Ouachita Baptist University)[2]',
 'Conway (Central Baptist College, Hendrix College, University of Central Arkansas)[2]',
 'Fayetteville (University of Arkansas)[7]']

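The list is not complete, but it is enough to give you an idea of what it contains.

The data is structured as follows:

There is the name of a US state, followed by the names of some cities in that state. As you can see, state names end with "[edit]", while city names end either with a number in square brackets (e.g. "[1]" or "[2]") or with university names in parentheses (e.g. "(University of North Alabama)").

(The complete reference file for this problem can be found here.)

Ideally, I want a Python dictionary with the state names as keys and, as the value for each key, a list of the names of all the cities in that state. So, for example, the dictionary should be: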

{'Alabama': ['Auburn', 'Florence', 'Jacksonville'...], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson', ....], ......}

Now, I tried the following solution to strip out the unnecessary parts:

import numpy as np
import pandas as pd

def get_list_of_university_towns():
    '''
    Returns a DataFrame of towns and the states they are in from the 
    university_towns.txt list. The format of the DataFrame should be:
    DataFrame( [ ["Michigan", "Ann Arbor"], ["Michigan", "Yipsilanti"] ], 
    columns=["State", "RegionName"]  )

    The following cleaning needs to be done:

    1. For "State", removing characters from "[" to the end.
    2. For "RegionName", when applicable, removing every character from " (" to the end.
    3. Depending on how you read the data, you may need to remove newline character '\n'. 

    '''

    fhandle = open("university_towns.txt")
    ftext = fhandle.read().split("\n")

    reftext = list()
    for item in ftext:
        reftext.append(item.split(" ")[0])

    #pos = reftext[0].find("[")
    #reftext[0] = reftext[0][:pos]

    towns = list()
    dic = dict()

    for item in reftext:
        if item == "Alabama[edit]":
            state = "Alabama"

        elif item.endswith("[edit]"):
            dic[state] = towns
            towns = list()
            pos = item.find("[")
            item = item[:pos]
            state = item

        else:
            towns.append(item)

    return ftext

get_list_of_university_towns()

A snippet of the output generated by my code looks like this:

{'Alabama': ['Auburn',
  'Florence',
  'Jacksonville',
  'Livingston',
  'Montevallo',
  'Troy',
  'Tuscaloosa',
  'Tuskegee'],
 'Alaska': ['Fairbanks'],
 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'],
 'Arkansas': ['Arkadelphia',
  'Conway',
  'Fayetteville',
  'Jonesboro',
  'Magnolia',
  'Monticello',
  'Russellville',
  'Searcy'],
 'California': ['Angwin',
  'Arcata',
  'Berkeley',
  'Chico',
  'Claremont',
  'Cotati',
  'Davis',
  'Irvine',
  'Isla',
  'University',
  'Merced',
  'Orange',
  'Palo',
  'Pomona',
  'Redlands',
  'Riverside',
  'Sacramento',
  'University',
  'San',
  'San',
  'Santa',
  'Santa',
  'Turlock',
  'Westwood,',
  'Whittier'],
 'Colorado': ['Alamosa',
  'Boulder',
  'Durango',
  'Fort',
  'Golden',
  'Grand',
  'Greeley',
  'Gunnison',
  'Pueblo,'],
 'Connecticut': ['Fairfield',
  'Middletown',
  'New',
  'New',
  'New',
  'Storrs',
  'Willimantic'],
 'Delaware': ['Dover', 'Newark'],
 'Florida': ['Ave',
  'Boca',
  'Coral',
  'DeLand',
  'Estero',
  'Gainesville',
  'Orlando',
  'Sarasota',
  'St.',
  'St.',
  'Tallahassee',
  'Tampa'],
 'Georgia': ['Albany',
  'Athens',
  'Atlanta',
  'Carrollton',
  'Demorest',
  'Fort',
  'Kennesaw',
  'Milledgeville',
  'Mount',
  'Oxford',
  'Rome',
  'Savannah',
  'Statesboro',
  'Valdosta',
  'Waleska',
  'Young'],
 'Hawaii': ['Manoa'],

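However, there is one error in the output: states with a space in their name (e.g. "North Carolina") are not included. I can tell the reason behind it.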
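The cause can be seen in a small sketch: `item.split(" ")[0]` keeps only the text before the first space, so multi-word names are cut off before the `[edit]` check runs. The sample lines below are illustrative; the North Carolina entries are assumed to follow the same format as the rest of the file.

```python
# Illustrative sample lines in the same format as the question's data
# (the North Carolina entries are an assumption about the file's content).
sample = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "North Carolina[edit]",
    "Chapel Hill (University of North Carolina)[2]",
]

# Same step as in the question: keep only the text before the first space.
first_words = [line.split(" ")[0] for line in sample]
print(first_words)

# "North Carolina[edit]" has become "North", which no longer ends with
# "[edit]", so the loop treats it as a city instead of a new state.
states_detected = [w for w in first_words if w.endswith("[edit]")]
print(states_detected)
```

The same truncation explains entries such as 'San', 'Santa' and 'New' in the city lists above.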

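I thought about using regular expressions, but since I have not studied them yet, I do not know how to write one. Any ideas on how this can be done, with or without regex?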

4 answers:

Answer 0: (score: 4)

Praise the power of regular expressions then:

import re

# lst_ is the list shown in the question
states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]
(?P<cities>[\s\S]+?)
(?=^.*\[edit\]$|\Z)
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'''^[^()\n]+''', re.MULTILINE)

transformed = '\n'.join(lst_)

result = {state.group('state'): [city.group(0).rstrip() 
        for city in cities_rx.finditer(state.group('cities'))] 
        for state in states_rx.finditer(transformed)}
print(result)

This yields

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 'Alaska': ['Fairbanks'], 'Arizona': ['Flagstaff', 'Tempe', 'Tucson'], 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']}
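As a quick check that this approach copes with multi-word names, the sketch below runs the same two patterns on a short hand-made list. The sample entries and the helper name `parse_states` are assumptions made for illustration, not part of the original answer.

```python
import re

states_rx = re.compile(r'''
^
(?P<state>.+?)\[edit\]      # state name up to the edit marker
(?P<cities>[\s\S]+?)        # everything that follows, lazily
(?=^.*\[edit\]$|\Z)         # until the next state line or the end of input
''', re.MULTILINE | re.VERBOSE)

cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)


def parse_states(lines):
    """Hypothetical helper wrapping the two patterns above."""
    text = '\n'.join(lines)
    return {
        state.group('state'): [
            city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))
        ]
        for state in states_rx.finditer(text)
    }


# Illustrative sample with multi-word state and city names.
sample = [
    'California[edit]',
    'Isla Vista (University of California, Santa Barbara)',
    'Palo Alto (Stanford University)',
    'North Carolina[edit]',
    'Chapel Hill (University of North Carolina)[2]',
]
parsed = parse_states(sample)
print(parsed)
```

Because the state pattern captures everything up to `[edit]` and the city pattern captures everything up to the first parenthesis, spaces inside names are kept.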

Explanation:

The idea is to split the task into several smaller tasks:

  1. Join the complete list with \n
  2. Separate the states
  3. Separate the towns
  4. Use a dict comprehension over all found items

First subtask

transformed = '\n'.join(your_list)

Second subtask

^                      # match start of the line
(?P<state>.+?)\[edit\] # capture anything in that line up to [edit]
(?P<cities>[\s\S]+?)   # afterwards match anything up to
(?=^.*\[edit\]$|\Z)    # ... either another state or the very end of the string

See the demo on regex101.com

Third subtask

^[^()\n]+              # match start of the line, anything not a newline character or ( or )

See another demo on regex101.com

Fourth subtask

result = {state.group('state'): [city.group(0).rstrip() for city in cities_rx.finditer(state.group('cities'))] for state in states_rx.finditer(transformed)}

This is roughly equivalent to:

result = {}
for state in states_rx.finditer(transformed):
    # state is in state.group('state')
    cities = []
    for city in cities_rx.finditer(state.group('cities')):
        # city is in city.group(0), possibly with whitespaces
        # hence the rstrip
        cities.append(city.group(0).rstrip())
    result[state.group('state')] = cities

Finally, a note on timing: running the above 100,000 times took about 12 seconds on my computer, so it should be reasonably fast.
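The timing code itself is not shown above. A minimal sketch of how such a measurement could be made with the standard `timeit` module follows; the wrapper `find_states`, the shortened sample list and the repeat count are assumptions, and the numbers will differ from machine to machine.

```python
import re
import timeit

states_rx = re.compile(
    r'^(?P<state>.+?)\[edit\](?P<cities>[\s\S]+?)(?=^.*\[edit\]$|\Z)',
    re.MULTILINE,
)
cities_rx = re.compile(r'^[^()\n]+', re.MULTILINE)

# Shortened stand-in for the full list from the question.
sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]
transformed = '\n'.join(sample)


def find_states():
    """Hypothetical wrapper so that timeit has a callable to run."""
    return {
        state.group('state'): [
            city.group(0).rstrip()
            for city in cities_rx.finditer(state.group('cities'))
        ]
        for state in states_rx.finditer(transformed)
    }


# number controls how many times the callable is executed.
elapsed = timeit.timeit(find_states, number=1000)
print(f"{elapsed:.3f} s for 1000 runs")
```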

Answer 1: (score: 3)

You [c/sh]ould change

fhandle = open("university_towns.txt")
ftext = fhandle.read().split("\n") 

# to

with open("university_towns.txt","r") as f:
    d = f.readlines()

# file is autoclosed here, lines are autosplit by readlines()
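One detail to keep in mind with `readlines()`: each returned line still carries its trailing newline, so it needs to be stripped before use. The sketch below shows this with a temporary file standing in for `university_towns.txt` (the file content is illustrative).

```python
import os
import tempfile

# Write a small stand-in file; the two lines are illustrative.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Alabama[edit]\nAuburn (Auburn University)[1]\n")
    path = tmp.name

with open(path, "r") as f:
    raw_lines = f.readlines()          # newline characters are kept

# Strip the trailing newline from every line before further processing.
clean_lines = [line.rstrip("\n") for line in raw_lines]
os.remove(path)

print(raw_lines)
print(clean_lines)
```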

Solution without regex:

def save(state,city,dic):
    '''Convenience function: append city to the state's list, or create an empty list for a new state.'''
    if state in dic:
        dic[state].append(city)
    else:
        dic[state] = [] # fix for glitch

dic = {}
state = "" 

with open("university_towns.txt","r") as f:
    d = f.readlines()  

for n in d:                                         # iterate all lines
    if "[edit]" in n:                                   # handles states
        act_state = n.replace("[edit]","").strip()      # clean up state
        # needed in case 2 states w/o cities follow right after each other
        save(act_state,"", dic)                         # create state in dic, no cities
        state = n.replace("[edit]","").strip()      # clean up state
    else:
        # splits at ( takes first and splits at [ takes first removes blanks
        #   => get city name before ( or [
        city = n.split("(")[0].split("[")[0].strip()  
        save(state,city,dic)                            # adds city to state in dic

print (dic)

Yields (reformatted):

{
 'Alabama' : ['Auburn', 'Florence', 'Jacksonville', 'Livingston',
              'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee'], 
 'Alaska'  : ['Fairbanks'], 
 'Arizona' : ['Flagstaff', 'Tempe', 'Tucson'], 
 'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville']
}
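A variation on the same idea, not part of the answer above: `collections.defaultdict` can take over the job of the `save` helper. The function name `group_towns` and the sample lines (including the New Hampshire entries) are assumptions for illustration.

```python
from collections import defaultdict


def group_towns(lines):
    """Group city names under the most recent '[edit]' state line."""
    towns = defaultdict(list)
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # skip blank lines
        if "[edit]" in line:
            state = line.replace("[edit]", "").strip()
            towns[state]                  # register the state even if it has no cities
        elif state is not None:
            # keep the text before the first "(" or "["
            city = line.split("(")[0].split("[")[0].strip()
            towns[state].append(city)
    return dict(towns)


# Illustrative lines, shaped like the output of readlines().
sample = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "New Hampshire[edit]\n",
    "Durham (University of New Hampshire)\n",
    "Plymouth[2]\n",
]
grouped = group_towns(sample)
print(grouped)
```

Since nothing here splits on spaces, multi-word states and cities pass through unchanged.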

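Answer 2: (score: 3)

Let's solve your problem step by step:

First step:

Collect all the data. Here I use a tracking word: whenever a state name appears, the marker 'pos_flag' is added to the list in front of it, and with the help of this marker we will track and chunk the data: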

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:

            track.append(line.strip().split('(')[0])

It gives output like this:

['pos_flag', 'Alabama', 'Auburn ', 'Florence ', 'Jacksonville ', 'Livingston ', 'Montevallo ', 'Troy ', 'Tuscaloosa ', 'Tuskegee ', 'pos_flag', 'Alaska', 'Fairbanks ', 'pos_flag', 'Arizona', 'Flagstaff ', 'Tempe ', 'Tucson ', 'pos_flag', 'Arkansas', 'Arkadelphia ', 'Conway ', 'Fayetteville ', 'Jonesboro ', 'Magnolia ', 'Monticello ', 'Russellville ', 'Searcy ', 'pos_flag', 

As you can see, the word 'pos_flag' appears before every state name. Now let's put this word to use:

Second step:

Track the indices of all the 'pos_flag' entries in the list:

index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)

This outputs something like:

[0, 10, 13, 18, 28, 55, 66, 75, 79, 93, 111, 114, 119, 131, 146, 161, 169, 182, 192, 203, 215, 236, 258, 274, 281, 292, 297, 306, 310, 319, 331, 338, 371, 391, 395, 419, 432, 444, 489, 493, 506, 512, 527, 551, 559, 567, 581, 588, 599, 614]

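Now that we have the index numbers, we can use them to chunk the list:

Final step:

Use the index numbers to chunk the list, setting the first word of each chunk as the dict key and the rest as the dict value: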

city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk runs to the end of the list; skip the flag and the state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)

Output:

Because dicts are not ordered in Python 3.5, the output order differs from that of the input file:

{'Kentucky': ['Bowling Green ', 'Columbia ', 'Georgetown ', 'Highland Heights ', 'Lexington ', 'Louisville ', 'Morehead ', 'Murray ', 'Richmond ', 'Williamsburg ', 'Wilmore '], 'Mississippi': ['Cleveland ', 'Hattiesburg ', 'Itta Bena ', 'Oxford ', 'Starkville '], 'Wisconsin': ['Appleton ', 'Eau Claire ', 'Green Bay ', 'La Crosse ', 'Madison ', 'Menomonie ', 'Milwaukee ', 
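If the input order matters on Python versions where plain dicts do not preserve insertion order, `collections.OrderedDict` from the standard library keeps keys in the order they were added. A minimal sketch, with a shortened hand-written `track` list standing in for the real one; it also strips the trailing space from each city, which the code above does not do.

```python
from collections import OrderedDict

# Shortened stand-in for the track list built in the first step.
track = ['pos_flag', 'Alabama', 'Auburn ', 'Florence ',
         'pos_flag', 'Alaska', 'Fairbanks ',
         'pos_flag', 'Arizona', 'Flagstaff ', 'Tempe ']

index_no = [i for i, value in enumerate(track) if value == 'pos_flag']

ordered = OrderedDict()
# Pair each marker position with the next one (or the end of the list).
for start, end in zip(index_no, index_no[1:] + [len(track)]):
    chunk = track[start:end]          # ['pos_flag', state, city, ...]
    ordered[chunk[1]] = [city.strip() for city in chunk[2:]]

print(list(ordered.items()))
```

Pairing each index with its successor also avoids the need for the `try`/`except IndexError` branch.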

Full code:

import re
pattern=r'\w+(?=\[edit\])'

track=[]
with open('mon.txt','r') as f:
    for line in f:
        match=re.search(pattern,line)
        if match:
            track.append('pos_flag')
            track.append(line.strip().split('[')[0])
        else:

            track.append(line.strip().split('(')[0])


index_no=[]
for index,value in enumerate(track):
    if value=='pos_flag':
        index_no.append(index)


city_dict={}
for i in range(0,len(index_no),1):
    try:
        value_1=track[index_no[i:i + 2][0]:index_no[i:i + 2][1]]
        city_dict[value_1[1]]=value_1[2:]
    except IndexError:
        # last chunk runs to the end of the list; skip the flag and the state name
        city_dict[track[index_no[i:i + 2][0]:][1]]=track[index_no[i:i + 2][0]:][2:]

print(city_dict)
  

Second solution:

If you want to use regex, try this small solution:

import re
pattern=r'((\w+\[edit\])(?:(?!^\w+\[edit\]).)*)'
with open('file.txt','r') as f:
    prt=re.finditer(pattern,f.read(),re.DOTALL | re.MULTILINE)

    for line in prt:
        dict_p={}
        match = []
        match.append(line.group(1))
        dict_p[match[0].split('\n')[0].strip().split('[')[0]]= [i.split('(')[0].strip() for i in match[0].split('\n')[1:][:-1]]

        print(dict_p)

It gives:

{'Alabama': ['Auburn', 'Florence', 'Jacksonville', 'Livingston', 'Montevallo', 'Troy', 'Tuscaloosa', 'Tuskegee']}
{'Alaska': ['Fairbanks']}
{'Arizona': ['Flagstaff', 'Tempe', 'Tucson']}
{'Arkansas': ['Arkadelphia', 'Conway', 'Fayetteville', 'Jonesboro', 'Magnolia', 'Monticello', 'Russellville', 'Searcy']}
{'California': ['Angwin', 'Arcata', 'Berkeley', 'Chico', 'Claremont', 'Cotati', 'Davis', 'Irvine', 'Isla Vista', 'University Park, Los Angeles', 'Merced', 'Orange', 'Palo Alto', 'Pomona', 'Redlands', 'Riverside', 'Sacramento', 'University District, San Bernardino', 'San Diego', 'San Luis Obispo', 'Santa Barbara', 'Santa Cruz', 'Turlock', 'Westwood, Los Angeles', 'Whittier']}
{'Colorado': ['Alamosa', 'Boulder', 'Durango', 'Fort Collins', 'Golden', 'Grand Junction', 'Greeley', 'Gunnison', 'Pueblo, Colorado']}
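The loop above prints one small dictionary per state. To end up with the single dictionary the question asks for, the per-state results can be collected into one dict instead. The sketch below uses `re.split` rather than the pattern above, with the state line matched as a whole line so that names containing spaces are kept; `text` is a shortened, illustrative stand-in for the file content.

```python
import re

# Illustrative stand-in for f.read(); the New Mexico entries are assumed.
text = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Troy (Troy University)[2]\n"
    "New Mexico[edit]\n"
    "Las Cruces (New Mexico State University)\n"
)

# Splitting on whole "<state>[edit]" lines with a capture group yields
# ['', state1, block1, state2, block2, ...].
parts = re.split(r'^(.+?)\[edit\]$', text, flags=re.MULTILINE)

all_states = {}
for state, block in zip(parts[1::2], parts[2::2]):
    all_states[state] = [
        line.split('(')[0].split('[')[0].strip()
        for line in block.splitlines()
        if line.strip()                 # ignore empty lines inside the block
    ]

print(all_states)
```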


Answer 3: (score: 2)

I tried to eliminate the need for more than one regex.

import re

def mkdict(data):
  state, dict = None, {}
  rx = re.compile(r'^(?:(.+\[edit\])|([^\(\n:]+))', re.M)
  for m in rx.finditer(data):
    if m.groups()[0]:
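      # slice off the literal suffix; rstrip('[edit]') strips a character set
      # and would turn e.g. 'Vermont' into 'Vermon'
      state = m.groups()[0][:-len('[edit]')]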
      dict[state] = []
    else:
      dict[state].append(m.groups()[1].rstrip())
  return dict

if __name__ == '__main__':
  import sys, timeit, functools
  data = sys.stdin.read()
  print(timeit.Timer(functools.partial(mkdict, data)).timeit(10**3))
  print(mkdict(data))

Try it online
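One thing to watch when trimming the `[edit]` suffix: `str.rstrip` treats its argument as a set of characters rather than a literal suffix, so it can also eat trailing letters of the state name. A small sketch of the difference, using a hypothetical helper `strip_edit`:

```python
def strip_edit(name):
    """Remove a literal trailing '[edit]' if present (hypothetical helper)."""
    suffix = '[edit]'
    return name[:-len(suffix)] if name.endswith(suffix) else name


# Left column: rstrip with a character set; right column: literal suffix removal.
for raw in ('Alabama[edit]', 'Vermont[edit]', 'Delaware[edit]'):
    print(raw.rstrip('[edit]'), '|', strip_edit(raw))
```

State names ending in any of the letters e, d, i or t are affected by the `rstrip` variant, while the suffix-based helper leaves them intact.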