Splitting text in a text file based on commas and spaces (Python)

Time: 2015-08-19 12:41:44

Tags: python parsing split

I need to parse a text file into two categories:

  1. University
  2. Location (e.g. Lahore, Peshawar, Jamshoro, Faisalabad)

But the text file contains the following text:

    "Imperial College of Business Studies, Lahore"
    "Government College University Faisalabad"
    "Imperial College of Business Studies Lahore"
    "University of Peshawar, Peshawar"
    "University of Sindh, Jamshoro"
    "London School of Economics"
    "Lahore School of Economics, Lahore"
    

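I wrote code that separates the location based on the comma. The code below only works for the first line of the file: after printing ' Lahore' it fails with the error 'list index out of range'.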

    file = open(path,'r')
    content = file.read().split('\n')
    
    for line in content:
        rep = line.replace('"','')
        loc = rep.split(',')[1]
        print "uni: "+rep
        print "Loc: "+str(loc)
    

Please help, I'm stuck on this. Thanks.

4 answers:

Answer 0: (score: 0)

Your input file does not have a comma on every line, which is what makes the code fail. You could do something like

if ',' in line:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()

to handle lines without a comma differently, or simply reformat the input.
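As a rough sketch, the check above could be wrapped in a small helper. The name `split_uni_loc` is made up for illustration, and the last-word fallback is only a guess that assumes the location is the final word of a comma-free line.

```python
def split_uni_loc(line):
    """Return a (university, location) tuple for one line of the input.

    Hypothetical helper: the last-word fallback is an assumption and
    will be wrong for lines that do not end in a place name.
    """
    # Drop surrounding whitespace first, then the surrounding quotes.
    text = line.strip().strip('"')
    if ',' in text:
        # Split on the last comma only, so earlier commas stay in the name.
        uni, loc = text.rsplit(',', 1)
    else:
        # No comma: treat the last word as the location.
        uni, _, loc = text.rpartition(' ')
    return uni.strip(), loc.strip()
```

For a line such as "London School of Economics" this returns 'Economics' as the location, which shows why the comma-free case needs extra knowledge about valid place names.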

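Answer 1: (score: 0)

You can split on the comma. The result is always a list, so you can check its length: if it is greater than one, the line contained at least one comma; otherwise (length one) it contained no comma.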

>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
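Applied to the data from the question, the length check might look like the sketch below. The function name `location_or_none` and the choice of `None` for "no comma found" are assumptions made for illustration.

```python
lines = [
    "Imperial College of Business Studies, Lahore",
    "Government College University Faisalabad",
]

def location_or_none(line):
    """Return the text after the last comma, or None if there is no comma."""
    parts = line.split(',')
    if len(parts) > 1:
        # At least one comma: the last piece is taken as the location.
        return parts[-1].strip()
    # A single piece means the line contained no comma at all.
    return None

results = [location_or_none(line) for line in lines]
```

Returning `None` instead of indexing `[1]` directly avoids the 'list index out of range' error and leaves the caller to decide what to do with comma-free lines.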

Answer 2: (score: 0)

I hope this works, although I could not get 'London'. Perhaps the data should be in a consistent format.

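# Read every line of the input file and close it afterwards
with open('places.txt') as f:
    f_data = f.readlines()
# Words that can end a line but are not place names
stop_words = ['School', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        # Text after the last comma is taken as the city
        city = p.split(',')[-1].strip()
    else:
        # No comma: fall back to the last word on the line
        city = p.split(' ')[-1].strip()
    if city not in places and city not in stop_words:
        places.append(city)
print places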

Output: ['Lahore', 'Faisalabad', 'Lahore', 'Peshawar', 'Jamshoro']

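Answer 3: (score: 0)

It seems you can only be sure that a line contains a location when it has a comma. It therefore makes sense to parse the file in two passes. The first pass builds a set holding all known locations. You can seed this set with some known examples or problem cases.

Pass two also uses the comma to find the location where there is one, but if there is no comma, the line is split into a set of words. The intersection of these words with the location set should give you the location. If there is no intersection, the line is flagged as "unknown".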

import re

locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0
    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit("," ,1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit("," ,1)
            loc = loc.strip()
        else:
            uni = line
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print "uni:", uni
        print "Loc:", loc

    print "Unknown locations:", unknown

The output will be:

uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
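The same two-pass idea can also be sketched as a function over a list of lines, which makes it easier to try without a file. This is only an illustration: `parse_lines` and `seed_locations` are made-up names, and picking the last known word on a comma-free line (rather than an arbitrary element of the intersection) is an assumption, not part of the answer above.

```python
import re

def parse_lines(lines, seed_locations=()):
    """Return a list of (university, location) tuples.

    location is None when no known location is found on the line.
    """
    # Strip spaces, quotes and newlines; skip blank lines.
    cleaned = [line.strip(' "\n') for line in lines if line.strip()]

    # Pass 1: collect every location that follows a comma.
    locations = set(seed_locations)
    for text in cleaned:
        if ',' in text:
            locations.add(text.rsplit(',', 1)[1].strip())

    # Pass 2: use the comma if present, otherwise look for a known word.
    results = []
    for text in cleaned:
        if ',' in text:
            uni, loc = text.rsplit(',', 1)
            loc = loc.strip()
        else:
            uni = text
            words = re.findall(r"\w+", text)
            known = [w for w in words if w in locations]
            # Assumption: prefer the last known word on the line.
            loc = known[-1] if known else None
        results.append((uni.strip(), loc))
    return results
```

Without any seed locations, "London School of Economics" yields `None`, because no line in the sample supplies 'London' after a comma.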