Question

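I'm reading in a large text document and trying to split it into multiple lists. I'm having a hard time actually splitting the strings.

Example of the text: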

Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410

This data contains four pieces of information in the following format:

City[coordinates]Population Distances_to_previous

My goal is to split this data into a single list:

Data = [[City] , [Coordinates] , [Population] , [Distances]]

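As far as I understand, I need to use .split statements, but I've gotten lost trying to implement them.

I'd really appreciate some ideas to get started!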

Answer 1

I would do this in stages.

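Your first split is at the '[' of the coordinates.
Your second split is at the ']' of the coordinates.
The third split is at the end of the line.
The next line (if it starts with a number) holds your distances.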
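
To illustrate the first two splits in isolation, here is a minimal sketch on a single record taken from the sample data. It only shows how str.partition behaves; the variable names are illustrative.

```python
# One record copied from the sample data in the question.
line = "Yankton, SD[4288,9739]12011"

# partition returns a 3-tuple: (text before, separator, text after).
city, bracket, rest = line.partition("[")
coords, _, population = rest.partition("]")

print(city)        # Yankton, SD
print(coords)      # 4288,9739
print(population)  # 12011

# When the separator is missing, the middle element is an empty string,
# which is how a distance line can be told apart from a city line.
no_bracket = "966".partition("[")
print(no_bracket)  # ('966', '', '')
```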

I'd start with something like the following:

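# Sample input from the question; in practice, read the lines from your file
lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410""".splitlines()

numCities = 0
Data = []

i = 0
while i < len(lines):
    split = lines[i].partition('[')
    if not split[1]:  # No '[' on this line, so it is not a city record
        i += 1
        continue
    city = split[0]
    split = split[2].partition(']')
    coords = split[0]  # If you want this as a list, split it on ','
    population = split[2]

    # The next line, if it starts with a digit, holds the distances
    distances = []
    if i + 1 < len(lines) and lines[i + 1][:1].isdigit():
        i += 1
        distances = lines[i].split()

    Data.append([city, coords, population, distances])
    numCities += 1
    i += 1

for data in Data:
    print(data)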

This prints:

['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
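
All of the fields above are still strings. If numeric values are wanted, a conversion step could follow. This is only a sketch: the helper name to_typed is hypothetical, and it assumes every coordinate, population, and distance is a plain integer, as in the sample data.

```python
def to_typed(record):
    """Convert one [city, coords, population, distances] record of strings."""
    city, coords, population, distances = record
    return [
        city,
        [int(part) for part in coords.split(",")],  # '4660,12051' -> [4660, 12051]
        int(population),
        [int(d) for d in distances],                # stays [] when there are none
    ]

record = ['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
typed = to_typed(record)
print(typed)  # ['Yakima, WA', [4660, 12051], 49826, [1513, 2410]]
```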

Answer 2

The simplest approach is to use a regular expression.

lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""

import re

pat = re.compile(r"""
    (?P<City>.+?)                  # all characters up to the first [
    \[(?P<Coordinates>\d+,\d+)\]   # grabs [(digits,here)]
    (?P<Population>\d+)            # population digits here
    \s                             # a space or a newline?
    (?P<Distances>[\d ]+)?         # Everything else is distances""", re.M | re.X)

groups = pat.finditer(lines)
results = [[[g.group("City")],
            [g.group("Coordinates")],
            [g.group("Population")],
            g.group("Distances").split() if 
                    g.group("Distances") else [None]]
            for g in groups]

Sample output:

In[50]: results
Out[50]: 
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
 [['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
 [['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]

Though if you can, it's better to do this as a list of dictionaries.

groups = pat.finditer(lines)
# Build one dictionary per match, keyed by the named groups
results = [{key: g.group(key) for key in
                  ["City", "Coordinates", "Population", "Distances"]}
           for g in groups]
# then modify later
for d in results:
    try:
        d['Distances'] = d['Distances'].split()
    except AttributeError:
        # distances is None -- that's okay
        pass
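
As an aside, match objects also provide groupdict(), which returns all named groups as a dictionary in one call. The following is a minimal self-contained sketch using the same pattern and sample data; the function name parse_records is hypothetical, and replacing a missing distance line with an empty list is a design choice of the sketch rather than something the answer prescribes.

```python
import re

lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""

pat = re.compile(r"""
    (?P<City>.+?)                  # all characters up to the first [
    \[(?P<Coordinates>\d+,\d+)\]   # coordinates inside the brackets
    (?P<Population>\d+)            # population digits
    \s                             # the newline or space after the population
    (?P<Distances>[\d ]+)?         # optional line of distances""", re.M | re.X)

def parse_records(text):
    """Return one dict per city record found in text."""
    records = []
    for match in pat.finditer(text):
        record = match.groupdict()
        # Distances is None when the record has no distance line
        record["Distances"] = (record["Distances"] or "").split()
        records.append(record)
    return records

results = parse_records(lines)
for record in results:
    print(record)
```

With this shape, each field can be read by name, for example results[1]["Population"].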

Splitting a string into multiple lists
