Splitting unicode strings inside a generic list

Time: 2018-07-25 15:49:16

Tags: python string unicode split

My data looks like this:

data = {"technology1": [
       [
       20, 0.02,
      u'10.00,106.10,107.00,107.00,0.45',
      u'24.00,-47.15,-49.50,-51.00,0.12',
      u'11.00,0.35,0.00,0.00,0.92',
      u'0.00',0.04,0.16, u'0.223196881092', u'f',0.02,
     ], 
      [
       100, 0.02,
  u'10.00,106.10,107.00,107.00,0.45',
  u'24.00,-47.15,-49.50,-51.00,0.12',
  u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
  0.16, u'0.223196881092',  u'f', 0.01
   ] ... ],

       "technology2": ...}

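As you can see, it is a dictionary in which each key maps to a list of lists, and all of the lists share the same format. Each "inner" list contains integers and floats, plus unicode strings: some of them hold a single value, and others hold a set of 5 comma-separated numbers.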

What I want:

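One array per technology. In each array, the rows would be the entries of the "outer" list above, and the columns would be the individual elements of each "inner" list. Ideally the unicode values would be converted to plain strings (since I know better how to work with those), and each set of 5 numbers inside a unicode string would be expanded into separate elements.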

i.e. the array for technology1:

20, 0.02, 10.00, 106.10, ... "f", 0.02
100, 0.02, ...            "f", 0.01

What I have tried so far:

for tech in data:

    features = data[tech] # i.e. grab technologyn
    for row in features:
        for i in row[2:5]: # 2 til 5 defines the instance which are sets of 5
            #print i,"\n"
            i = str(i)
            i = i.split(',')

This doesn't work: when I look at features after the code has run, it is exactly the same!

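This isn't an attempt at a complete solution, since it obviously won't convert all of the unicode values to strings, but it is a stepping stone. I also tried a list comprehension, like so: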

for row in features:
    [i.split(',') for i in row if (type(i)==unicode and "," in i)]

2 Answers:

Answer 0: (score: 1)

You need to create a new list object for each row, then replace the original list values:

def row_to_values(row):
    values = []
    for col in row:
        if isinstance(col, unicode) and col != u'f':
            # split and convert all entries to float
            values += (float(v) for v in col.split(','))
        else:
            values.append(col)
    return values

for value in data.values():
    value[:] = [row_to_values(row) for row in value]

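The value[:] = ... assignment tells Python to replace all indices contained in the list object with a new set of objects. Since each value is an outer list in the data dictionary, this replaces all of the sublists with the expanded rows.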
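To make the difference between rebinding a name and assigning to a slice concrete, here is a minimal sketch. The tiny `data` dictionary and the names `outer` and `after_rebinding` are made up for illustration and are not part of the original post; the snippet is written for Python 3, where `str` is the text type.

```python
# Illustrative data only; not the dataset from the question.
data = {"tech": [["1,2", 3]]}
outer = data["tech"]  # keep a reference to the original list object

# Rebinding: the name `rows` now points at a new list; data is untouched.
for rows in data.values():
    rows = [row[0].split(",") + row[1:] for row in rows]
after_rebinding = [list(row) for row in data["tech"]]  # snapshot for comparison

# Slice assignment: the contents of the existing list object are replaced.
for rows in data.values():
    rows[:] = [row[0].split(",") + row[1:] for row in rows]

print(after_rebinding)        # [['1,2', 3]]
print(data["tech"])           # [['1', '2', 3]]
print(data["tech"] is outer)  # True: same list object, new contents
```

The same reasoning applies to the loop in the question: `i = i.split(',')` only rebinds the loop variable `i`, so nothing is ever written back into `row`.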

Demo with part of your sample data:

>>> data = {"technology1": [
...        [
...        20, 0.02,
...       u'10.00,106.10,107.00,107.00,0.45',
...       u'24.00,-47.15,-49.50,-51.00,0.12',
...       u'11.00,0.35,0.00,0.00,0.92',
...       u'0.00',0.04,0.16, u'0.223196881092', u'f',0.02,
...      ],
...       [
...        100, 0.02,
...   u'10.00,106.10,107.00,107.00,0.45',
...   u'24.00,-47.15,-49.50,-51.00,0.12',
...   u'11.00,0.35,0.00,0.00,0.92', u'0.00', 0.04,
...   0.16, u'0.223196881092',  u'f', 0.01
...    ]],
... }
>>> from pprint import pprint
>>> pprint(data["technology1"][0])
[20,
 0.02,
 u'10.00,106.10,107.00,107.00,0.45',
 u'24.00,-47.15,-49.50,-51.00,0.12',
 u'11.00,0.35,0.00,0.00,0.92',
 u'0.00',
 0.04,
 0.16,
 u'0.223196881092',
 u'f',
 0.02]
>>> pprint(row_to_values(data["technology1"][0]))
[20,
 0.02,
 10.0,
 106.1,
 107.0,
 107.0,
 0.45,
 24.0,
 -47.15,
 -49.5,
 -51.0,
 0.12,
 11.0,
 0.35,
 0.0,
 0.0,
 0.92,
 0.0,
 0.04,
 0.16,
 0.223196881092,
 u'f',
 0.02]

So the function call returns a new list object, in which the row has been expanded to include all of the float values from the strings.

Using the function to replace all rows in all of the dictionary values:

>>> for value in data.values():
...     value[:] = [row_to_values(row) for row in value]
...

We can see that the first row we looked at earlier has been updated:

>>> pprint(data["technology1"][0])
[20,
 0.02,
 10.0,
 106.1,
 107.0,
 107.0,
 0.45,
 24.0,
 -47.15,
 -49.5,
 -51.0,
 0.12,
 11.0,
 0.35,
 0.0,
 0.0,
 0.92,
 0.0,
 0.04,
 0.16,
 0.223196881092,
 u'f',
 0.02]

and so has the rest of the dictionary:

>>> pprint(data)
{'technology1': [[20,
                  0.02,
                  10.0,
                  106.1,
                  107.0,
                  107.0,
                  0.45,
                  24.0,
                  -47.15,
                  -49.5,
                  -51.0,
                  0.12,
                  11.0,
                  0.35,
                  0.0,
                  0.0,
                  0.92,
                  0.0,
                  0.04,
                  0.16,
                  0.223196881092,
                  u'f',
                  0.02],
                 [100,
                  0.02,
                  10.0,
                  106.1,
                  107.0,
                  107.0,
                  0.45,
                  24.0,
                  -47.15,
                  -49.5,
                  -51.0,
                  0.12,
                  11.0,
                  0.35,
                  0.0,
                  0.0,
                  0.92,
                  0.0,
                  0.04,
                  0.16,
                  0.223196881092,
                  u'f',
                  0.01]]}
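The code in this answer targets Python 2, where `unicode` is a built-in type. As a hedged sketch only, a Python 3 adaptation might look like the following. It assumes every text value is a `str`, and instead of special-casing `u'f'` it leaves any string that does not parse as numbers unchanged; that fallback is an assumption of this sketch, not something the answer specifies.

```python
def row_to_values(row):
    """Expand comma-separated numeric strings in a row into floats (Python 3)."""
    values = []
    for col in row:
        if isinstance(col, str):
            try:
                # Parse every piece first so a failure leaves `values` untouched.
                parts = [float(v) for v in col.split(",")]
            except ValueError:
                values.append(col)  # non-numeric text such as 'f' is kept as is
            else:
                values.extend(parts)
        else:
            values.append(col)  # ints and floats pass through unchanged
    return values


# Shortened, illustrative row in the same shape as the question's data.
row = [20, 0.02, "10.00,106.10,107.00,107.00,0.45", "0.00", "f", 0.02]
expanded = row_to_values(row)
print(expanded)
```

The list of floats is built before `values` is touched so that a string containing a mix of numeric and non-numeric pieces is appended whole rather than half-expanded.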

Answer 1: (score: 0)

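I have come up with a list-comprehension-heavy solution. If the transformation doesn't exactly match the goal of the task, please comment below. The explanation is given inline, as comments in the snippet: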

def split_or_wrap(item):
    """Split if str, wrap if number."""
    if isinstance(item, str):
        return item.split(',')
    elif isinstance(item, int) or isinstance(item, float):
        return [item]
    else:
        raise Exception("Unexpected item.")


def try_to_convert(item):
    """Try to convert a string into an int, then into a float, or leave it as is."""
    try:
        return int(item)
    except (TypeError, ValueError):
        try:
            return float(item)
        except (TypeError, ValueError):
            return item


# initial list contains values' side of data dictionary
initial_list = [item for item in data.values()]

# flattened list contains list of lists where each list
# corresponds to single technology
flattened_list = [[item 
                   for tech_list in outer_list
                   for item in tech_list] 
                  for outer_list in initial_list]

# deconstructed list takes unicode strings and splits them.
# To make resulting elements consistently nested into lists
# we take single elements and put also in a list.
# This enables us to treat all lists similarly on final flattening step.
deconstructed_list = [[split_or_wrap(tech_item)
                       for tech_item in tech_list] 
                      for tech_list in flattened_list]

# final list contains array of arrays where each array
# contains single numbers (if they are convertible).
# This is done through flattening the so called item-wrapper
# lists into the list corresponding to a particular technology.
final_list = [[try_to_convert(item)
               for item_wrapper in tech_list
               for item in item_wrapper]
             for tech_list in deconstructed_list]


print(final_list)

Output:

[[20, 0, 10.0, 106.1, 107.0, 107.0, 0.45, 24.0, -47.15, -49.5, -51.0, 0.12, 11.0, 0.35, 0.0, 0.0, 0.92, 0.0, 0, 0, 0.223196881092, 'f', 0, 100, 0, 10.0, 106.1, 107.0, 107.0, 0.45, 24.0, -47.15, -49.5, -51.0, 0.12, 11.0, 0.35, 0.0, 0.0, 0.92, 0.0, 0, 0, '0.223196881092f', 0], 
 [100, 0, 10.0, 106.1, 107.0, 107.0, 0.45]]
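In the output above, values such as 0.02 appear as 0 because `int()` truncates a float that is passed to it, and the rows of each technology are merged into one flat list. The sketch below is one possible variation, not the answer author's code: it assumes Python 3 (`str` as the text type), converts only strings, and keeps one list per row. The sample `data` is a shortened, illustrative version of the question's structure.

```python
def split_or_wrap(item):
    """Split strings on commas; wrap anything else in a one-element list."""
    return item.split(",") if isinstance(item, str) else [item]


def try_to_convert(item):
    """Convert numeric text to int or float; leave everything else alone."""
    if not isinstance(item, str):
        return item  # already a number; calling int() here would turn 0.02 into 0
    for cast in (int, float):
        try:
            return cast(item)
        except ValueError:
            continue
    return item


# Illustrative data only, shortened from the question's layout.
data = {
    "technology1": [
        [20, 0.02, "10.00,106.10,0.45", "0.00", "f", 0.02],
        [100, 0.02, "24.00,-47.15,0.12", "7", "f", 0.01],
    ]
}

# One expanded list per row, grouped by technology.
final = {
    tech: [
        [try_to_convert(piece) for item in row for piece in split_or_wrap(item)]
        for row in rows
    ]
    for tech, rows in data.items()
}
print(final)
```

Keeping the result as a dictionary of row lists preserves the row and column layout the question asks for.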