Question

My table:

New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers

I want to keep the empty fields in the table, fill them with the value xxxx, and put the whole table into a list:

New York  3       books        1000
London    2,25    xxxx         2000
Paris     1.000   apples       3000
xxxx      30      xxxx         4000
Berlin    xxxx    newspapers   xxxx

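What I did was take each line and split it:

import re

finallist = []
for line in lines:  # lines: assumed to hold the rows of the table, one string per row
    listtemp = re.split(r"\s{2,}", line)
    finallist.append(listtemp)

Then I zipped the list: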

zippedlist = zip(*finallist)

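I then checked whether each column (now a row) had enough elements and appended xxxx for the missing ones. This does not work, because the split compresses the columns: splitting a line does not pick up the empty areas inside a column.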
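
To make the failure concrete, here is a minimal sketch (the row is copied from the table above; the variable names are only illustrative). Splitting on runs of two or more spaces treats the whole gap left by an empty field as one separator, so the row comes back with three items and nothing marks which column was empty.

```python
import re

# Second row of the table above: the third column is empty.
row = "London    2,25                 2000"

fields = re.split(r"\s{2,}", row)
print(fields)       # ['London', '2,25', '2000']
print(len(fields))  # 3 fields instead of the expected 4
```

Padding the short rows after zipping therefore puts xxxx at the end of the row rather than in the column that was actually empty.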

How can I fill the table with xxxx elements and put it into a list like this:

[['New York','3','books','1000'],['London','2,25','xxxx','2000'],['Paris','1.000','apples','3000'],['xxxx','30','xxxx','4000'],['Berlin','xxxx','newspapers','xxxx']]

Another table could be:

New York      3         books   1000  
  London      2,25              2000  
   Paris  1.000                 3000  
             30                 4000  
  Berlin  apples    newspapers

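Update

Neither of the two answers gave me a solution, but I used both of them to find a different one (after a lot of trial and error...):

import re
import regex  # third-party module (pip install regex); needed below for the variable-width lookbehind

#list of all lines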
r = ['New York      3         books   1000  ', '  London      2,25              2000  ', '   Paris  1.000                 3000  ', '             30                 4000  ', '  Berlin  apples    newspapers ']

#split list
separator = r"\s{2,}"
mylist = []
for i in range(0,len(r)):
   mylisttemp = re.split(separator, r[i].strip())
   mylist.append(mylisttemp)

#search for column matches
p = regex.compile(r"^(?<=\s*)\S|(?<=\s{2,})\S")

i = []
for n in range(0,len(r)):
   itemp = []
   for m in p.finditer(r[n]):
      itemp.append(m.start())
   i.append(itemp)

#find out which matches are on next lines comparing the column match with all the matches of first line (the one with the smallest difference is the match). 
i_currentcols = []
i_0_indexes = list(range(0,len(i[0])))
for n in range(1,len(mylist)):
   if len(i[n]) == len(i[0]):
      continue
   else:
      i_new = []
      for b in range(0,len(i[n])):
         difference = []
         for c in range(0,len(i[0])): #first line is always correct
             difference.append(abs(i[0][c]-i[n][b]))
         i_new.append(difference.index(min(difference)))
      i_notinside = sorted([elem for elem in i_0_indexes if elem not in i_new ], key=int)
      #add linenr.
      i_notinside.insert(0, str(n))
      i_currentcols.append(i_notinside)

#insert missing fields in list
for n in range(0,len(i_currentcols)):
    for i in range(1,len(i_currentcols[n])):
       mylist[int(i_currentcols[n][0])].insert(i_currentcols[n][i], "xxxx")

Answer 1

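This was quite challenging, but I came up with a two-step solution:

Step 1: Detect the column start positions

The complication here is that in some lines a column is empty.

The approach: every double space followed by a non-space character marks the start of a new column, and 0 is always a column start. Search each line for these column starts:

import re

t = """New York  3       books        1000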
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers """

p = re.compile("  [^ ]")

i = set([0])
for line in t.split('\n'):
    for m in p.finditer(line):
        i.add(m.start()+2)
i = sorted(i)

Output: [0, 10, 18, 31]

Step 2: Tokenize each line at these positions

def split_line_by_indexes( indexes, line ):
    tokens=[]
    indexes = indexes + [len(line)]
    for i1,i2 in zip(indexes[:-1], indexes[1:]): #pairs
        tokens.append( line[i1:i2].rstrip() )
    return tokens

for line in t.split('\n'):
    print(split_line_by_indexes(i, line))

Output:

['New York', '3', 'books', '1000']
['London', '2,25', '', '2000']
['Paris', '1.000', 'apples', '3000']
['', '30', '', '4000']
['Berlin', '', 'newspapers', '']

Of course, instead of printing you can replace the empty values with xxxx and then write the result back to a file.
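
As a rough sketch of that last remark (not part of the original answer), the two steps can be combined so that empty cells become xxxx and the rows end up in a list of lists. The helper names `column_starts` and `split_and_fill` and the `filler` parameter are made up for this illustration; the table is the first one from the question.

```python
import re

lines = [
    "New York  3       books        1000",
    "London    2,25                 2000",
    "Paris     1.000   apples       3000",
    "          30                   4000",
    "Berlin            newspapers",
]


def column_starts(lines):
    """Return sorted column offsets: 0 plus every position that
    directly follows a double space in any line."""
    starts = {0}
    for line in lines:
        for m in re.finditer(r"  [^ ]", line):
            starts.add(m.start() + 2)
    return sorted(starts)


def split_and_fill(line, starts, filler="xxxx"):
    """Slice one line at the given offsets and replace empty cells."""
    ends = starts[1:] + [None]  # None lets the last slice run to the end
    cells = [line[a:b].strip() for a, b in zip(starts, ends)]
    return [cell or filler for cell in cells]


starts = column_starts(lines)
table = [split_and_fill(line, starts) for line in lines]
for row in table:
    print(row)
```

Slicing past the end of a short line simply yields an empty string, which is why the last row needs no special handling.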

Answer 2

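This is a very interesting question. I came up with the following concise code; it is basically three lines. Given

import re

# The first table from the question (the output below corresponds to it).
# The pattern requires whitespace after the third field, hence the
# trailing space on the last line.
s = """New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers """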

reg = r'^([\w\s]*?)\s+([\d.,]*?)\s+([\w]*?)\s+([\d]*?)$'
pat = re.compile(reg)
lines = s.splitlines()
# lines could be an `open()` file object
g = (pat.search(line).groups() for line in lines)
result = ([i if i else "xxx" for i in t] for t in g)
# consume the result generator
In [197]: list(result)
Out[197]:
[['New York', '3', 'books', '1000'],
 ['London', '2,25', 'xxx', '2000'],
 ['Paris', '1.000', 'apples', '3000'],
 ['xxx', '30', 'xxx', '4000'],
 ['Berlin', 'xxx', 'newspapers', 'xxx']]

See whether it works for you. If it does, leave a comment so that I can go on to show you how to make it robust and efficient.

Answer 3

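I found another solution that is easier to understand and more general than my previous answer.

Step 1: Find the slicing positions

I search each line for the positions of its spaces:

import re

t = """New York  3       books        1000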
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers """

p = re.compile(" ")

i = None
for line in t.split('\n'):
    thisline = set()
    for m in p.finditer(line):
        thisline.add(m.start()+2)
    print(sorted(thisline))
    if i is None:
        i = thisline
    else:
        i.intersection_update(thisline)
i = sorted(i)

Then I process the indexes to collapse each run of consecutive indexes into a single index, so that [10, 11, 17, 18, 19, 30, 31, 32] becomes [10, 17, 30]:

res = []
last = None
for el in i:
    if not last or el != last + 1:
        res.append(el)
    last = el

Step 2: Tokenize each line at these positions

Same as before:

def split_line_by_indexes( indexes, line ):
    tokens=[]
    indexes = indexes + [len(line)]
    for i1,i2 in zip(indexes[:-1], indexes[1:]): #pairs
        tokens.append( line[i1:i2].rstrip() )
    return tokens

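# slice at the collapsed offsets in res; 0 is prepended because the first column always starts there
for line in t.split('\n'):
    print(split_line_by_indexes([0] + res, line))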

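Conclusion

This is neither perfect nor complete. You still need to trim the results, and the code can certainly be optimized.

I also saw that you found a solution, but I really wanted to post this because I think it is worth a try.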
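
One possible way to tidy this up, shown only as a sketch and not as the code from this answer: pad every line to the same width, mark the offsets that hold a space in every line, and start a column wherever a non-blank offset follows a blank one. This is a variation on the intersection idea above; the names `column_starts` and `to_rows` are invented for the example, and it assumes that no offset inside a field is blank in every row.

```python
lines = [
    "New York  3       books        1000",
    "London    2,25                 2000",
    "Paris     1.000   apples       3000",
    "          30                   4000",
    "Berlin            newspapers",
]


def column_starts(lines):
    """Offsets where a column begins: positions that are not blank in
    every line and are either 0 or preceded by an all-blank offset."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # blank[k] is True when every padded line has a space at offset k
    blank = [all(row[k] == " " for row in padded) for k in range(width)]
    return [k for k in range(width)
            if not blank[k] and (k == 0 or blank[k - 1])]


def to_rows(lines, filler="xxxx"):
    """Slice each line at the column starts, trim, and fill empty cells."""
    starts = column_starts(lines)
    ends = starts[1:] + [None]
    return [[line[a:b].strip() or filler for a, b in zip(starts, ends)]
            for line in lines]


rows = to_rows(lines)
for row in rows:
    print(row)
```

Because the cells are stripped on both sides, the leading spaces mentioned in the conclusion no longer appear in the result.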
