My table:
New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers
I want to keep the empty fields in the table, fill them with the value xxxx, and put the whole table in a list.
New York  3       books        1000
London    2,25    xxxx         2000
Paris     1.000   apples       3000
xxxx      30      xxxx         4000
Berlin    xxxx    newspapers   xxxx
What I did was take each line and split it:
import re

finallist = []
for line in lines:  # lines: the rows of the table, as strings
    listtemp = re.split(r"\s{2,}", line)
    finallist.append(listtemp)
Then I zipped the list
zippedlist = zip(*finallist)
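to check whether each column (now a row) has enough elements and to append xxxx at the end where one is missing.
But this does not work, because the split squeezes the columns together (splitting a line does not pick up the blank areas in a column).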
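To make the failure concrete, here is a minimal sketch using two rows of the table above (the exact column spacing is an assumption, since only the layout matters). The row with an empty third column simply comes back one element short, and nothing in the result says which column was missing:

```python
import re

# Two rows of the table; the second one has an empty third column.
rows = [
    "New York  3       books        1000",
    "London    2,25                 2000",
]

# Splitting on runs of two or more whitespace characters drops the empty cell.
split_rows = [re.split(r"\s{2,}", row) for row in rows]
for fields in split_rows:
    print(fields)
# ['New York', '3', 'books', '1000']
# ['London', '2,25', '2000']
```

The second list has three elements, and '2000' has silently moved into the third position.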
How can I fill the table with xxxx elements and put it into a list like this:
[['New York','3','books','1000'],['London','2,25','xxxx','2000'],['Paris','1.000','apples','3000'],['xxxx','30','xxxx','4000'],['Berlin','xxxx','newspapers','xxxx']]
Another table might be:
New York 3 books 1000
London 2,25 2000
Paris 1.000 3000
30 4000
Berlin apples newspapers
Neither answer gave me a complete solution, but I used both of them to find a different one (after a lot of trial and error...):
import re
import regex  # third-party module; unlike re, it accepts variable-width lookbehind

# list of all lines
r = ['New York 3 books 1000 ', ' London 2,25 2000 ', ' Paris 1.000 3000 ', ' 30 4000 ', ' Berlin apples newspapers ']

# split list
separator = r"\s{2,}"
mylist = []
for i in range(0, len(r)):
    mylisttemp = re.split(separator, r[i].strip())
    mylist.append(mylisttemp)

# search for column matches: the offsets at which a field starts in each line
p = regex.compile(r"^(?<=\s*)\S|(?<=\s{2,})\S")
i = []
for n in range(0, len(r)):
    itemp = []
    for m in p.finditer(r[n]):
        itemp.append(m.start())
    i.append(itemp)

# find out which matches are on the next lines by comparing each column match
# with all the matches of the first line (the one with the smallest difference is the match)
i_currentcols = []
i_0_indexes = list(range(0, len(i[0])))
for n in range(1, len(mylist)):
    if len(i[n]) == len(i[0]):
        continue
    else:
        i_new = []
        for b in range(0, len(i[n])):
            difference = []
            for c in range(0, len(i[0])):  # first line is always correct
                difference.append(abs(i[0][c] - i[n][b]))
            i_new.append(difference.index(min(difference)))
        i_notinside = sorted([elem for elem in i_0_indexes if elem not in i_new], key=int)
        # add linenr.
        i_notinside.insert(0, str(n))
        i_currentcols.append(i_notinside)

# insert missing fields in list
for n in range(0, len(i_currentcols)):
    for k in range(1, len(i_currentcols[n])):
        mylist[int(i_currentcols[n][0])].insert(i_currentcols[n][k], "xxxx")
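Answer 0: (score: 1)
This was quite challenging, but I came up with a two-step solution.
The complication here is that in some rows a column is empty.
The approach: every double space followed by a non-space character marks the start of a new column, and 0 is always a column start. Search every line for these column starts: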
import re

t = """New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers   """
p = re.compile(r"  [^ ]")  # two spaces followed by a non-space character
i = set([0])
for line in t.split('\n'):
    for m in p.finditer(line):
        i.add(m.start() + 2)  # skip the two spaces to land on the field itself
i = sorted(i)
Output: [0, 10, 18, 31]
def split_line_by_indexes(indexes, line):
    tokens = []
    indexes = indexes + [len(line)]
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

for line in t.split('\n'):
    print(split_line_by_indexes(i, line))
Output:
['New York', '3', 'books', '1000']
['London', '2,25', '', '2000']
['Paris', '1.000', 'apples', '3000']
['', '30', '', '4000']
['Berlin', '', 'newspapers', '']
Of course, instead of printing you can replace the empty values with xxxx and then write the result back to a file.
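The replacement step is not shown above, so here is a minimal sketch of the whole approach with the filling included. The function names `column_starts` and `split_and_fill` are mine, not from the answer, and the sketch assumes that columns are left-aligned, that fields never contain two consecutive spaces, and that neighbouring columns are at least two spaces apart.

```python
import re

def column_starts(lines):
    """Offset 0 plus every non-space character preceded by two spaces."""
    starts = {0}
    for line in lines:
        for m in re.finditer(r"  [^ ]", line):
            starts.add(m.start() + 2)
    return sorted(starts)

def split_and_fill(lines, filler="xxxx"):
    """Cut every line at the shared column starts; empty cells become filler."""
    starts = column_starts(lines)
    ends = starts[1:] + [None]  # None lets the last cell run to the end of the line
    table = []
    for line in lines:
        cells = [line[a:b].strip() for a, b in zip(starts, ends)]
        table.append([cell or filler for cell in cells])
    return table

rows = [
    "New York  3       books        1000",
    "London    2,25                 2000",
    "Paris     1.000   apples       3000",
    "          30                   4000",
    "Berlin            newspapers",
]
table = split_and_fill(rows)
for row in table:
    print(row)
# ['New York', '3', 'books', '1000']
# ['London', '2,25', 'xxxx', '2000']
# ['Paris', '1.000', 'apples', '3000']
# ['xxxx', '30', 'xxxx', '4000']
# ['Berlin', 'xxxx', 'newspapers', 'xxxx']
```

Because the column starts are collected from all lines first, a row that is short or has empty cells is still cut at the same offsets as the complete rows.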
Answer 1: (score: 1)
This is a very interesting problem. I came up with the following concise code; it is basically 3 lines. Given
import re

s = """New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers"""

# one group per column; the last separator is \s* so that a row whose
# final column is empty still matches when it has no trailing spaces
reg = r'^([\w\s]*?)\s+([\d.,]*?)\s+([\w]*?)\s*([\d]*?)$'
pat = re.compile(reg)

lines = s.splitlines()
# lines could be an `open()` file object
g = (pat.search(line).groups() for line in lines)
result = ([i if i else "xxx" for i in t] for t in g)

# consume the result generator
In [197]: list(result)
Out[197]:
[['New York', '3', 'books', '1000'],
['London', '2,25', 'xxx', '2000'],
['Paris', '1.000', 'apples', '3000'],
['xxx', '30', 'xxx', '4000'],
['Berlin', 'xxx', 'newspapers', 'xxx']]
See whether it works for you. If it does, leave a comment so that I can go on to explain how to make it robust and efficient.
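Answer 2: (score: 1)
I found another solution, one that is easier to understand and more general than my previous answer.
I search each line for the positions of the space characters and keep only the positions that all lines have in common: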
import re

t = """New York  3       books        1000
London    2,25                 2000
Paris     1.000   apples       3000
          30                   4000
Berlin            newspapers   """
p = re.compile(" ")
i = None
for line in t.split('\n'):
    thisline = set()
    for m in p.finditer(line):
        thisline.add(m.start() + 2)
    print(sorted(thisline))
    if i is None:  # first line: start from its positions
        i = thisline
    else:          # later lines: keep only the shared positions
        i.intersection_update(thisline)
i = sorted(i)
Then I post-process the indexes, collapsing each run of consecutive indexes into its first one, so that [10, 11, 17, 18, 19, 30, 31, 32] becomes [10, 17, 30]:
res = []
last = None
for el in i:
    if last is None or el != last + 1:  # el starts a new run
        res.append(el)
    last = el
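For comparison, the same run-collapsing step can be sketched with `itertools.groupby`. The helper name `run_starts` is hypothetical, and the sketch assumes the input list is sorted and free of duplicates, as it is here.

```python
from itertools import groupby

def run_starts(indexes):
    """Return the first element of every run of consecutive integers."""
    starts = []
    # Inside a run, value minus position stays constant, so it works as a group key.
    for _, run in groupby(enumerate(indexes), key=lambda pair: pair[1] - pair[0]):
        starts.append(next(run)[1])
    return starts

print(run_starts([10, 11, 17, 18, 19, 30, 31, 32]))  # [10, 17, 30]
```

Whether this reads better than the explicit loop is a matter of taste; the result is the same.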
The splitting is the same as before:
def split_line_by_indexes(indexes, line):
    tokens = []
    indexes = indexes + [len(line)]
    for i1, i2 in zip(indexes[:-1], indexes[1:]):  # pairs
        tokens.append(line[i1:i2].rstrip())
    return tokens

for line in t.split('\n'):
    # split at the collapsed indexes; 0 is always a column start
    print(split_line_by_indexes([0] + res, line))
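This is neither perfect nor complete: you still need to trim the results, and the code can certainly be optimized.
I also see that you have already found a solution, but I wanted to post this anyway because I think it is worth a try.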
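The trimming mentioned above could look roughly like this. It is only a sketch: `clean_row` is a hypothetical helper, and the sample tokens are what the sample table gives when a line is cut at offsets 0, 10, 17 and 30, which leaves a leading space on the last two fields.

```python
def clean_row(tokens, filler="xxxx"):
    """Strip surrounding whitespace from each token and replace empty ones."""
    return [token.strip() or filler for token in tokens]

print(clean_row(['New York', '3', ' books', ' 1000']))
# ['New York', '3', 'books', '1000']
print(clean_row(['', '30', '', ' 4000']))
# ['xxxx', '30', 'xxxx', '4000']
```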