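I have an unstructured file that I need to parse with Python. After some initial processing of the file, the data ends up in the following format (the titles are just dummies; they could be anything, such as INDEX, LENGTH, WIDTH, etc.):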
data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c "],
    ["title1-a ", " title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]
The data above is dummy data. The real data set looks like this:
real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]
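Note that each title is spread across three consecutive lists! The real data is therefore effectively grouped like this: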
real = [
    [
        ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
        ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
        ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']
    ],
    [
        ['TIME', 'WWIR'],
        ['DAYS', 'STB/DAY'],
        ['I1']
    ]
]
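The grouping step described above (every three consecutive rows form one title block) could be sketched as follows. The helper name `chunk_rows` and the shortened sample rows are illustrative assumptions, not part of the original code, and the block size of three is taken from the description.

```python
def chunk_rows(rows, size=3):
    """Split a flat list of rows into consecutive groups of `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Shortened sample based on the real data above (assumed for illustration).
flat = [
    ['TIME', 'YEARS', 'WWPR'],
    ['DAYS', 'YEARS', 'STB/DAY'],
    ['P1'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1'],
]

groups = chunk_rows(flat)
for group in groups:
    print(group)
```

If the number of rows is not a multiple of three, the last group is simply shorter; whether that should be treated as an error depends on the file format.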
The real data would be parsed to obtain the following strings:
TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
The goal is as follows. Based on the dummy data, the desired output looks like this:
output = [
    "title1-a title1-b",
    "title2-a title2-b",
    "title3-a title3-b title3-c",
    "title4-a title4-b title4-c",
    "title5-a title5-b title5-c"
]
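I have already worked out a solution. That said, there must be a cleaner, more efficient way, so I would be keen to see alternative solutions. Below is the code I developed to convert the data above into the desired output format.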
def _getTitleData(title_data):
    seen = set()
    titleRows = 3

    # bundle the title rows into groups of three
    titles = [
        title_data[index:index + titleRows]
        for index in range(0, len(title_data), titleRows)
    ]

    # left-pad the last row of each group so it lines up with the first row
    # (note: this modifies the rows of title_data in place)
    for title in titles:
        firstRow = title[0]
        lastRow = title[-1]
        lengthFirstRow = len(firstRow)
        lengthLastRow = len(lastRow)
        if lengthFirstRow > lengthLastRow:
            for index in range(lengthFirstRow - lengthLastRow):
                lastRow.insert(0, '')

    # strip each cell and concatenate the cells column by column
    titles = [
        ' '.join(cell.strip() for cell in word).strip()
        for title in titles
        for word in zip(*title)
    ]

    # remove duplicate entries, keeping first-seen order
    titles = [
        title
        for title in titles
        if not (title in seen or seen.add(title))
    ]

    for title in titles:
        print(title)
    return titles
Answer 0 (score: 2)
Please take a look at my suggestion:
data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c "],
    ["title1-a ", " title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]
# collect every distinct cell, stripped of surrounding spaces
unique = set()
for i in data:
    for j in i:
        unique.add(j.strip(" "))

# print the unique cells in sorted order
print(sorted(unique))
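A set combined with `sorted` yields the unique cells in alphabetical order. If the first-seen order of the cells matters instead, one possible variant (a sketch, not part of the answer above) relies on `dict.fromkeys`, which preserves insertion order in Python 3.7+. The variable name `cells` is introduced here for illustration only.

```python
data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c "],
    ["title1-a ", " title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "],
]

# dict.fromkeys keeps only the first occurrence of each key, in insertion order.
cells = list(dict.fromkeys(cell.strip() for row in data for cell in row))
print(cells)
```

Note that this still returns individual cells; joining them into combined titles requires the column-wise step shown elsewhere in this thread.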
Answer 1 (score: 1)
Based on the real data you provided, this is the solution I came up with:
real = [
    [
        ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
        ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
        ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']
    ],
    [
        ['TIME', 'WWIR'],
        ['DAYS', 'STB/DAY'],
        ['I1']
    ]
]
numOfSublists = len(real)
lengths = [len(elem[0]) for elem in real]

# left-pad the third row of each group with empty strings so it lines up with the first row
for i in range(numOfSublists):
    real[i][2] = [''] * (lengths[i] - len(real[i][2])) + real[i][2]

# join each column into one title, skipping titles that were already seen
dups = set()
output = []
for group in real:
    for column in zip(*group):
        title = " ".join(column).strip()
        if title not in dups:
            dups.add(title)
            output.append(title)

for i in output:
    print(i)
Output:
TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
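For comparison, the whole pipeline (group rows in threes, right-align short rows, join column by column, drop duplicates) can be sketched as one function that works on the flat list and leaves the input untouched. The name `combine_titles` and the parameter `rows_per_title` are hypothetical, and the sketch assumes that every short row in a group is right-aligned against the widest row, which matches the sample data but is not confirmed for other files.

```python
def combine_titles(rows, rows_per_title=3):
    """Join stacked title rows column by column, keeping first-seen order."""
    combined = []
    seen = set()
    for start in range(0, len(rows), rows_per_title):
        group = rows[start:start + rows_per_title]
        width = max(len(row) for row in group)
        # Assumption: short rows are right-aligned, so pad them on the left.
        # New lists are built here, so the caller's rows are not modified.
        padded = [[''] * (width - len(row)) + list(row) for row in group]
        for column in zip(*padded):
            # Strip each cell and skip the empty padding cells.
            title = ' '.join(cell.strip() for cell in column if cell.strip())
            if title and title not in seen:
                seen.add(title)
                combined.append(title)
    return combined


data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-b ", " title2-b", "title3-b ", "title4-b"],
    ["title3-c", " title4-c "],
    ["title1-a ", " title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "],
]

for title in combine_titles(data):
    print(title)
```

Because padding cells are filtered out before joining, no trailing whitespace is produced and no final `strip()` on the joined string is needed.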