Data manipulation with Python

Date: 2018-08-10 15:06:45

Tags: python data-manipulation

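I have an unstructured file that needs to be parsed with Python. After some initial manipulation of the file, the data takes the following format (the titles are just dummies; they could be anything, such as INDEX, LENGTH, WIDTH, etc.):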

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

The data above is dummy data. The real data set looks like this:

real = [
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4'],
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]

Note that each title block spans three lists (rows)! Grouped accordingly, the data is:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

The real data will be parsed to obtain the following strings:

TIME DAYS
YEARS YEARS
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1

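The objectives are as follows:

  1. Merge the associated title entries;
  2. The order of the titles must be preserved;
  3. No duplicates are allowed;
  4. Keep copy operations to a minimum.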

Based on the dummy data, the required output looks like this:

output = [
    "title1-a title1-b", 
    "title2-a title2-b",
    "title3-a title3-b title3-c",
    "title4-a title4-b title4-c",
    "title5-a title5-b title5-c"
]
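
The goals above can also be phrased as checks on a candidate result. This is only a sketch: `check_titles` is a hypothetical helper name, not part of the original code, and it covers only the goals that can be verified from the output alone (no duplicates, no stray whitespace). Order preservation still has to be compared against the input.

```python
def check_titles(titles):
    """Return True if a candidate output meets the checkable goals (hypothetical helper)."""
    # Goal 3: every merged title appears exactly once.
    no_duplicates = len(titles) == len(set(titles))
    # Each entry must be whitespace-normalized: no leading or trailing
    # blanks and a single space between the merged parts.
    normalized = all(title == " ".join(title.split()) for title in titles)
    return no_duplicates and normalized


output = [
    "title1-a title1-b",
    "title2-a title2-b",
    "title3-a title3-b title3-c",
    "title4-a title4-b title4-c",
    "title5-a title5-b title5-c",
]
print(check_titles(output))
```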

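I have worked out a solution. That said, there must be a cleaner, more efficient way to do this, so I would be keen to look at alternative solutions. Below is the code I developed to convert the data above into the required output format.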

def _getTitleData(title_data):
    seen = set()
    titleRows = 3

    # bundle title row(s)
    titles = [
                 title_data[index:index + titleRows] 
                 for index in range(0, len(title_data), titleRows)
             ]

    # apply padding to simplify concatenation
    for title in titles:
        firstRow = title[0]
        lastRow = title[-1]

        # left-pad the last row (in place) so it lines up with the first row
        missing = len(firstRow) - len(lastRow)
        if missing > 0:
            lastRow[:0] = [''] * missing

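    # strip and concatenate titles; split() also removes the stray inner
    # whitespace that a single strip() on the joined string leaves behind
    titles = [
                 ' '.join(' '.join(word).split())
                 for title in titles 
                 for word in zip(*title)
             ]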

    # remove duplicate entries
    titles = [
                 title 
                 for title in titles 
                 if not (title in seen or seen.add(title))
             ]

    for title in titles:
        print(title)
    return titles

2 Answers:

Answer 0 (score: 2)

Please take a look at my suggestion:

data = [
    [" title1-a", "title2-a", "title3-a", " title4-a"], 
    ["title1-b ", " title2-b", "title3-b ", "title4-b"], 
    ["title3-c", " title4-c  "],
    ["title1-a ", "  title5-a"],
    ["title1-b", " title5-b"],
    ["title5-c "]
]

unique = set()

for i in data:
    for j in i:
        unique.add(j.strip(" ") )

print(sorted(unique))
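
Note that a set has no defined order, and `sorted` arranges the entries alphabetically rather than by first appearance. If the original order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order (guaranteed since Python 3.7). The sketch below uses two rows taken from the dummy data and covers only the de-duplication step, not the merging of associated titles:

```python
rows = [
    [" title1-a", "title2-a", "title3-a", " title4-a"],
    ["title1-a ", "  title5-a"],
]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
unique = list(dict.fromkeys(item.strip() for row in rows for item in row))
print(unique)
# ['title1-a', 'title2-a', 'title3-a', 'title4-a', 'title5-a']
```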

Answer 1 (score: 1)

Based on the real data you provided, here is the solution I came up with:

real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']],[
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

numOfSublists = len(real)
lengths = [len(elem[0]) for elem in real]
# left-pad the third row of each group so it lines up with the first row
for i in range(numOfSublists):
    real[i][2] = [' '] * (lengths[i] - len(real[i][2])) + real[i][2]

# join each column into one string, keeping only the first occurrence
dups = set()
output = []
for group in real:
    for column in zip(*group):
        line = " ".join(column)
        if line not in dups:
            dups.add(line)
            output.append(line)
for i in output:
    print(i)

Output:

TIME DAYS  
YEARS YEARS  
WWPR STB/DAY P1
WWPR STB/DAY P2
WWPR STB/DAY P3
WWPR STB/DAY P4
WOPR STB/DAY P1
WOPR STB/DAY P2
WOPR STB/DAY P3
WOPR STB/DAY P4
WWIR STB/DAY I1
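
The first two output lines end in two spaces because the padding value `' '` is joined in like any other entry. A possible variant, sketched here under the assumption that shorter rows are always right-aligned within their group, pads with empty strings, skips them when joining, and builds new rows instead of modifying `real`. `merge_titles` is a hypothetical name, not part of the code above.

```python
def merge_titles(groups):
    """Join right-aligned title rows column by column (hypothetical helper)."""
    seen = {}  # dict keys keep insertion order, giving ordered de-duplication
    for group in groups:
        width = max(len(row) for row in group)
        # Left-pad shorter rows with empty strings in new lists,
        # so the caller's data is not modified.
        padded = [[""] * (width - len(row)) + row for row in group]
        for column in zip(*padded):
            # Strip each part and skip the empty padding before joining.
            parts = [part.strip() for part in column if part.strip()]
            seen.setdefault(" ".join(parts), None)
    return list(seen)


real = [[
    ['TIME', 'YEARS', 'WWPR', 'WWPR', 'WWPR', 'WWPR', 'WOPR', 'WOPR', 'WOPR', 'WOPR'],
    ['DAYS', 'YEARS', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY', 'STB/DAY'],
    ['P1', 'P2', 'P3', 'P4', 'P1', 'P2', 'P3', 'P4']], [
    ['TIME', 'WWIR'],
    ['DAYS', 'STB/DAY'],
    ['I1']
]]

for line in merge_titles(real):
    print(line)
```

The padded rows are small temporary copies, so this trades a little extra copying for leaving the input untouched; whether that fits the "minimal copying" goal depends on the size of the real files.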