Extracting column data that spans multiple lines

Date: 2014-08-31 21:31:57

Tags: python

I have a file laid out like this:

; start item 1
; item 1 line ; start item 2; start item 3
; item 1 line ; item 2 line ; item 3 line ; start item 4
; item 1 line ; item 2 line ; item 3 line ; item 4 line ; start item 5
; item 1 line ; item 2 line
; item 1 line 
; item 1 line

; item 6 start
; item 6 line ; item 7 start
; item 6 line ; item 7 line ; item 8 start
; item 6 line ;; item 8 line
; item 6 line
; item 6 line ; item 9 start
; item 6 line ; item 9 line
; item 6 line
; item 6 line ; item 0 start
; item 6 line ; item 0 line
;; item 0 line
;; item 0 line

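(Think of the columns as different people and the rows as what they say. A row with several cells means several people are talking at the same time.)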

I am trying to parse this so that I can get each item separately, but I have only partly succeeded. Here is my approach:

def unpacker(File):
    # Collect every cell under a "<block>_<column>" key.
    Values = {}
    main_key = 0  # block counter, bumped on every line without a separator
    sep = ';'
    with open(File) as fn:
        for line in fn:
            if line.count(sep):
                for i, sub_line in enumerate(line.split(sep)):
                    sub_key = str(main_key) + '_' + str(i)
                    sub_line = sub_line.replace('\n', '')

                    # Append to an existing non-empty entry, otherwise start one.
                    if Values.get(sub_key):
                        Values[sub_key] += ('|' + sub_line)
                    else:
                        Values[sub_key] = sub_line
            else:
                main_key += 1

    for k in Values.keys():
        print(k, '---------')
        print(Values[k])

Its output with the sample data:

1_3 ---------
 item 8 start| item 8 line
1_2 ---------
 item 7 start| item 7 line || item 9 start| item 9 line| item 0 start| item 0 line| item 0 line| item 0 line
1_1 ---------
 item 6 start| item 6 line | item 6 line | item 6 line | item 6 line| item 6 line | item 6 line | item 6 line| item 6 line | item 6 line ||
1_0 ---------

0_4 ---------
 start item 4| item 4 line 
0_5 ---------
 start item 5
0_2 ---------
 start item 2| item 2 line | item 2 line | item 2 line
0_3 ---------
 start item 3| item 3 line | item 3 line 
0_0 ---------

0_1 ---------
 start item 1| item 1 line | item 1 line | item 1 line | item 1 line | item 1 line | item 1 line

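Each item is assigned its own key if it is not already in the dictionary. The number of cells per line may vary, but the semicolons will always follow that pattern.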

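This approach works for the first part of the example above (items 1 to 5), but in the second half (item 6 onward) it fails to separate items 7, 9 and 0. That would be fine if 7, 9 and 0 were related, but they are not. I am stuck on how to tell these items apart.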
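To make the collision concrete, here is a small sketch of what the split produces for three lines of the second block. The helper name `split_cells` is made up for this illustration and is not part of the code above. Items 7 and 9 both land at index 2, which is why a key built from the block number and the column index merges them. The empty cell at index 2 in the middle line is the only hint that item 7 has ended.

```python
def split_cells(line, sep=";"):
    """Return the stripped cells of one line, keeping empty cells in place."""
    return [cell.strip() for cell in line.rstrip("\n").split(sep)]


rows = [
    "; item 6 line ; item 7 start",
    "; item 6 line ;; item 8 line",
    "; item 6 line ; item 9 start",
]
for row in rows:
    print(split_cells(row))
# ['', 'item 6 line', 'item 7 start']
# ['', 'item 6 line', '', 'item 8 line']
# ['', 'item 6 line', 'item 9 start']
```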

1 answer:

Answer 0 (score: 1)

Here is some code that handles your example. You may have to adapt it to your real use case:

class Speaker(list):
    def __init__(self):
        list.__init__(self)
        self.activated = True

    def talk(self, string):
        if self.activated :
            talk = string.replace("\n", "")
            if talk :
                self.append(talk)
            else:
                self.activated = False

        return self.activated


class SpeakerIndex(dict):
    def __init__(self, filepath, separator):
        """ Creation of index """
        dict.__init__(self)
        self.separator = separator

        self.talk = 0

        self.toSpeak = []
        self.hadSpeak = []
        with open(filepath, 'r') as data:
            for line in data:
                ##print("line: ",line)
                ##print("toSpeak: ",self.toSpeak)
                self.speakersFeed(line)
                # save and drop the speakers that were expected on this line but did not speak
                for speaker in self.toSpeak:
                    self.save_speaker(speaker)
                self.toSpeak = self.hadSpeak
                self.hadSpeak = []

    def speakersFeed(self, line):
        """ parse a line """
        if self.separator in line:
            for speaker_action in line.split(self.separator)[1:]:
                ##print("action :",speaker_action)
                speaker = None
                # pick the next expected speaker, or create a new one
                if self.toSpeak:
                    speaker = self.toSpeak.pop(0)
                else:
                    speaker = Speaker()
                #process the content
                result = speaker.talk(speaker_action)
                ##print("speaker : ",speaker)
                # route the speaker according to its state
                if result :
                    self.hadSpeak.append(speaker)
                else:
                    self.save_speaker(speaker)
        else:
            # save the speakers that may not have finished at this point
            for speaker in self.toSpeak:
                self.save_speaker(speaker)
            self.talk +=1

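    def speaker_id(self, speaker):
        """ Return a unique id for a speaker """
        # Assumes the speaker's first entry looks like " item N start",
        # so the third space-separated token is the item number.
        number = int(speaker[0].split(" ")[2])
        return "talk{0}-speaker{1}".format(self.talk, number)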

    def save_speaker(self, speaker):
        self[self.speaker_id(speaker)]=speaker
        ##print("saved :",speaker)

    def __str__(self):
        """ Override the str() behavior """
        keylist = list(self.keys())
        keylist.sort()
        result = "{\n"
        for key in keylist:
            result += "\t" + str(key) + " : " + str(self[key]) + "\n"
        result += "}"
        return result


if __name__ == "__main__":
    index = SpeakerIndex("foo.txt", ";")
    print(str(index))

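You can uncomment the print lines to get an execution trace. The idea behind these classes is to keep, at all times, a list of the speakers who are currently talking (toSpeak), consumed from the front as each line is parsed.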

Running it gives me this:

python3 ./sof.py 
{
    talk0-speaker1 : [' item 1 start', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line']
    talk0-speaker2 : [' item 2 start ', ' item 2 line ', ' item 2 line ', ' item 2 line']
    talk0-speaker3 : [' item 3 start', ' item 3 line ', ' item 3 line ']
    talk0-speaker4 : [' item 4 start', ' item 4 line ']
    talk0-speaker5 : [' item 5 start']
    talk1-speaker0 : [' item 0 start', ' item 0 line', ' item 0 line']
    talk1-speaker1 : [' item 1 start', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line ', ' item 1 line']
    talk1-speaker6 : [' item 6 start', ' item 6 line ', ' item 6 line ', ' item 6 line ', ' item 6 line', ' item 6 line ', ' item 6 line ', ' item 6 line', ' item 6 line ', ' item 6 line ']
    talk1-speaker7 : [' item 7 start', ' item 7 line ']
    talk1-speaker8 : [' item 8 start', ' item 8 line']
    talk1-speaker9 : [' item 9 start', ' item 9 line']
}
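
For comparison, here is a minimal sketch of a different way to get the same kind of grouping: remember which item currently owns each column position instead of consuming speakers from a list. This is not the code above. The names `parse_columns`, `open_items` and `finished` are hypothetical, and the sketch rests on two assumptions drawn from the sample data only: an empty or missing cell ends the item in that column, and a line without any separator ends every open item.

```python
def parse_columns(lines, sep=";"):
    """Group cells into items by column position.

    Assumptions (illustrative only): an empty or missing cell ends the
    item in that column, and a line without a separator ends all items.
    """
    finished = []    # completed items, in the order they were closed
    open_items = {}  # column index -> cells of the item still in progress

    def close(column):
        finished.append(open_items.pop(column))

    for raw in lines:
        line = raw.rstrip("\n")
        if sep not in line:
            # Block boundary: every open item ends here.
            for column in sorted(open_items):
                close(column)
            continue

        # Drop the text before the first separator, keep empty cells in place.
        cells = [cell.strip() for cell in line.split(sep)[1:]]

        # An open column that is now empty or absent has finished.
        for column in sorted(open_items):
            if column >= len(cells) or not cells[column]:
                close(column)

        # Non-empty cells extend the open item or start a new one.
        for column, cell in enumerate(cells):
            if cell:
                open_items.setdefault(column, []).append(cell)

    # End of input: close whatever is still open.
    for column in sorted(open_items):
        close(column)
    return finished


sample = [
    "; item 6 start",
    "; item 6 line ; item 7 start",
    "; item 6 line ;; item 8 start",
    "; item 6 line",
    "; item 6 line ; item 9 start",
]
for item in parse_columns(sample):
    print(item)
```

Items come back in the order they are closed, not in the order they started, so a caller that needs start order would have to record it separately.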