Parsing a blank-line-separated file is too slow

Time: 2017-04-22 07:43:25

Tags: python

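I have been struggling for days to parse the text/PGN files used for chess games. You can treat the file as a plain .txt file. Every 2 blocks represent one game. The file looks like this: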

[Event "15th Czerniak mem"]
[Site "Tel Aviv ISR"]
[Date "1999.04.05"]
[Round "6"]
[White "Karolyi, T"]
[Black "Lutz, C"]
[Result "0-1"]
[WhiteElo "2432"]
[BlackElo "2610"]
[ECO "A87"]

1.d4 f5 2.c4 Nf6 3.g3 g6 4.Bg2 Bg7 5.Nc3 O-O 6.Nf3 d6 7.d5 Qe8 8.Be3 Na6 
9.Qc1 e5 10.dxe6 Bxe6 11.O-O c6 12.b3 Ng4 13.Bf4 Qe7 14.Qd2 Rad8 15.Rad1 
Nc5 16.Nd4 Ne5 17.Bg5 Bf6 18.Bxf6 Qxf6 19.f4 Nf7 20.b4 Bxc4 21.Nxc6 bxc6 
22.bxc5 d5 23.Rfe1 d4 24.Na4 Rfe8 25.Rc1 Bd5 26.Bxd5 Rxd5 27.Nb2 Nh6 28.
Nd3 Ng4 29.Nf2 Ne3 30.Nd1 g5 31.Nxe3 Rxe3 32.Rf1 Qe6 33.Rf2 g4 34.Rb1 Rd7 
35.Qc2 Kg7 36.Rb3 Kf6 37.Rxe3 Qxe3 38.Qb2 Qc3 39.Qb8 Qxc5 40.Qc8 Qd5 41.
Rf1 Qe6 0-1

[Event "Danilo Batricevic Mem Balkan GP"]
[Site "Cetinje MNE"]
[Date "2012.10.24"]
[Round "6.6"]
[White "Nikcevic, N"]
[Black "Blagojevic, Dr"]
[Result "1/2-1/2"]
[WhiteElo "2432"]
[BlackElo "2526"]
[ECO "A14"]
[EventDate "2012.10.20"]
[WhiteTitle "GM"]
[BlackTitle "GM"]
[Opening "English opening"]
[Variation "Agincourt variation"]
[WhiteFideId "901776"]
[BlackFideId "900885"]

1.c4 e6 2.Nf3 d5 3.b3 Nf6 4.g3 Be7 5.Bg2 c5 6.O-O Nc6 7.e3 O-O 8.Bb2 b6 9.
Nc3 dxc4 10.bxc4 Bb7 11.Qe2 Rc8 12.Rac1 Qc7 13.d4 Na5 14.Nb5 Qb8 15.dxc5 
Bxc5 16.Bxf6 gxf6 17.Rfd1 Rfd8 18.Nh4 Bxg2 19.Nxg2 Nc6 20.Qg4+ Kh8 21.
Rxd8+ Nxd8 22.Qh4 Be7 23.Rd1 Nc6 24.Qh5 Kg8 25.Qg4+ Kh8 26.Qh5 Kg8 27.Qg4+
Kh8 1/2-1/2

[Event "FSGM October"]
[Site "Budapest HUN"]
[Date "2003.10.09"]
[Round "6"]
[White "Anka, E"]
[Black "Taylor, T"]
[Result "1-0"]
[WhiteElo "2432"]
[BlackElo "2385"]
[ECO "C41"]

1.e4 e5 2.Nf3 d6 3.d4 exd4 4.Nxd4 g6 5.Nc3 Bg7 6.Be3 Nf6 7.Qd2 O-O 8.O-O-O
Re8 9.f3 Nc6 10.g4 Ne5 11.Be2 a6 12.Bh6 Bh8 13.h4 b5 14.h5 b4 15.Nd5 c5 
16.Nf5 Nxd5 17.Qxd5 Be6 18.Qxd6 Qf6 19.g5 Nd3+ 20.Kd2 Qd8 21.hxg6 fxg6 22.
Bxd3 gxf5 23.Qxd8 Raxd8 24.Kc1 c4 25.exf5 cxd3 26.fxe6 Rxe6 27.Rd2 Rc6 28.
Kb1 Rcd6 29.Rxd3 Rxd3 30.cxd3 Rxd3 31.Rc1 Kf7 32.Rc8 Bd4 33.Kc2 1-0
.......
.......

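Above you can see 3 games in total (== 6 sections). Each game is split into 2 sections: one starting with [Event "15th Czer...." and the other starting with 1.d4 f5 2.c4... Together they make up one game. I am trying to parse a very large file (~2 GB) that looks like this. Here is what I tried, but it takes more than 10 minutes to parse even a simple file, which is very slow: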

with open('data/chessgames.pgn', 'r', encoding='latin-1') as f:
    file_in_memory = f.read()  # ~2 GB file held in memory
    items = file_in_memory.strip().split('\n\n')  # split on blank lines
    list_of_games = []

item = 0
while item + 1 < len(items):  # stop before a trailing block that has no partner
    summary = items[item].split("\n")  # header lines: [Event "FIDE Candidates"], [Site "London ENG"], [Date "some date"], ...
    if not summary[0].startswith("["):  # skip a block that is not a header section (malformed data)
        item += 1
        continue
    moves = items[item + 1]  # move text: 1.e4 e5 2.Nf3 Nc6 3.Bb5 Nf6 4.d3 ...
    item += 2
    parser(summary, moves, list_of_games)  # parse one game and append it to list_of_games

# BATCH INSERT (not one at a time): all parsed games go into a sqlite3 db, one row per game.
# Happy to share this function if anyone wants to see it.
insert_into_database(list_of_games)
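
Before changing anything, it may help to measure where the time actually goes. The following is a minimal sketch using the standard cProfile and pstats modules; parse_header_lines and profile_call are hypothetical names introduced here, and parse_header_lines is only a stand-in for the header-parsing step, not the code from the question.

```python
import cProfile
import io
import pstats
import shlex


def parse_header_lines(lines):
    """Stand-in workload: split each '[Key "value"]' line with shlex."""
    row = {}
    for line in lines:
        key, val = shlex.split(line[1:-1].strip())
        row[key] = val
    return row


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)  # slowest call paths first
    return stream.getvalue()


# A small synthetic sample; a slice of the real file would work the same way.
sample = ['[Event "15th Czerniak mem"]', '[Site "Tel Aviv ISR"]'] * 1000
report = profile_call(parse_header_lines, sample)
print(report)
```

Running this on a few thousand games from the real file, rather than on all 2 GB, is usually enough to show which function dominates.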

The parser function, which takes the most time:

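import logging
import shlex

logger = logging.getLogger(__name__)  # assumed setup; the original logger creation is not shown


def parser(summary, moves, list_of_games):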
    row = {'Event': None, 'Site': None, 'Date': None, 'Round': None,
           'White': None, 'Black': None, 'Result': None,
           'WhiteElo': None, 'BlackElo': None,
           'ECO': None, 'EventDate': None, 'Opening': None,
           'Variation': None}

    for item in summary:  # each item looks like: [Event "FIDE Candidates"]
        try:
            # drop the surrounding brackets; shlex.split also removes the double quotes
            key, val = shlex.split(item[1:-1].strip())
            row[key] = val
        except Exception as e:
            print("ERRORSS")
            logger.error("Exception raised: error parsing %r: %s", item, e)

    chess = ChessSchema(row, moves)  # create an object of ChessSchema
    list_of_games.append(chess.ret_tuple_of_game())
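
shlex.split is a general shell-style tokenizer implemented in pure Python, so calling it once per header line is a plausible hot spot (worth confirming with a profiler). A minimal sketch of an alternative, assuming every header line has the form [Key "value"]: a precompiled regular expression extracts both parts in one match. TAG_RE and parse_tags are hypothetical names, not part of the code above.

```python
import re

# Matches one tag pair such as: [Event "15th Czerniak mem"]
# Group 1 is the tag name, group 2 the text between the double quotes.
# Backslash-escaped characters inside the value are allowed but not unescaped.
TAG_RE = re.compile(r'^\[(\w+)\s+"((?:[^"\\]|\\.)*)"\]$')


def parse_tags(summary):
    """Return a dict of tag name -> value for the header lines that match."""
    row = {}
    for line in summary:
        match = TAG_RE.match(line.strip())
        if match:  # lines that do not look like a tag pair are ignored
            key, val = match.groups()
            row[key] = val
    return row
```

Unlike shlex.split, this leaves backslash escapes in the value untouched, and it silently skips malformed lines instead of raising, so logging would have to be added separately if those lines matter.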

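How can I write this code the "right" way, given that even a medium-sized file takes far too long? How can I make it fast? I have tried generators, but even so I keep the whole file in memory, which I assumed should be the fastest option. Finally, here is what my ChessSchema class looks like: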

class ChessSchema:
    """
    Holds one parsed game; ret_tuple_of_game() returns its 14 fields.
    """
    def __init__(self, game_dict, moves):

        self.Event = game_dict['Event']
        self.Site = game_dict['Site']
        self.Date = game_dict['Date']
        self.White = game_dict['White']  # name
        self.Black = game_dict['Black']  # name
        self.Round = game_dict['Round']
        self.Result = game_dict['Result']
        self.ECO = game_dict['ECO']
        self.EventDate = game_dict['EventDate']
        self.Opening = game_dict['Opening']
        self.Variation = game_dict['Variation']
        self.moves = moves
        try:
            self.WhiteElo = int(game_dict['WhiteElo'])  # White's Elo rating
            self.BlackElo = int(game_dict['BlackElo'])  # Black's Elo rating
        except (TypeError, ValueError):  # rating missing (None) or not numeric
            self.WhiteElo = game_dict['WhiteElo']
            self.BlackElo = game_dict['BlackElo']
        self.winner = self._winner()

    def _winner(self):
        if self.Result:
            if self.Result == "0-1":
                return "Black"
            elif self.Result == "1-0":
                return "White"
            else:
                return "Draw"

    def ret_tuple_of_game(self):
        return (self.Event, self.Site, self.Date, self.White, self.Black,
                self.Result, self.WhiteElo, self.BlackElo, self.ECO, self.EventDate,
                self.Opening, self.Variation, self.moves, self.winner)
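
The 14-field tuple returned by ret_tuple_of_game maps directly onto one database row. Since insert_into_database is not shown, the following is only a sketch under assumptions: the table name games, the untyped columns, and the helper names create_table and insert_in_batches are all hypothetical. It inserts rows in fixed-size batches from any iterable, so the full list of games never has to exist in memory at once.

```python
import sqlite3
from itertools import islice

# Same order as the tuple returned by ret_tuple_of_game().
COLUMNS = ("Event", "Site", "Date", "White", "Black", "Result", "WhiteElo",
           "BlackElo", "ECO", "EventDate", "Opening", "Variation", "moves", "winner")


def create_table(conn):
    """Create the (hypothetical) games table with one untyped column per field."""
    column_list = ", ".join(f'"{name}"' for name in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS games ({column_list})")


def insert_in_batches(conn, rows, batch_size=10_000):
    """Insert an iterable of 14-field tuples in batches; return the row count."""
    placeholders = ", ".join("?" for _ in COLUMNS)
    sql = f"INSERT INTO games VALUES ({placeholders})"
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))  # take up to batch_size rows
        if not batch:
            break
        with conn:  # one transaction (and one commit) per batch
            conn.executemany(sql, batch)
        total += len(batch)
    return total
```

Passing a generator of tuples as rows keeps at most batch_size games in memory at a time; the batch size itself is a guess and would need tuning against the real data.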

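Update: I understand that there is a pgn library, as mentioned in the comments, but I am trying to parse this as a plain text file without using an external library (or at least give it a try!). So, based on the feedback in the comments about reading the file line by line instead of dumping the whole file into memory, I tried that but failed, because looking at the first character is the only way to validate a section. If it is "1" then it is fine, but if it is not, I don't want to count the lines that follow. Is the following better or faster than the above?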

line = f.readline()
while line:  # loop until EOF
    print("LINE:", line)

    # Parse the game headers: [Event "Fide inv"], [Site "Fullerham"], ...
    if line.startswith("["):
        while line.startswith("["):
            print(line)
            key, val = shlex.split(line[1:-2].rstrip())  # drop "[" and the trailing "]\n"
            row[key] = val
            line = f.readline()

    if line.isspace():  # blank line after the headers (could also check for "%")
        line = f.readline()
        all_games.append(row)

    analyze_data()

    # Parse the move text
    if line.startswith("1"):
        tmp = ""
        while line and not line.isspace():
            tmp += line  # accumulate the current line before reading the next one
            line = f.readline()
        row['moves'] = tmp

    line = f.readline()
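
The validation rule described above (headers start with "[", move text starts with "1", anything else is skipped) can be expressed as a small generator that reads one line at a time. This is a minimal sketch, not a full PGN parser: iter_games is a hypothetical name, and a game whose move text does not start with "1" (for example one that contains only a result such as 0-1) would be dropped under this rule.

```python
def iter_games(lines):
    """Yield (header_lines, move_text) pairs from an iterable of lines.

    A header block is a run of lines starting with "[". The move block that
    follows is accepted only if its first line starts with "1"; any other
    block is skipped together with the headers that preceded it.
    """
    headers, moves = [], []
    headers_closed = False  # True once a blank line has followed the headers
    for raw in lines:
        line = raw.strip()
        if not line:
            if moves:  # blank line after move text ends the game
                yield headers, " ".join(moves)
                headers, moves, headers_closed = [], [], False
            elif headers:
                headers_closed = True
        elif moves:
            moves.append(line)  # continuation of the current move text
        elif line.startswith("["):
            if headers_closed:  # previous header block had no moves: drop it
                headers, headers_closed = [], False
            headers.append(line)
        elif headers and line.startswith("1"):
            moves.append(line)  # first line of a valid move block
        else:
            headers, headers_closed = [], False  # ill-formed block: skip it
    if moves:  # file ended without a trailing blank line
        yield headers, " ".join(moves)
```

Because an open file is itself an iterable of lines, iter_games(f) can be used directly inside the with open(...) block, and only one game is held in memory at a time.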

0 answers:

No answers