Question

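I use the code below to extract data. The text contains data in two different structures, so I need some conditional handling. The code works, but I don't think it is good code.

I am a beginner with regular expressions. I have searched for articles, but I haven't found a way to improve it.

How can I optimize the following code?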

import re
import html
import json

filepath="D:/Response.txt"
data=open(filepath,'r', encoding='utf-16').read()

rex1 = "msgList = '({.*?})'"
rex2='"general_msg_list":"({.*?})"'

def get_art(data,rex):
    pattern = re.compile(pattern=rex, flags=re.S)
    match = pattern.search(data)
    if match:
        data = match.group(1).replace('\\','')
        # there is some difference for data.
        if rex=="msgList = '({.*?})'":
            data = html.unescape(data)
        data = json.loads(data)
        articles = data.get("list")
        for item in articles:
            print('\nthe result is:\n',item)

with open(filepath,'r', encoding='utf-16') as fp:  
   line = fp.readline()
   while line:
       try:
           get_art(line.strip(),rex1)
       except:
           pass
       try:
           get_art(line.strip(),rex2)
       except:
           pass
       line = fp.readline()

I need to capture the data in (msgList = ....) or in ("general_msg_list":"...") and convert the captured string to JSON. For the data in (msgList = ....), I found that I need to call data = html.unescape(data). If I call data = html.unescape(data) on the data from ("general_msg_list":"..."), an error occurs.
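One possible direction is to store the "needs html.unescape" decision next to each pattern instead of comparing the regex string inside the function. The sketch below assumes the two structures look the way the patterns above suggest. The names PATTERNS and extract_articles are made up for illustration, and the braces are escaped only to make the literal match explicit.

```python
import html
import json
import re

# Each entry pairs a compiled pattern with a flag saying whether the
# captured text must go through html.unescape() before JSON parsing.
# (Hypothetical table; the patterns mirror rex1 and rex2 from the question.)
PATTERNS = [
    (re.compile(r"msgList = '(\{.*?\})'", re.S), True),
    (re.compile(r'"general_msg_list":"(\{.*?\})"', re.S), False),
]


def extract_articles(text):
    """Return the "list" entries found by any of the patterns in text."""
    articles = []
    for pattern, needs_unescape in PATTERNS:
        match = pattern.search(text)
        if match is None:
            continue  # this structure is not present in the text
        # Same backslash stripping as in the question's get_art().
        payload = match.group(1).replace("\\", "")
        if needs_unescape:
            payload = html.unescape(payload)
        articles.extend(json.loads(payload).get("list", []))
    return articles
```

With this layout, adding a third structure would mean adding one more tuple rather than another try/except pair.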

Currently, I use

try:
    get_art(line.strip(),rex1)
except:
    pass
try:
    get_art(line.strip(),rex2)
except:
    pass

I think there should be a better way to replace this.

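Maybe a better approach would be to read the whole file at once instead of line by line. My problem is that I find it hard to process the whole file's data, which is why I read it one line at a time.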
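For the whole-file idea, a minimal sketch could combine both structures into a single alternation with named groups and walk the text with finditer. This assumes the file fits in memory. The names COMBINED, iter_articles, and the group names escaped and plain are hypothetical, chosen only for this illustration.

```python
import html
import json
import re

# One alternation with two named groups: whichever group matched tells us
# which structure was found. re.S lets "." span newlines, so the pattern
# also works on the whole file content at once.
COMBINED = re.compile(
    r"msgList = '(?P<escaped>\{.*?\})'"
    r'|"general_msg_list":"(?P<plain>\{.*?\})"',
    re.S,
)


def iter_articles(text):
    """Yield every item of the "list" key from each match in text."""
    for match in COMBINED.finditer(text):
        kind = match.lastgroup  # "escaped" or "plain"
        payload = match.group(kind).replace("\\", "")
        if kind == "escaped":
            payload = html.unescape(payload)
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip a malformed block instead of hiding all errors
        yield from data.get("list", [])
```

It could then be called on the result of fp.read() inside the existing with open(filepath, 'r', encoding='utf-16') block, looping over iter_articles(...) and printing each item. Catching only json.JSONDecodeError, rather than using a bare except, keeps unrelated errors visible.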

How to optimize a regular expression for a condition

0 answers: