Question

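Files with tabular content can be exported in at least three encodings (UTF-8, UTF-16LE, ASCII), with columns separated by tabs, pilcrows, or other delimiters, and with quotes, thorns, or other characters wrapped around each item. The following function reads a UTF-8 table whose columns are separated by pilcrows and whose items are each wrapped in thorns.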

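import codecs
import re


def read_app_dat(app_export):
    """Reads and parses a DAT file exported from App.

    Assumes the Concordance default delimiters.

    Args:
        app_export: str, file path to DAT exported by App
    Returns:
        Dictionary mapping each ID to its URI ID
    """
    app_dict = {}
    # The context manager closes the file even if parsing fails.
    with codecs.open(app_export, encoding='utf-8') as f:
        for line in f:
            # Drop the line ending so it does not end up in the last field.
            line = line.rstrip("\r\n")
            # Strip the thorn wrappers, then split on the column separator.
            each_row = re.sub(r'\xfe', "", line).split("\x14")
            # Skip the header row.
            if "ID" in each_row[0] or "URI" in each_row[1]:
                continue
            app_dict[each_row[0]] = each_row[1]
    return app_dict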

As currently written, I would need to define a different version of this line for each scenario:

each_row = re.sub(r'\xfe', "", line).split("\x14")

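That isn't very Pythonic. How can I handle the delimiters better (in this case pilcrows and thorns) so that I can pass them in as parameters? So far the codecs module has been the most useful.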
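To make the goal concrete, here is a rough sketch of the kind of interface I have in mind. It is not a tested solution: it assumes Python 3, assumes the standard csv module accepts the separator and wrapper as its delimiter and quotechar, and assumes the first row is always a header. The function name read_delimited and its parameter names are made up for illustration.

```python
import csv


def read_delimited(path, encoding="utf-8", delimiter="\x14", quotechar="\xfe"):
    """Map the first column to the second column of a delimited export.

    Hypothetical helper: the name, parameters, and defaults are
    assumptions chosen to mirror the hard-coded values in read_app_dat.
    """
    result = {}
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding=encoding, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
        next(reader, None)  # assumption: the first row is a header
        for row in reader:
            if len(row) >= 2:  # ignore blank or short rows
                result[row[0]] = row[1]
    return result
```

With something like this, a tab-delimited, double-quoted UTF-16LE export would be read with read_delimited(path, encoding="utf-16-le", delimiter="\t", quotechar='"') instead of a separate hard-coded line.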

Thank you for your time.

How can I support multiple file formats and field delimiters?

0 answers: