Parsing a custom file format with pandas

Time: 2017-10-15 23:07:20

Tags: python pandas dataframe io

My data is in the following format:

1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2

I need to convert it into a dataframe with the following columns:

id job      grade Bool.IsMale Bool.IsFemale Bool.IsAlive Bool.IsNorthAmerican Bool.IsFromUSA Name Children
1  engineer 1     True        False         False        True                 True           blah NaN
2  lawyer   7     False       True          True         False                False          NaN  2

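I could preprocess this data in Python and then call pd.DataFrame on the result, but I am wondering whether there is a better way.

Update: I ended up doing the following. If there are any obvious optimizations, please let me know.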

import pandas as pd

# vwfile is the path to the input file; it is assumed to be defined elsewhere.
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #    "1_engineer_grade1",
        #    "Boolean IsMale IsNorthAmerican IsFromUSA",
        #    "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        grade = int(grade[len("grade"):])  # "grade1" -> 1
        features = line[1:]
        # Every known boolean flag defaults to False.
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            # The namespace in the data is "Boolean", not "Bools".
            if category.startswith("Boolean "):
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                feature = category.split(" ")
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'ident': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)
df = pd.DataFrame(data)
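
One possible direction for the optimization question is to let pandas string methods do the splitting instead of a Python-level loop. The sketch below is only an illustration under some assumptions: every line starts with an `id_job_gradeN` tag, there is at most one `Boolean` namespace per line, and every other namespace holds a single `Key value` pair. The names `parse_lines` and `BOOL_FLAGS` are made up for this example, and whether this is actually faster on the real file has not been measured here.

```python
import pandas as pd

# Known boolean flags (taken from the question); flags outside this list are dropped.
BOOL_FLAGS = ["IsMale", "IsFemale", "IsAlive", "IsNorthAmerican", "IsFromUSA"]


def parse_lines(lines):
    """Parse an iterable of raw lines into a DataFrame using pandas string methods."""
    s = pd.Series(list(lines)).str.strip()

    # "1_engineer_grade1" -> id, job, grade (assumes every line matches this pattern)
    head = s.str.extract(r"^(?P<id>\d+)_(?P<job>[^_|\s]+)_grade(?P<grade>\d+)")
    head = head.astype({"id": int, "grade": int})

    # "|Boolean IsMale IsFromUSA" -> one True/False column per flag
    flag_text = s.str.extract(r"\|Boolean ([^|]*)", expand=False).fillna("").str.strip()
    flags = flag_text.str.get_dummies(sep=" ").astype(bool)
    flags = flags.reindex(columns=BOOL_FLAGS, fill_value=False).add_prefix("Bool.")

    # Every other "|Key value" namespace becomes its own column; absent keys stay NaN.
    kv = s.str.extractall(r"\|(?!Boolean )(?P<key>\w+) (?P<value>[^|]*)")
    kv["value"] = kv["value"].str.strip()
    others = kv.droplevel("match").set_index("key", append=True)["value"].unstack("key")

    return pd.concat([head, flags, others], axis=1)


sample = [
    "1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah",
    "2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2",
]
result = parse_lines(sample)
print(result)
```

The values in the non-boolean columns (`Name`, `Children`) stay as strings in this sketch; converting them to numeric types would be a separate step.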

UPDATE-2: A one-million-line file takes about 55 seconds to process.
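
To compare candidate optimizations against a figure like this, a small timing harness on synthetic data can help. The following is a rough sketch, not the original benchmark: `make_lines` and `parse` are hypothetical helpers, the generated values are made up, and `parse` is a simplified stand-in for the row-by-row loop, so absolute numbers will differ from the real file.

```python
import time

import pandas as pd


def make_lines(n):
    """Generate n synthetic lines in the question's format (made-up values)."""
    return [
        f"{i}_engineer_grade{i % 9 + 1} |Boolean IsMale IsFromUSA |Name blah"
        for i in range(n)
    ]


def parse(lines):
    """Row-by-row parse: build one dict per line, then a DataFrame from the list."""
    rows = []
    for line in lines:
        head, *features = [part.strip() for part in line.split("|")]
        ident, job, grade = head.split("_")
        row = {"ident": ident, "job": job, "grade": int(grade[len("grade"):])}
        for feature in features:
            namespace, *values = feature.split()
            if namespace == "Boolean":
                # Each listed flag is True for this row.
                row.update(dict.fromkeys(values, True))
            elif values:
                row[namespace] = values[0]
        rows.append(row)
    return pd.DataFrame(rows)


lines = make_lines(10_000)
start = time.perf_counter()
df = parse(lines)
elapsed = time.perf_counter() - start
print(f"parsed {len(df)} rows in {elapsed:.2f} s")
```

Swapping a different implementation in for `parse` and rerunning on the same `lines` gives a like-for-like comparison.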

0 answers:

No answers yet