My data is in the following format:
1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah
2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2
I need to convert it into a DataFrame with the following columns:
id job grade Bool.IsMale Bool.IsFemale Bool.IsAlive Bool.IsNorthAmerican Bool.IsFromUSA Name Children
1 engineer 1 True False False True True blah NaN
2 lawyer 7 False True True False False NaN 2
I could preprocess this data in Python and then call pd.DataFrame on the result, but I am wondering whether there is a better way.
UPDATE: I ended up doing the following. If there are any obvious optimizations, please let me know.
import pandas as pd

# vwfile is assumed to hold the path to the input file.
with open(vwfile, encoding='latin-1') as f:
    data = []
    for line in f:
        line = [x.strip() for x in line.strip().split('|')]
        # line == [
        #     "1_engineer_grade1",
        #     "Boolean IsMale IsNorthAmerican IsFromUSA",
        #     "Name blah"
        # ]
        ident, job, grade = line[0].split("_")
        grade = grade[len("grade"):]  # "grade1" -> "1"
        features = line[1:]
        # Every known flag defaults to False and is switched on when present.
        bools = {
            "IsMale": False,
            "IsFemale": False,
            "IsNorthAmerican": False,
            "IsFromUSA": False,
            "IsAlive": False,
        }
        others = {}
        for category in features:
            # The namespace in the data is "Boolean", not "Bools".
            if category.startswith("Boolean "):
                for feature in category.split(' ')[1:]:
                    bools[feature] = True
            else:
                # Split once so that values containing spaces stay intact.
                feature = category.split(" ", 1)
                # feature == ["Name", "blah"]
                others[feature[0]] = feature[1]
        featuredict = {
            'id': ident,
            'job': job,
            'grade': grade,
        }
        featuredict.update(bools)
        featuredict.update(others)
        data.append(featuredict)

df = pd.DataFrame(data)
UPDATE-2: A file with one million lines takes about 55 seconds to process.
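
One possible variant, offered only as a sketch: instead of hard-coding the list of boolean flags, each line is parsed into a dict that stores only the flags it actually contains, and the missing ones are filled in as False once all rows are collected. The names parse_vw_line, parse_vw_lines and BOOL_NAMESPACE are hypothetical, the "Bool." prefix is taken from the desired column names above, and this has not been timed against the 55-second figure, so any speed difference is an assumption that would need measuring.

```python
# Sketch only: parse_vw_line, parse_vw_lines and BOOL_NAMESPACE are
# hypothetical names, not part of pandas or Vowpal Wabbit.
BOOL_NAMESPACE = "Boolean"


def parse_vw_line(line):
    """Turn one '|'-separated line into a flat dict of column -> value."""
    head, *namespaces = [part.strip() for part in line.strip().split("|")]
    ident, job, grade = head.split("_")
    # "grade1" -> "1"; values are kept as strings in this sketch.
    row = {"id": ident, "job": job, "grade": grade.replace("grade", "", 1)}
    for namespace in namespaces:
        if not namespace:
            continue  # skip empty pieces, e.g. from a trailing '|'
        name, *values = namespace.split()
        if name == BOOL_NAMESPACE:
            # Only flags that are present are stored here; absent ones
            # are defaulted to False in parse_vw_lines.
            for flag in values:
                row["Bool." + flag] = True
        else:
            row[name] = " ".join(values)
    return row


def parse_vw_lines(lines):
    """Parse all non-blank lines and default every missing flag to False."""
    rows = [parse_vw_line(line) for line in lines if line.strip()]
    bool_columns = sorted(
        {key for row in rows for key in row if key.startswith("Bool.")}
    )
    for row in rows:
        for column in bool_columns:
            row.setdefault(column, False)
    return rows


sample = [
    "1_engineer_grade1 |Boolean IsMale IsNorthAmerican IsFromUSA |Name blah",
    "2_lawyer_grade7 |Boolean IsFemale IsAlive |Children 2",
]
rows = parse_vw_lines(sample)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as data in the code above; keys that are missing from a row, such as Children in the first line, come out as NaN. Converting id, grade and Children to numeric types is left out of this sketch.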