Pandas: extracting values from an unknown data structure in a pandas Series

Asked: 2018-02-04 10:36:43

Tags: python pandas parsing

I have a pandas Series in which every row contains a string in the following format (a key - value structure):

  

"Customer Name - Eric\nFamily Name - Lammela\nShirt color - white\n\n"
The fields in the string may vary:
"Customer Name - Leo\nFamily Name - Messi\nPants color - black\n"

I want to convert the whole Series into a DataFrame. What is the most efficient way to do this?

1 answer:

Answer 0: (score: 0)

You could try something like this. I tested it with the example you provided.

import json
import re
import pandas as pd

# Store your example strings in a Series
s = pd.Series(["Customer Name - Eric\nFamily Name - Lammela\nShirt color - white\n\n","Customer Name - Leo\nFamily Name - Messi\nPants color - black\n"])

# Define a function to convert each string in the Series to a JSON-formatted string
def str_to_dict(txt):
    txt = txt.rstrip('\n')
    txt = re.sub('^', '{"', txt)
    txt = re.sub(' - ', '": "', txt)
    txt = re.sub('\n', '", "', txt)
    txt = re.sub('$', '"}', txt)
    return txt

# Apply the function to the Series and store the results in a new Series
s1 = s.apply(str_to_dict)

# Parse each JSON string into a dictionary and build the DataFrame
# from the list of dictionaries in a single call
df = pd.DataFrame([json.loads(c) for c in s1])

# Print the df to check the results
print(df)

  Customer Name Family Name Shirt color Pants color
0          Eric     Lammela       white         NaN
1           Leo       Messi         NaN       black
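
The JSON round trip above would fail if a value contained a double quote or a backslash. As a rough alternative sketch, each string can be split into a dictionary directly. The helper name `parse_record` and its `sep` parameter are made up for this illustration, and it assumes every field sits on its own line with " - " between key and value:

```python
import pandas as pd

def parse_record(txt, sep=" - "):
    """Parse 'key - value' lines into a dict (hypothetical helper)."""
    record = {}
    for line in txt.splitlines():
        # Split on the first separator only, so values may contain " - "
        key, found, value = line.partition(sep)
        if found:  # skip blank lines and lines without the separator
            record[key.strip()] = value.strip()
    return record

s = pd.Series([
    "Customer Name - Eric\nFamily Name - Lammela\nShirt color - white\n\n",
    "Customer Name - Leo\nFamily Name - Messi\nPants color - black\n",
])

# Build the DataFrame from a list of dicts; keys missing in a row become NaN
df = pd.DataFrame([parse_record(txt) for txt in s], index=s.index)
print(df)
```

Passing `index=s.index` keeps the original Series labels on the rows, which may matter if the Series is not indexed 0..n-1.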

Hope this helps.