Question

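I have been trying to solve a problem for a while. I have to work with a CSV-like dataset in which one column contains data in the form of an expression. Here is an example of this column's content: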

validate employee="Claire" car="V_13" start="B02" stop="B13" start_date="21072018_095000" stop_date="21072018_103000"

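So I would like to split this column into 6 columns with pandas: employee, car, start, stop, start_date and stop_date, each holding the corresponding value between the quotes.

The dataset is already in a DataFrame.

Thanks in advance.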

Answer 1

You can use Series.str.extractall, followed by some indexing and unstacking:

import pandas as pd

# Assuming the DataFrame has this form (a single string column labelled 0)
df = pd.DataFrame(['''validate employee="Claire" car="V_13" start="B02" stop="B13" start_date="21072018_095000" stop_date="21072018_103000"''','''validate employee="Claire" car="V_13" start="B02" stop="B13" start_date="21072018_095000" stop_date="21072018_103000"'''])

# Extract every key="value" pair, move the keys into the index, then pivot them into columns
df[0].str.extractall(r'(\S+)="(.*?)"').set_index(0, append=True).droplevel(1).unstack(1)

[out]

      1                                                      
0   car employee start       start_date stop        stop_date
0  V_13   Claire   B02  21072018_095000  B13  21072018_103000
1  V_13   Claire   B02  21072018_095000  B13  21072018_103000
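The result above carries a two-level column header (the `1` on top) and sorts the fields alphabetically. A minimal sketch of one way to get plain column names in the original order follows. Selecting the value column before unstacking is a variation on the answer, not part of it, and the date format `%d%m%Y_%H%M%S` is an assumption read off the sample value.

```python
import pandas as pd

row = ('validate employee="Claire" car="V_13" start="B02" stop="B13" '
       'start_date="21072018_095000" stop_date="21072018_103000"')
df = pd.DataFrame([row, row])

# extractall yields two columns: 0 holds the keys, 1 holds the values
pairs = df[0].str.extractall(r'(\S+)="(.*?)"')

wide = (pairs.set_index(0, append=True)  # add the key as a third index level
             .droplevel(1)               # drop the per-row match counter
             [1]                         # keep only the values, as a Series
             .unstack())                 # keys become ordinary column names
wide.columns.name = None

# Restore the field order used in the question
wide = wide[['employee', 'car', 'start', 'stop', 'start_date', 'stop_date']]

# Assumed format: day, month, year, underscore, then HHMMSS
wide['start_date'] = pd.to_datetime(wide['start_date'], format='%d%m%Y_%H%M%S')
wide['stop_date'] = pd.to_datetime(wide['stop_date'], format='%d%m%Y_%H%M%S')
print(wide)
```

Because `wide` keeps the row index of `df`, something like `df.join(wide)` should attach the new columns to the original frame.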

Answer 2

Building on Chris A's answer:

import pandas as pd

# Assuming the DataFrame has this form (a single string column labelled 0)
df = pd.DataFrame(['''validate employee="Claire" car="V_13" start="B02" stop="B13" start_date="21072018_095000" stop_date="21072018_103000"''','''validate employee="Claire" car="V_13" start="B02" stop="B13" start_date="21072018_095000" stop_date="21072018_103000"'''])

# Get the column names and column values
c_names = df[0].str.findall(r'(\S+)=')
c_values = df[0].str.findall(r'"(.*?)"')

# Column names come from the first row, so every row must list the same keys in the same order
pd.DataFrame(list(c_values), columns=c_names[0])
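The snippet above takes its column names from the first row only, so it relies on every row listing the same keys in the same order. If that cannot be guaranteed, one possible variation is to build a dict per row and let pandas align the keys. This is a sketch under that assumption, and the second sample row is made up for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame([
    'validate employee="Claire" car="V_13" start="B02"',
    'validate car="V_07" employee="Dan"',  # hypothetical row: other order, no start
])

# Each match is a (key, value) tuple, so dict() can consume the list directly
pattern = re.compile(r'(\S+)="(.*?)"')
records = [dict(pattern.findall(text)) for text in df[0]]

# pandas aligns the dicts by key and fills missing fields with NaN
out = pd.DataFrame(records, index=df.index)
print(out)
```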

Answer 3

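Assuming df['COL'] holds the values in question, and assuming they always start with "validate ".

We can simply split the rest of the string, e.g. employee="Claire" car="V_13", into a dict such as {'employee': 'Claire', 'car': 'V_13'} and then feed it to pd.Series(), which will neatly turn it into the columns you need. All in all, here is a solution: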

df['COL'].apply(lambda x: pd.Series({t.split('=')[0]: t.split('=')[1].strip('"') for t in x[len('validate '):].split(' ')}))

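Of course, this assumes the strings follow a very strict format, so that simple parsing (such as .split(' ')) is meaningful. You can adapt it to your specific needs and robustness requirements, but the gist is: use pd.Series() with a dict that is parsed from the formatted string.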
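As an illustration of the robustness point, splitting on a plain space breaks as soon as a quoted value contains a space. A minimal sketch of one possible adjustment uses the standard library's `shlex.split`, which respects double quotes. The helper name `parse_fields` and the sample value "Claire Smith" are hypothetical, not part of the original answer.

```python
import shlex
import pandas as pd

def parse_fields(text, prefix='validate '):
    """Hypothetical helper: turn 'validate k="v" ...' into a dict."""
    if text.startswith(prefix):
        text = text[len(prefix):]
    # shlex.split keeps quoted text together and drops the quotes,
    # so 'employee="Claire Smith"' becomes the token 'employee=Claire Smith'
    return dict(token.split('=', 1) for token in shlex.split(text))

df = pd.DataFrame({'COL': ['validate employee="Claire Smith" car="V_13"']})

# Build the frame from a list of dicts instead of one pd.Series per row
parsed = pd.DataFrame(df['COL'].map(parse_fields).tolist(), index=df.index)
print(parsed)
```

Building a single DataFrame from a list of dicts avoids creating a Series for each row, which tends to be noticeably faster on large columns.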

Splitting a column that contains formula-like expressions
