Question

I have a text file that I need to load into a DataFrame. Once loaded, the values are stored in a single column in the following format:

0 Alabama[edit]
1 Auburn (something somethign)
2 Florence (something somethign)
. . .
12 California[edit]
13 Angwin (something something)
14 Arcata (something something)

I have to split the values into two columns, State and RegionName, and State should be the index.

All state names have an [edit] suffix, and region names end with (....). Until I clean the data, I thought I could use [edit] and (..) as masks.

I tried to separate the two "values":

import pandas as pd

df = pd.read_table("file.txt", names=["State", "RegionName"])
state = df[df["State"].str.contains(r"\[edit\]")]
region = df[df["State"].str.contains(r"\s+\(.*\)")]

I then tried to merge these somehow, with no luck. If I try to build a new df from state and region, I get an index error.

I also tried using .str.extract:

df.row.str.extract("(?P<State>\r\[\edit\]")

But I get an error saying that df has no .row (or .str) attribute, and I'm sure the pattern is wrong as well.

Any help would be greatly appreciated.

Thanks and regards

Answer 1

Something like this?

import numpy as np

# Assumes the single loaded column is named 'place'
df['state'] = np.where(df.place.str.contains('edit'), df.place, np.nan)
df['region'] = np.where(df.place.str.contains(r'\('), df.place, np.nan)
df = df.drop('place', axis=1)
# Forward-fill so every region row carries its state
df['state'] = df['state'].ffill()
df = df.set_index('state')

                    region
state   
Alabama[edit]       NaN
Alabama[edit]       Auburn (something somethign)
Alabama[edit]       Florence (something somethign)
California[edit]    NaN
California[edit]    Angwin (something something)
California[edit]    Arcata (something something)
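
Since the question asks for a regex-based split, here is a rough sketch of the same idea using .str.extract with named groups. It is only one possible way to do it, not a definitive solution. The sample frame is made up to mirror the data in the question, and the column name place is an assumption carried over from the snippet above. Unlike the output shown above, this version also strips the [edit] and (...) suffixes and drops the rows that only carry a state name.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data; the column name
# "place" is an assumption about how the file was loaded.
df = pd.DataFrame({"place": [
    "Alabama[edit]",
    "Auburn (something somethign)",
    "Florence (something somethign)",
    "California[edit]",
    "Angwin (something something)",
    "Arcata (something something)",
]})

# Text before "[edit]"; rows without that suffix become NaN.
state = df["place"].str.extract(r"^(?P<State>.+?)\[edit\]", expand=False)

# Text before the first "(" (trailing whitespace excluded);
# state rows have no parenthesis and become NaN.
region = df["place"].str.extract(r"^(?P<RegionName>.+?)\s*\(", expand=False)

result = (
    pd.DataFrame({"State": state.ffill(), "RegionName": region})
    .dropna(subset=["RegionName"])  # drop the rows that only held a state name
    .set_index("State")
)
print(result)
```

If you want to keep the raw suffixes, as in the output above, capture the whole string instead, for example with r"^(?P<State>.+\[edit\])$".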

How to split values from one column into two columns using (preferably) a regex pattern?
