Extracting certain integers from string values of different lengths that also contain unwanted integers: pattern or position?

Time: 2015-05-14 16:59:39

Tags: python pandas dataframe

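I'm a fairly new programmer looking for help with a problem. I want to extract the ID numbers from a string column into a new column, and then fill in the numbers that are missing.

I'm working with a pandas DataFrame, and I have the following set of street names, some of which have ID numbers while others are missing them: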

*Start station*:
"19th & L St (31224)"
"14th & R St NW (31202)"
"Paul Rd & Pl NW (31602)"
"14th & R St NW"
"19th & L St"
"Paul Rd & Pl NW"

My desired outcome:
*Start station*         *StartStatNum*
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602
"14th & R St NW"        31202
"19th & L St"           31224
"Paul Rd & Pl NW"       31602

I got stuck after my first step, the split. I can split by position like this:

def Stat_Num(Stat_Num):
    return Stat_Num.split('(')[-1].split(')')[0].strip()

db["StartStatNum"] = pd.DataFrame({'Num':db['Start station'].apply(Stat_Num)})

But this gives:
*Start station*         *StartStatNum*
"19th & L St (31224)"        31224
"14th & R St NW (31202)"     31202
"Paul Rd & Pl NW (31602)"    31602
"14th & R St NW"            "14th & R St NW"
"19th & L St"               "19th & L St"
"Paul Rd & Pl NW"           "Paul Rd & Pl NW"

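The problem comes when I want to look up and fill in StartStatNum for the rows where I don't have a station ID number.

I have been trying to understand str.extract, str.contains, and re.findall, and tried the following as possible stepping stones: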

db['Start_S2']  = db['Start_Stat_Num'].str.extract(" ((\d+))")
db['Start_S2']  = db['Start station'].str.contains(" ((\d+))")
db['Start_S2']  = db['Start station'].re.findall(" ((\d+))")

I also tried the following, taken from here:
def parseIntegers(mixedList):
    return [x for x in db['Start station'] if (isinstance(x, int) or isinstance(x, long)) and not isinstance(x, bool)]

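However, when I pass in the values I get a list with one value, 'x'. As a beginner, I'm not sure the pattern route is the best one, because it would also pick up unwanted integers (although I could probably turn those into NaN, since they would be less than 30000, the lowest ID number). I also have a feeling this may be something simple that I'm overlooking, but after about 20 hours and a lot of searching I'm at a bit of a loss.

Any help would be greatly appreciated.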

2 Answers:

Answer 0: (score: 1)

One solution might be to build a DataFrame that holds the mapping station -> id, like this:

import pandas as pd

l = ["19th & L St (31224)",
     "14th & R St NW (31202)",
     "Paul Rd & Pl NW (31602)",
     "14th & R St NW",
     "19th & L St",
     "Paul Rd & Pl NW"]

df = pd.DataFrame({"station": l})
# Named groups become columns; rows without "(id)" give NaN and are dropped.
df_dict = df['station'].str.extract(r"(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
print(df_dict)

 # result:
       station_name     id
 0      19th & L St   31224
 1   14th & R St NW   31202
 2  Paul Rd & Pl NW   31602
 [3 rows x 2 columns]

From there, you can use a list comprehension:

l2 = [ [row["station_name"], row["id"]]
       for line in l
       for k,row in df_dict.iterrows()
       if row["station_name"].strip() in line]

which gives:

 [['19th & L St ', '31224'], 
  ['14th & R St NW ', '31202'], 
  ['Paul Rd & Pl NW ', '31602'], 
  ['14th & R St NW ', '31202'], 
  ['19th & L St ', '31224'], 
  ['Paul Rd & Pl NW ', '31602']]

I'll leave it to you to convert the latter into a DataFrame...

There is probably a better solution for that last part...
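
As one possible way to finish that last step, here is a minimal sketch that replaces the nested loop with a left merge on the station name. The column names `station_name` and `id` come from the named groups above; `lookup`, `names`, and `filled` are names made up for illustration, and matching on the exact (stripped) name rather than the substring test used above is an assumption about the data.

```python
import pandas as pd

l = ["19th & L St (31224)",
     "14th & R St NW (31202)",
     "Paul Rd & Pl NW (31602)",
     "14th & R St NW",
     "19th & L St",
     "Paul Rd & Pl NW"]

df = pd.DataFrame({"station": l})

# Lookup table: one row per station that carries an ID.
lookup = df["station"].str.extract(r"(?P<station_name>.*)\((?P<id>\d+)\)").dropna()
lookup["station_name"] = lookup["station_name"].str.strip()
lookup = lookup.drop_duplicates("station_name")  # avoid duplicated rows in the merge

# Station name with any trailing "(12345)" removed.
names = df["station"].str.replace(r"\s*\(\d+\)$", "", regex=True)

# A left merge keeps the original row order and attaches the ID by exact name.
filled = pd.DataFrame({"station_name": names}).merge(lookup, on="station_name", how="left")
print(filled)
```

The list `l2` from above could also be converted directly with `pd.DataFrame(l2, columns=["station_name", "id"])`; the merge is just a way to skip the intermediate list.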

Answer 1: (score: 1)

Here is an approach that worked for me. First extract the number inside the parentheses:

In [71]:

df['start stat num'] = df['Start station'].str.findall(r'\((\d+)\)').str[0]
df
Out[71]:
             Start station start stat num
0      19th & L St (31224)          31224
1   14th & R St NW (31202)          31202
2  Paul Rd & Pl NW (31602)          31602
3           14th & R St NW            NaN
4              19th & L St            NaN
5          Paul Rd & Pl NW            NaN
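
The question worried that a digit pattern would also pick up street numbers such as the 19 in "19th". A small sketch of why the literal parentheses in the pattern prevent that, using `str.extract` as an alternative to `findall(...).str[0]`. The sample frame is rebuilt here so the snippet runs on its own, and the conversion to a nullable integer column is an optional extra that is not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"Start station": [
    "19th & L St (31224)",
    "14th & R St NW (31202)",
    "14th & R St NW",
]})

# \( and \) match literal parentheses, so only digits wrapped in them are
# captured; the "19" in "19th" and the "14" in "14th" are ignored.
nums = df["Start station"].str.extract(r"\((\d+)\)", expand=False)

# Optional: store the IDs as nullable integers instead of strings.
nums_int = pd.to_numeric(nums).astype("Int64")
print(nums_int)
```

With the pattern anchored to the parentheses, the "less than 30000" filter mentioned in the question should not be needed for data shaped like the sample.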

Now remove the number from the station name, since we no longer need it there:

In [72]:

df['Start station'] = df['Start station'].str.split(r' \(').str[0]
df
Out[72]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW            NaN
4      19th & L St            NaN
5  Paul Rd & Pl NW            NaN

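Now we can fill in the missing station numbers by calling map on the station-name column. Dropping the NaN rows and setting the station name as the index gives a lookup Series, so map looks up each station name and returns its station number: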

In [73]:

df['start stat num'] = df['Start station'].map(df.dropna().set_index('Start station')['start stat num'])
df
Out[73]:
     Start station start stat num
0      19th & L St          31224
1   14th & R St NW          31202
2  Paul Rd & Pl NW          31602
3   14th & R St NW          31202
4      19th & L St          31224
5  Paul Rd & Pl NW          31602
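
One thing to watch for with real data: `map` expects the lookup Series to have a unique index, so if the same station appears with its ID on several rows, the lookup should be de-duplicated before `set_index`. A hedged end-to-end sketch of the three steps above with that extra `drop_duplicates` call; the sample data (including the repeated first row) and the name `lookup` are illustrative assumptions, and `str.extract` / `str.replace` are swapped in for `findall` / `split`.

```python
import pandas as pd

df = pd.DataFrame({"Start station": [
    "19th & L St (31224)",
    "19th & L St (31224)",   # same station repeated with its ID
    "14th & R St NW (31202)",
    "14th & R St NW",
    "19th & L St",
]})

# Step 1: pull the ID out of the parentheses (NaN where there is none).
df["start stat num"] = df["Start station"].str.extract(r"\((\d+)\)", expand=False)

# Step 2: strip the " (12345)" suffix from the station name.
df["Start station"] = df["Start station"].str.replace(r"\s*\(\d+\)$", "", regex=True)

# Step 3: build a name -> ID lookup with one row per station, then map it.
lookup = (df.dropna(subset=["start stat num"])
            .drop_duplicates("Start station")
            .set_index("Start station")["start stat num"])
df["start stat num"] = df["Start station"].map(lookup)
print(df)
```

Stations that never appear with an ID anywhere in the data would still end up as NaN, since there is nothing to look up for them.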