How to efficiently parse a substring in a Pandas DataFrame based on a value in a column?

Time: 2017-05-04 21:51:58

Tags: python pandas dataframe

Suppose I have the following table:

random_string|end_location|substring
-------------|------------|---------
HappyBirthday|     4      |Happ
GoodBye      |     5      |GoodB
NaN          |    NaN     |NaN
Haensel      |     2      |Ha
...          |     ...    |...

This table represents the desired output. The initial input is only the first two columns. The question is: how do I get there elegantly? Rows without content should not be processed.

I tried the following:

df['random_string'].str[0:4]  # or [0:5]

The problem with this approach is that every string is cut to the same length, which is not the desired result.

I have also experimented with another approach. It works, but it feels rather inelegant and inefficient to me. Can this be done in a more elegant way, perhaps a vectorized one? Could something based on apply work?

2 Answers:

Answer 0 (score: 2)

Try this:

In [24]: df
Out[24]:
   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
2            NaN           NaN
3        Haensel           2.0

In [25]: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)

In [26]: df[mask]
Out[26]:
   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
3        Haensel           2.0

In [27]: df.loc[mask, 'substring'] = [t[0][:int(t[1])] for t in df[mask].values.tolist()]

In [28]: df
Out[28]:
   random_string  end_location substring
0  HappyBirthday           4.0      Happ
1        GoodBye           5.0     GoodB
2            NaN           NaN       NaN
3        Haensel           2.0        Ha
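
The mask in the session above may look odd at first, so here is a minimal sketch of what it does, assuming the same four-row frame. The name `explicit_mask` is introduced here for illustration only and is not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})

# .str.len() yields NaN for a missing string, and NaN >= 0 evaluates to False,
# so each comparison doubles as a "value is present" test.
mask = (df["random_string"].str.len() >= 0) & (df["end_location"] >= 0)

# A more explicit spelling that selects the same rows for this data
# (assumption: end_location never holds negative numbers).
explicit_mask = df["random_string"].notna() & df["end_location"].notna()

print(mask.tolist())
print(explicit_mask.tolist())
```

Both masks select rows 0, 1 and 3 here. The `notna()` form states the intent more directly, while the comparison form additionally rejects negative end positions.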

Timing for a larger (40K rows) DF:

In [179]: df = pd.concat([df] * 10**4, ignore_index=True)

In [40]: %%timeit
    ...: mask = (df['random_string'].str.len() >= 0) & (df['end_location'] >= 0)
    ...: [t[0][:int(t[1])] for t in df[mask].values.tolist()]
    ...:
10 loops, best of 3: 77.3 ms per loop

In [41]: df.shape
Out[41]: (40000, 2)
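
Outside IPython, a similar measurement can be sketched with the standard `timeit` module. Since the question asks about `apply`, this sketch also times a row-wise `apply` for comparison. The function names `with_comprehension` and `with_apply` are hypothetical, and the numbers will vary by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})
big = pd.concat([small] * 10**4, ignore_index=True)  # 40,000 rows


def with_comprehension(frame):
    """Slice each string in a plain Python loop over the complete rows."""
    rows = frame.dropna()
    return [s[: int(n)] for s, n in zip(rows["random_string"], rows["end_location"])]


def with_apply(frame):
    """Slice each string via a row-wise apply over the complete rows."""
    rows = frame.dropna()
    sliced = rows.apply(lambda r: r["random_string"][: int(r["end_location"])], axis=1)
    return sliced.tolist()


for func in (with_comprehension, with_apply):
    # Average over three calls; a rough figure, not a rigorous benchmark.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f"{func.__name__}: {seconds * 1000:.1f} ms per call")
```

Row-wise `apply` builds a Series object for every row, so it is usually the slower of the two. Running the sketch on your own data is the only reliable way to confirm that.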

Answer 1 (score: 2)

Starting from

df

   random_string  end_location
0  HappyBirthday           4.0
1        GoodBye           5.0
2            NaN           NaN
3        Haensel           2.0

Using a subset

d1 = df.dropna()
rs = d1.random_string.values.tolist()
el = d1.end_location.values.astype(int).tolist()  # Thx @MaxU for `astype(int)`
df.loc[d1.index, 'substring'] = [s[:n] for s, n in zip(rs, el)]

   random_string  end_location substring
0  HappyBirthday           4.0      Happ
1        GoodBye           5.0     GoodB
2            NaN           NaN       NaN
3        Haensel           2.0        Ha
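
If this is needed in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `add_prefix_column` and its parameter names are hypothetical, and unlike the answer's code it returns a copy instead of modifying `df` in place.

```python
import numpy as np
import pandas as pd


def add_prefix_column(frame, text_col, length_col, out_col="substring"):
    """Return a copy of frame where out_col holds text_col[:length_col] per row.

    Rows with a missing text or length are skipped and end up as NaN.
    """
    complete = frame.dropna(subset=[text_col, length_col])
    texts = complete[text_col].tolist()
    lengths = complete[length_col].astype(int).tolist()
    prefixes = pd.Series(
        [s[:n] for s, n in zip(texts, lengths)],
        index=complete.index,
        dtype=object,
    )
    result = frame.copy()
    result[out_col] = prefixes  # index alignment leaves skipped rows as NaN
    return result


df = pd.DataFrame({
    "random_string": ["HappyBirthday", "GoodBye", np.nan, "Haensel"],
    "end_location": [4, 5, np.nan, 2],
})
out = add_prefix_column(df, "random_string", "end_location")
print(out)
```

Building the result as a Series indexed by the surviving rows lets pandas align it against the full frame, so no explicit `.loc` assignment is needed.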