Question

Consider the following pandas.Series object:

import pandas as pd

s = pd.Series(["hello there you would like to sort me", "sorted i would like to be", "the yankees played the red sox", "apple apple banana fruit orange cucumber"])

I want to sort the words within each row, along the lines of the following:

for row in s.index:
    split_words = s.loc[row].split()
    split_words.sort()
    s.loc[row] = " ".join(split_words)

I have a huge dataset, so vectorization matters here. How can I use the pandas str accessor to achieve the same result, but much faster?

Answer 1

In my experience, plain Python lists perform better in cases like this. Applying piRSquared's logic, the list comprehension would be:

[' '.join(sorted(sentence.split())) for sentence in s.tolist()]
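Note that the comprehension above returns a plain list, not a Series. A minimal sketch of wrapping it back into a Series, using the sample data from the question; the helper name `sort_words` is hypothetical, and passing `index=series.index` is an assumption about wanting to keep the original labels:

```python
import pandas as pd

s = pd.Series([
    "hello there you would like to sort me",
    "sorted i would like to be",
    "the yankees played the red sox",
    "apple apple banana fruit orange cucumber",
])

def sort_words(series):
    """Return a new Series with the words of each row sorted (hypothetical helper)."""
    sorted_rows = [" ".join(sorted(sentence.split())) for sentence in series.tolist()]
    # Reuse the original index so the result stays aligned with the input.
    return pd.Series(sorted_rows, index=series.index)

result = sort_words(s)
print(result)
```

Unlike the loop in the question, this leaves `s` untouched and returns a new object.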

For the timings, I used the works of Shakespeare from Peter Norvig's website.

# The squeeze=True keyword was removed in pandas 2.0; squeeze the single column instead.
s = pd.read_table('shakespeare.txt', header=None).squeeze('columns')
s = pd.Series(s.tolist()*10)
r1 = s.str.split().apply(sorted).str.join(' ')
r2 = pd.Series([' '.join(sorted(sentence.split())) for sentence in s.tolist()])

r1.equals(r2)
Out: True

%timeit s.str.split().apply(sorted).str.join(' ')
1 loop, best of 3: 2.71 s per loop

%timeit pd.Series([' '.join(sorted(sentence.split())) for sentence in s.tolist()])
1 loop, best of 3: 1.95 s per loop
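
The comparison above relies on IPython's %timeit and a local copy of the Shakespeare file. A rough sketch of the same comparison with the standard-library `timeit` module, assuming small synthetic data in place of that file (the function names are hypothetical, and absolute numbers will differ by machine and pandas version):

```python
import timeit

import pandas as pd

# Synthetic data (an assumption; the original timings used Shakespeare's works).
s = pd.Series(["the yankees played the red sox", "sorted i would like to be"] * 5000)

def with_str_accessor():
    return s.str.split().apply(sorted).str.join(" ")

def with_list_comprehension():
    return pd.Series([" ".join(sorted(sentence.split())) for sentence in s.tolist()])

# Check that both approaches agree on the values before comparing speed.
same = with_str_accessor().tolist() == with_list_comprehension().tolist()

t_accessor = timeit.timeit(with_str_accessor, number=5)
t_listcomp = timeit.timeit(with_list_comprehension, number=5)
print(f"results match:      {same}")
print(f"str accessor:       {t_accessor:.3f} s for 5 runs")
print(f"list comprehension: {t_listcomp:.3f} s for 5 runs")
```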

Answer 2

Use the string accessor str with split, then apply sorted and join the words back together.

s.str.split().apply(sorted).str.join(' ')

0       hello like me sort there to would you
1                   be i like sorted to would
2              played red sox the the yankees
3    apple apple banana cucumber fruit orange
dtype: object
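
To see what each link in that chain does, here is a small sketch that breaks it into its three steps on two of the sample rows; the intermediate variable names are only for illustration:

```python
import pandas as pd

s = pd.Series(["the yankees played the red sox", "sorted i would like to be"])

tokens = s.str.split()          # each row becomes a list of words
ordered = tokens.apply(sorted)  # sorted() builds a new, alphabetically ordered list per row
joined = ordered.str.join(" ")  # each list is joined back into a single string

print(tokens[0])
print(ordered[0])
print(joined[0])
```

The middle step goes through `apply`, which calls `sorted` once per row in Python, so the chain is convenient rather than truly vectorized.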

How can I sort the values within each row of a pandas Series?
