How can I replace various long strings with shorter strings throughout a DataFrame?

Asked: 2016-01-14 16:29:10

Tags: python pandas

I want to replace the long strings in a DataFrame with shorter strings. I have a short dictionary of the replacements I want to make.

import pandas as pd
from StringIO import StringIO

replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
3;This is a long substring1 of text that has lots of words
4;This is substring2 and also contains more text than needed
5;This is substring2 and also contains more text than needed
6;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
8;This is a long substring1 of text that has lots of words
9;Within this string is a short substring that is unique
10;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")
print df

for s in replacement_dict.keys():
    if df['Long String'].str.contains(s):
        df['Long String'] = replacement_dict[df['Long String'].str.contains(s)]

The expected DataFrame looks like this:

   id  Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3

When I run the code above, I get this error:

Traceback (most recent call last):
  File "test.py", line 27, in <module>
    if df['Long String'].str.contains(s):
  File "h:\Anaconda\lib\site-packages\pandas\core\generic.py", line 731, in __nonzero__
    .format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I replace the various long strings with shorter strings throughout the DataFrame?

1 answer:

Answer 0 (score: 1)

You can do this with .replace(). However, you need to modify the dictionary slightly to get the expected result.

replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

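What I did was turn the keys into regular expression strings. Each key now matches everything before and everything after the substring you are looking for. This will matter in a minute.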
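To see what the surrounding .* changes, here is a small sketch using Python's standard re module rather than pandas itself. The sample text comes from the question; the replacement "X" is only a placeholder chosen for illustration.

```python
import re

text = "This is a long substring1 of text that has lots of words"

# Plain key: only the matched piece is swapped, the rest of the text stays.
plain = re.sub("substring1", "X", text)

# Wildcard key: ".*" on both sides stretches the match over the whole
# string, so the replacement overwrites the entire value.
whole = re.sub(".*substring1.*", "X", text)

print(plain)  # This is a long X of text that has lots of words
print(whole)  # X
```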

Next, replace the entire for loop with the following:

df['Long String'] = df['Long String'].replace(replacement_dict, regex=True)
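For readers on Python 3, the pieces above can be put together roughly as follows. This is a sketch under a few assumptions: StringIO is imported from io (the question's code uses the Python 2 module), print is called as a function, and the sample data is trimmed to three rows to keep the example short.

```python
from io import StringIO

import pandas as pd

# Keys are regular expressions that match the whole cell value.
replacement_dict = {
    ".*substring1.*": "substring1",
    ".*substring2.*": "substring2",
    ".*a short substring.*": "substring3",
}

# Trimmed copy of the question's sample data (three of the ten rows).
exampledata = StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
""")

df = pd.read_csv(exampledata, sep=";")

# regex=True tells pandas to treat the dictionary keys as patterns.
df["Long String"] = df["Long String"].replace(replacement_dict, regex=True)
print(df)
```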

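.replace() accepts a dictionary in which the keys are the strings (or, with regex=True, the patterns) to match and the values are the replacement text. The reason for changing the keys to capture everything before and after the substring is that we can now replace the entire value instead of only the small matched piece.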
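If you would rather keep the original dictionary and derive the regex keys from it, something like the following might work. It is only a sketch: the names plain_dict and regex_dict are made up for this example, and re.escape is added as a precaution so that keys containing regex metacharacters (such as "." or "(") are matched literally.

```python
import re

# The dictionary from the question, with plain substrings as keys.
plain_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

# Escape each key, then wrap it in ".*" so the pattern covers the whole value.
regex_dict = {
    ".*" + re.escape(key) + ".*": value
    for key, value in plain_dict.items()
}
```

The resulting regex_dict can be passed to .replace(..., regex=True) in the same way as the hand-written dictionary.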

For example, the dictionary without the .* parts would turn the DataFrame into this:

   id                                        Long String
0   1  This is a long substring1 of text that has lot...
1   2  This is substring2 and also contains more text...
2   3  This is a long substring1 of text that has lot...
3   4  This is substring2 and also contains more text...
4   5  This is substring2 and also contains more text...
5   6  This is substring2 and also contains more text...
6   7    Within this string is substring3 that is unique
7   8  This is a long substring1 of text that has lot...
8   9    Within this string is substring3 that is unique
9  10    Within this string is substring3 that is unique

Notice that the only change you can actually see is in the rows containing "a short substring", because "substring1" and "substring2" are simply being replaced with themselves.

Now, if we add the regex wildcards back, we get:

   id Long String
0   1  substring1
1   2  substring2
2   3  substring1
3   4  substring2
4   5  substring2
5   6  substring2
6   7  substring3
7   8  substring1
8   9  substring3
9  10  substring3
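As a side note on the ValueError in the question: str.contains returns a boolean Series with one flag per row, so it cannot be used directly as the condition of an if statement. If you prefer to keep a loop, one possible approach is to use that Series as a row mask. This is a sketch assuming Python 3 and a trimmed three-row sample; the variable names needle, short, and mask are chosen for illustration.

```python
from io import StringIO

import pandas as pd

# Original dictionary from the question: plain substrings as keys.
replacement_dict = {
    "substring1": "substring1",
    "substring2": "substring2",
    "a short substring": "substring3",
}

df = pd.read_csv(StringIO("""id;Long String
1;This is a long substring1 of text that has lots of words
2;This is substring2 and also contains more text than needed
7;Within this string is a short substring that is unique
"""), sep=";")

for needle, short in replacement_dict.items():
    # One True/False per row; regex=False treats the key as literal text.
    mask = df["Long String"].str.contains(needle, regex=False)
    # Overwrite the whole cell only in the rows where the key was found.
    df.loc[mask, "Long String"] = short

print(df)
```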