pandas: extract specific text before or after a hyphen that ends in a given substring

Time: 2018-04-02 22:05:25

Tags: python string pandas substring text-processing

I am new to pandas and have a data frame similar to the following:

import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

  id                                               mill
0  1  Company A Palm Oil Mill – Special Company A of...
1  2                      Company X POM – Company X Ltd
2  3                DDDD Mill – Company New and Old Ltd
3  4                       Company Not Special – R Mill
4  5                 Greatest Company – Great World POM

From the above, I would like to extract just the mill name for each row (the original post showed the desired result as an image).

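Is there a simple way to extract these substrings into the same column? The mill name can appear either before or after the ' - ', but it almost always ends with Palm Oil Mill, POM, or Mill.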

3 answers:

Answer 0: (score: 1)

Previous solution: you can use .str.split() and keep the first piece: df.mill = df.mill.str.split(' –').str[0]
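A quick check on two rows taken from the question shows what this first attempt does and where it falls short. This is only a sketch; the variable name `first_part` is illustrative and not from the original answer.

```python
import pandas as pd

# Two rows taken from the question's data frame
df = pd.DataFrame({'mill': ["Company X POM – Company X Ltd",
                            "Company Not Special – R Mill"]})

# Keep only the text before the first ' –' separator
first_part = df.mill.str.split(' –').str[0]
print(first_part.tolist())
# ['Company X POM', 'Company Not Special']
```

The second row loses its mill name ("R Mill") because the name sits after the separator, which is why a position-independent rule is needed.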

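Update: since you have some extra constraints, you can write your own function (called func below) and put whatever logic you want in it. It loops over the pieces obtained by splitting on ' – ' and returns the first piece that ends with Mill or POM.

For other cases, I recommend Wen's solution.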

import pandas as pd 

df = pd.DataFrame({'id': ["1", "2", "3","4","5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd","DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill","Greatest Company – Great World POM"]})

def func(x):
    # Split the string on the ' – ' separator
    ar = x.split(' – ')

    # No separator found: return the value unchanged
    if len(ar) < 2:
        return x

    # Otherwise return the first piece that ends with 'mill' or 'pom'
    for part in ar:
        if part.lower().endswith(('mill', 'pom')):
            return part

    # Nothing found: return the original string
    return x

df.mill = df.mill.apply(func)

print(df)

Returns:

  id                     mill
0  1  Company A Palm Oil Mill
1  2            Company X POM
2  3                DDDD Mill
3  4                   R Mill
4  5          Great World POM

Answer 1: (score: 1)

IIUC, you can use str.contains with the keywords Palm Oil Mill, POM, and Mill:
s = df.mill.str.split(' – ', expand=True)

df['Name']=s[s.apply(lambda x : x.str.contains('Palm Oil Mill|POM|Mill'))].fillna('').sum(1)
df
Out[230]: 
  id                                               mill  \
0  1  Company A Palm Oil Mill – Special Company A of...   
1  2                      Company X POM – Company X Ltd   
2  3                DDDD Mill – Company New and Old Ltd   
3  4                       Company Not Special – R Mill   
4  5                 Greatest Company – Great World POM   
                      Name  
0  Company A Palm Oil Mill  
1            Company X POM  
2                DDDD Mill  
3                   R Mill  
4          Great World POM  
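One caveat with this approach is that str.contains matches a keyword anywhere in a piece, not only at its end. The sketch below compares an unanchored pattern with one anchored by `$`. The second row is a hypothetical example that is not in the question's data, the names `loose`, `strict` and `name` are illustrative, and the final selection assumes each row splits into exactly two pieces.

```python
import pandas as pd

df = pd.DataFrame({'mill': ["DDDD Mill – Company New and Old Ltd",
                            # Hypothetical row: 'Mill' also appears in the company name
                            "Mill Supplies Ltd – Great World POM"]})

parts = df.mill.str.split(' – ', expand=True)

# Unanchored: True for any piece that contains a keyword anywhere
loose = parts.apply(lambda col: col.str.contains('Palm Oil Mill|POM|Mill'))
# Anchored: True only for pieces that end with a keyword
strict = parts.apply(lambda col: col.str.contains(r'(?:Palm Oil Mill|POM|Mill)$'))

print(loose.sum(axis=1).tolist())   # matching pieces per row: [1, 2]
print(strict.sum(axis=1).tolist())  # matching pieces per row: [1, 1]

# Take piece 0 if it matches, otherwise piece 1 if that one matches
name = parts[0].where(strict[0], parts[1].where(strict[1]))
print(name.tolist())
# ['DDDD Mill', 'Great World POM']
```

With the unanchored pattern both pieces of the second row match, so summing the masked pieces would concatenate them; anchoring the pattern leaves one match per row.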

Answer 2: (score: 1)

You want to split on the hyphen (if there is one) and return the substring that ends with 'Mill' or 'POM':
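A minimal sketch of that idea, not necessarily the code this answer used: split each value on ' – ', put every piece on its own row with Series.explode, and keep the first piece per original row that ends with one of the suffixes. The names `pieces`, `is_mill` and the `Name` column are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({'id': ["1", "2", "3", "4", "5"],
                   'mill': ["Company A Palm Oil Mill – Special Company A of CC Ltd",
                            "Company X POM – Company X Ltd",
                            "DDDD Mill – Company New and Old Ltd",
                            "Company Not Special – R Mill",
                            "Greatest Company – Great World POM"]})

# One row per piece; the original row label is repeated for each piece
pieces = df.mill.str.split(' – ').explode()

# True for pieces that end with 'Mill' or 'POM'
is_mill = pieces.str.contains(r'(?:Mill|POM)$')

# First matching piece per original row; rows with no match become NaN
df['Name'] = pieces[is_mill.to_numpy()].groupby(level=0).first()

print(df['Name'].tolist())
# ['Company A Palm Oil Mill', 'Company X POM', 'DDDD Mill', 'R Mill', 'Great World POM']
```

Values without a separator stay as a single piece, so they are kept only if they already end with one of the suffixes.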
