How to split a string into new dataframe rows based on keywords

Time: 2019-07-29 23:55:29

Tags: python pandas

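I want to split a row into a new row whenever an adverb occurs. However, if several adverbs occur consecutively, I only want to start a new row after the last of them.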

A sample of my dataframe looks like this:

                   
0         but well that's alright 
1 otherwise however we'll have to  
2                       okay sure 
3                           what? 

With adverbs = ['but','well','otherwise','however'], I would like the resulting df to look like this:

    0             but well
    1         that's alright 
    2         otherwise however  
    3         we'll have to  
    2         okay sure 
    3         what? 

2 answers:

Answer 0: (score: 0)

I have a partial solution that may help. You can use the TextBlob package.

With this API you can assign a part-of-speech tag to each word. A list of the possible tags is available here.

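The problem is that the tagging is not perfect, and your definition of an adverb may not match the tagger's. For example, the API tags but as a coordinating conjunction, and well is for some reason tagged as a verb. It still works in most cases.

The split can be done this way: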

from textblob import TextBlob

def adv_split(s):
    """Split s in two right after the last word tagged as an adverb."""
    annotations = TextBlob(s).tags
    # Keep words tagged CC (coordinating conjunction) or RB* (adverb)
    adv_words = [word for word, tag in annotations
                 if tag.startswith('CC') or tag.startswith('RB')]
    # We have at least one adverb
    if adv_words:
        # Cut just after the last tagged word (located by substring search)
        adv_pos = s.index(adv_words[-1]) + len(adv_words[-1])
        return [s[:adv_pos], s[adv_pos:].strip()]
    # No adverb found: return a one-element list so every row is a list
    return [s]

You can then use pandas apply() together with the new explode() method (pandas >= 0.25) to split the dataframe:

import pandas as pd

data = pd.Series(["but well that's alright",
                  "otherwise however we'll have to",
                  "okay sure",
                  "what?"])
data.apply(adv_split).explode()

You get:

0                     but
0     well that's alright
1       otherwise however
1           we'll have to
2               okay sure
3                   what?

This is not quite right, because well is tagged incorrectly, but you get the idea.
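
Since the question supplies an explicit keyword list, a plain word-list match may be enough and would avoid the tagging errors. Below is a minimal sketch under a few assumptions: words are separated by whitespace, keywords are matched case-insensitively, and a keyword with punctuation attached (such as "well,") is not recognised. The names split_after_keywords and ADVERBS are made up for illustration.

```python
# Keyword list taken from the question; the constant name is illustrative.
ADVERBS = {"but", "well", "otherwise", "however"}


def split_after_keywords(text, keywords=ADVERBS):
    """Split text after each run of consecutive keywords.

    Returns a list of pieces. A run such as "but well" stays together,
    and the split happens only after its last keyword.
    """
    words = text.split()
    pieces, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        is_keyword = word.lower() in keywords
        next_is_keyword = (i + 1 < len(words)
                           and words[i + 1].lower() in keywords)
        # Close the current piece only at the end of a keyword run
        if is_keyword and not next_is_keyword:
            pieces.append(" ".join(current))
            current = []
    # Whatever is left after the last keyword run forms the final piece
    if current:
        pieces.append(" ".join(current))
    return pieces


if __name__ == "__main__":
    for line in ["but well that's alright",
                 "otherwise however we'll have to",
                 "okay sure",
                 "what?"]:
        print(split_after_keywords(line))
```

Assuming pandas 0.25 or later, data.apply(split_after_keywords).explode().reset_index(drop=True) should then turn the pieces into consecutively numbered rows.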

Answer 1: (score: 0)

var iconFont = Typeface.CreateFromAsset(Context.Assets, "xxx.ttf");
Control.Typeface = iconFont;