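How to use pandas to truncate the left and right parts of a sentence

Time: 2018-04-23 14:39:24

Tags: python pandas

Converting the sentence to a list of words and then finding the index of the root string should do the job: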

sentence = "lack of association between the promoter polymorphism of the mtnr1a gene and adolescent idiopathic scoliosis"
root = "mtnr1a"

try:
    words = sentence.split()
    n = words.index(root)
    cutoff = ' '.join(words[n-4:n+5])
except ValueError:
    cutoff = None

print(cutoff)

Result:

promoter polymorphism of the mtnr1a gene and adolescent idiopathic

How can I use this in a pandas DataFrame?

I tried:

sentence = data['sentence'] 
root = data['rootword'] 
def cutOff(sentence,root): 
   try: 
      words = sentence.str.split() 
      n = words.index(root) 
      cutoff = ' '.join(words[n-4:n+5]) 
except ValueError: 
      cutoff = None 
      return cutoff 
data.apply(cutOff(sentence,root),axis=1)

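But it doesn't work...

Edit:

How should the sentence be cut when the root word is in the first position of the sentence (keeping the four words after it), and when the root word is in the last position? For example: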

sentence = "mtnr1a lack of association between the promoter polymorphism of the gene and adolescent idiopathic scoliosis"
out if root in first position:
"mtnr1a lack of association between"
out if root in last position:
"lack of association between the promoter polymorphism of the gene and adolescent idiopathic scoliosis"
"adolescent idiopathic scoliosis mtnr1a"

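1 Answer:

Answer 0: (score: 0)

Two small adjustments to your code should solve the problem:

First, calling apply() on a DataFrame with axis=1 applies the function to each row of that DataFrame, one row at a time.

You don't have to pass the columns in as inputs to the function, and calling sentence.str.split() doesn't make sense. Inside cutOff(), sentence is just a regular string (not a column), so the .str accessor is not available on it.

Change your function to: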
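
To illustrate the point above, here is a minimal sketch that records what apply(axis=1) actually hands to the function. The one-row DataFrame and the helper name inspect_row are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Hypothetical one-row frame, using the column names from the question.
df = pd.DataFrame({
    "sentence": ["the mtnr1a gene and adolescent idiopathic scoliosis"],
    "rootword": ["mtnr1a"],
})

seen = []

def inspect_row(row):
    # With axis=1, `row` is a pandas Series holding one row's values,
    # and row["sentence"] is a plain Python string.
    seen.append((type(row).__name__, type(row["sentence"]).__name__))
    # Plain str methods such as split() work here; the .str accessor does not exist on str.
    return row["sentence"].split()[0]

first_words = df.apply(inspect_row, axis=1)
print(seen[0])
```

Because each value is an ordinary str, sentence.split() is the right call inside the function, while sentence.str.split() raises an AttributeError.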

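def cutOff(sentence, root):
    try:
        words = sentence.split()  # this is the line that was changed
        n = words.index(root)
        # max() keeps the slice start from going negative when root is among the first four words
        cutoff = ' '.join(words[max(n - 4, 0):n + 5])
    except ValueError:
        cutoff = None
    return cutoff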

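Next, you just need to specify which columns are passed as input to the function; you can do this with a lambda (df here is your DataFrame, called data in the question):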

df.apply(lambda x: cutOff(x["sentence"], x["rootword"]), axis=1)
#0    promoter polymorphism of the mtnr1a gene and a...
#dtype: object
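
For the edge cases raised in the question's edit (root word in the first or last position), one possible approach is to clamp the start of the slice at 0. A negative start such as words[-4:5] counts from the end of the list and typically yields an empty result. The sketch below is an assumption about the intended behaviour, not part of the original answer; cut_window, its width parameter, and the sample rows are hypothetical.

```python
import pandas as pd

def cut_window(sentence, root, width=4):
    """Return up to `width` words on each side of `root`, or None if root is absent."""
    words = sentence.split()
    try:
        n = words.index(root)
    except ValueError:
        return None
    start = max(n - width, 0)  # never let the slice start go negative
    # A stop index past the end of the list is fine; slicing just stops at the last word.
    return " ".join(words[start:n + width + 1])

# Hypothetical sample data: root in the middle, first, and last position, plus a miss.
df = pd.DataFrame({
    "sentence": [
        "lack of association between the promoter polymorphism of the mtnr1a gene and adolescent idiopathic scoliosis",
        "mtnr1a lack of association between the promoter polymorphism of the gene and adolescent idiopathic scoliosis",
        "lack of association between the gene and adolescent idiopathic scoliosis mtnr1a",
        "no root word here",
    ],
    "rootword": ["mtnr1a"] * 4,
})

df["cutoff"] = df.apply(
    lambda row: cut_window(row["sentence"], row["rootword"]), axis=1
)
print(df["cutoff"].tolist())
```

With the root in first position this keeps the root plus the four words after it; with the root in last position it keeps up to four words before it. Adjust width if a different window is wanted.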