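Extracting sentences using pandas with specific words

Time: 2016-11-29 08:47:51

Tags: python pandas nltk

I have an Excel file with a text column. All I need to do is extract, for each row, the sentences in the text column that contain specific words.

I have tried defining a function.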

import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

#################Reading in excel file#####################

str_df = pd.read_excel("C:\\Users\\HP\\Desktop\\context.xlsx")

################# Defining a function #####################

def sentence_finder(text,word):
    sentences=sent_tokenize(text)
    return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))

################# Output file #################################
str_df.to_excel("C:\\Users\\HP\\Desktop\\context_result.xlsx")

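But what if I have to find sentences containing any of several specific words, such as snakes, venomous and anaconda? Can someone help? A sentence should contain at least one of the words. I can't work out how to handle multiple words with nltk.tokenize.

Words to search for: words = ['snakes','venomous','anaconda']

Input Excel file: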

                    text
     1.  Snakes are venomous. Anaconda is venomous.
     2.  Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
     3.  Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an    anaconda.Because it is venomous.
     4.  Python is dangerous too.

Expected output:

A column called Context is appended to the text column above. The Context column should look like this:

 1.  [Snakes are venomous.] [Anaconda is venomous.]
 2.  [Anaconda lives in Amazon.] [It is venomous.]
 3.  [Snakes,snakes,snakes everywhere!] [The least I expect is an    anaconda.Because it is venomous.]
 4.  NULL

Thanks in advance.

1 Answer:

Answer 0: (score: 2)

Here's how:

In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent)
                                               if w.lower() in searched_words)])

0    [Snakes are venomous., Anaconda is venomous.]
1    [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2    [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3    []
Name: text, dtype: object

You can see there are a couple of issues, because sent_tokenize did not do its job properly on this punctuation. For example, "Amazon.Amazon" in row 1 has no space after the full stop, so it is not split into two sentences.

Update: handling plurals.

Here's an updated df:

text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes


df = pd.read_clipboard(sep='0')

We can use a stemmer (Wikipedia), such as the PorterStemmer:

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

First, let's stem and lowercase the searched words:

searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words

> ['snake', 'venom', 'anaconda']

Now we can revise the expression above to include stemming as well, by comparing stemmer.stem(w.lower()) against searched_words instead of w.lower().

If you only want substring matching, make sure searched_words is singular, not plural.

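For the substring variant, here is a minimal sketch of the idea. It is not the answer's original code: the helper name has_substring_match is hypothetical, and it checks the whole lowercased sentence rather than individual tokens. It shows why the searched words should be singular.

```python
def has_substring_match(sentence, searched_words):
    """Return True if any searched word occurs anywhere in the sentence.

    Hypothetical helper for illustration: a plain, case-insensitive
    substring test with no tokenizing or stemming.
    """
    lowered = sentence.lower()
    return any(word.lower() in lowered for word in searched_words)


# A singular search word also matches the plural form ...
print(has_substring_match("I have snakes", ["snake"]))   # True
# ... but a plural search word misses the singular form.
print(has_substring_match("I saw a snake", ["snakes"]))  # False
```

Keep in mind that substring matching is loose: "snake" would also match inside a longer word such as "snakeskin".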
By the way, this is where I would probably write a function with a regular for loop; this lambda with list comprehensions is getting out of hand.
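A rough sketch of what such a function could look like is below. Everything in it is an assumption made for illustration: the names find_sentences, split_sentences, split_words and naive_stem are hypothetical, and the three regex-based helpers are crude stand-ins so that the example runs without NLTK or its tokenizer data. In real use you would most likely pass NLTK's sent_tokenize, word_tokenize and PorterStemmer().stem instead.

```python
import re


def split_sentences(text):
    """Crude stand-in for a sentence tokenizer: split after ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Crude stand-in for a word tokenizer: runs of ASCII letters only."""
    return re.findall(r"[A-Za-z]+", sentence)


def naive_stem(word):
    """Crude stand-in for a stemmer: lowercase and drop one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def find_sentences(text, searched_words,
                   sent_split=split_sentences,
                   word_split=split_words,
                   stem=naive_stem):
    """Return the sentences of `text` that contain at least one searched word."""
    stems = {stem(w) for w in searched_words}
    matches = []
    for sentence in sent_split(text):
        for token in word_split(sentence):
            if stem(token) in stems:
                matches.append(sentence)
                break  # one hit is enough; do not add the sentence twice
    return matches


rows = [
    "Snakes are venomous. Anaconda is venomous.",
    "Python is dangerous too.",
    "I have snakes",
]
for row in rows:
    print(find_sentences(row, ["snake", "Venomous", "anacondas"]))
```

With a DataFrame this could be applied as df['text'].apply(find_sentences, args=(searched_words,)), passing the NLTK callables as keyword arguments if you want proper tokenizing and stemming.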