我有一个带有文本列的excel文件。我需要做的就是从具有特定单词的每一行的文本列中提取句子。
我尝试过定义一个函数。
import pandas as pd
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
#################Reading in excel file#####################
str_df = pd.read_excel("C:\\Users\\HP\Desktop\\context.xlsx")
################# Defining a function #####################
def sentence_finder(text,word):
sentences=sent_tokenize(text)
return [sent for sent in sentences if word in word_tokenize(sent)]
################# Finding Context ##########################
str_df['context'] = str_df['text'].apply(sentence_finder,args=('snakes',))
################# Output file #################################
str_df.to_excel("C:\\Users\\HP\Desktop\\context_result.xlsx")
但如果我必须找到包含snakes
,venomous
,anaconda
等多个特定字词的句子,有人可以帮助我。这句话应至少有一个字。我无法使用多个单词处理nltk.tokenize
。
要搜索words = ['snakes','venomous','anaconda']
输入Excel文件:
text
1. Snakes are venomous. Anaconda is venomous.
2. Anaconda lives in Amazon.Amazon is a big forest. It is venomous.
3. Snakes,snakes,snakes everywhere! Mummyyyyyyy!!!The least I expect is an anaconda.Because it is venomous.
4. Python is dangerous too.
期望输出:
名为Context的列附加到上面的文本列。上下文列应该是:
1. [Snakes are venomous.] [Anaconda is venomous.]
2. [Anaconda lives in Amazon.] [It is venomous.]
3. [Snakes,snakes,snakes everywhere!] [The least I expect is an anaconda.Because it is venomous.]
4. NULL
提前致谢。
答案 0 :(得分:2)
以下是:
<?php
namespace app\controllers;
use app\models\Login;
class LoginController extends \yii\web\Controller
{
public function actionIndex()
{
// $this->layout = 'loginLayout';
// $this->render('index');
$details = new Login();
$model = $details->getUsers();
$this->render('index',array('model'=>$model));
}
}
你看到有几个问题,因为Private Sub btnShowUsers_Click()
'The User List Schema information requires this magic number. For anyone
'who may be interested, this number is called a GUID or Globally Unique
'Identifier - sorry for digressing
Const conUsers = "{947bb102-5d43-11d1-bdbf-00c04fb92675}"
Dim cnn As ADODB.Connection, fld As ADODB.Field, strUser As String
Dim rst As ADODB.Recordset, intUser As Integer, varValue As Variant
Set cnn = CurrentProject.Connection
Set rst = cnn.OpenSchema(Schema:=adSchemaProviderSpecific, SchemaID:=conUsers)
'Set List Box Heading
strUser = "Computer;UserName;Connected?;Suspect?"
Debug.Print rst.GetString
With rst 'fills Recordset (rst) with User List data
Do Until .EOF
intUser = intUser + 1
For Each fld In .Fields
varValue = fld.Value
'Some of the return values are Null-Terminated Strings, if
'so strip them off
If InStr(varValue, vbNullChar) > 0 Then
varValue = Left(varValue, InStr(varValue, vbNullChar) - 1)
End If
strUser = strUser & ";" & varValue
Next
.MoveNext
Loop
End With
Me!txtTotalNumOfUsers = intUser 'Total # of Users
'Set up List Box Parameters
Me!lstUsers.ColumnCount = 4
Me!lstUsers.RowSourceType = "Value List"
Me!lstUsers.ColumnHeads = False
lstUsers.RowSource = strUser 'populate the List Box
'Routine cleanup chores
Set fld = Nothing
Set rst = Nothing
Set cnn = Nothing
End Sub
由于标点符号而无法正常工作。
更新:处理复数。
这是更新的df:
In [1]: df['text'].apply(lambda text: [sent for sent in sent_tokenize(text)
if any(True for w in word_tokenize(sent)
if w.lower() in searched_words)])
0 [Snakes are venomous., Anaconda is venomous.]
1 [Anaconda lives in Amazon.Amazon is a big forest., It is venomous.]
2 [Snakes,snakes,snakes everywhere!, !The least I expect is an anaconda.Because it is venomous.]
3 []
Name: text, dtype: object
我们可以使用词干分析器(Wikipedia),例如PorterStemmer。
sent_tokenizer
首先,让我们对搜索到的词进行词干和小写:
text
Snakes are venomous. Anaconda is venomous.
Anaconda lives in Amazon. Amazon is a big forest. It is venomous.
Snakes,snakes,snakes everywhere! Mummyyyyyyy!!! The least I expect is an anaconda. Because it is venomous.
Python is dangerous too.
I have snakes
df = pd.read_clipboard(sep='0')
现在我们可以修改以上内容以包括词干:
from nltk.stem.porter import *
stemmer = nltk.PorterStemmer()
如果您只想要子串匹配,请确保searching_words是单数,而不是复数。
searched_words = ['snakes','Venomous','anacondas']
searched_words = [stemmer.stem(w.lower()) for w in searched_words]
searched_words
> ['snake', 'venom', 'anaconda']
顺便说一下,我可能会创建一个带有常规for循环的函数,这个带有列表推导的lambda失控了。