Question

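I want to compare one column of a pandas data frame (it is actually quite large, about 1.5 million rows of text data) against a single string. As a simple sanity check, I only want to try this on the first 100 rows so that it does not take too long to run. A minimal sample of the data frame looks like this: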

Text
Hello, this is Peter, what would you need me to help you with today? I need you
Good Morning, John here, are you calling regarding your cell phone bill? I am not
......

I have a fixed string:

"Can I help you today?"

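My goal is to get a similarity score between each value in the pandas data frame and the fixed string, and then probably just sort by that score. I am still deciding whether to use a Levenshtein, Jaccard, or cosine measure, but that is not my main question.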

Here is the code I wrote:

import nltk
nltk.download()
nltk.download('stopwords')
nltk.download('wordnet')

Levenstein = []
Counter = 0

for x in All_sentences.rows:
    while Counter < 100:
        distance = nltk.edit_distance(All_sentences['Text'], "what I wanted to calling because I lost my  ATM card debit card")
        Levenstein.append(distance)
        Counter +=1

When I run this code, first, a dialog box with the NLTK downloader pops up, showing this error:

[WinError 10060] A connection attempt failed because the connected party did 
not properly respond after a period of time, or established connection 
failed because connected host has failed to respond.

Second, I see a message below the code (which is running but never finishes) that says:

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

I waited a long time and nothing appeared in the output (it is still running; I only see the * indicating that it is still processing).

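What are these messages and, most importantly, why does it take so long to process when I am only comparing a sample of 100 values rather than the whole dataset?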

Answer 1

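I think the problem lies in downloading the NLTK packages. First, make sure your internet connection is working and stable. Then open a terminal and enter the following commands: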

$ python
>>> import nltk
>>> nltk.download('popular')

This opens the Python shell and downloads the popular NLTK packages. Now run your code after removing these lines:

nltk.download()
nltk.download('stopwords')
nltk.download('wordnet')
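Beyond the download issue, the loop in the question passes the whole column All_sentences['Text'] to nltk.edit_distance on every iteration instead of one row's string, so it would not give per-row distances even once the downloads succeed. Below is a minimal sketch of one way to score only the first 100 rows, assuming the column holds plain Python strings. The small data frame here is a hypothetical stand-in for the asker's All_sentences, and the column name "Levenshtein" is just an illustrative choice. As far as I know, nltk.edit_distance is plain Python and does not need any of the downloaded corpora.

```python
import nltk
import pandas as pd

# Hypothetical stand-in for the asker's All_sentences data frame.
All_sentences = pd.DataFrame({
    "Text": [
        "Hello, this is Peter, what would you need me to help you with today?",
        "Good Morning, John here, are you calling regarding your cell phone bill?",
        "Can I help you today?",
    ]
})

target = "Can I help you today?"

# Work on a copy of the first 100 rows only (fewer if the frame is shorter).
sample = All_sentences.head(100).copy()

# Compare each row's string, not the whole column, with the fixed string.
sample["Levenshtein"] = sample["Text"].apply(
    lambda text: nltk.edit_distance(text, target)
)

# A smaller distance means a closer match, so sort in ascending order.
ranked = sample.sort_values("Levenshtein")
print(ranked)
```

Once this works on the sample, dropping .head(100) would apply the same comparison to the full column, though edit distance on 1.5 million long strings will likely be slow.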

Compute the edit distance between pandas data frame column values and a given string
