I have a large Excel file that looks like this:
Timestamp      Text                                        Work        Id
5/4/16 17:52   rain a lot the packs maybe damage.          Delivery    XYZ
5/4/16 18:29   wh. screen                                  Other       ABC
5/4/16 14:54   15107 Lane Pflugerville, TX customer        Delivery    YYY
               called me and his phone number and my
               phone numbers were not masked. thank you
               customer has had a stroke and items were
               missing from his delivery the cleaning
               supplies for his wet vacuum steam
               cleaner. he needs a call back from
               customer support
5/6/16 13:05   How will I know if I                        Signing up  ASX
5/4/16 23:07   an quality                                  Delivery    DFC
I only want to work on the Text column, and drop the rows whose Text is basically just gibberish (rows 2, 4 and 5 in the example above).
I read only the second column, as follows:
import xlrd
book = xlrd.open_workbook("excel.xlsx")
sheet = book.sheet_by_index(0)
for row_index in range(1, sheet.nrows):  # skip heading row
    timestamp, text = sheet.row_values(row_index, end_colx=2)
    print(text)
How do I remove the gibberish rows? My idea is that I need to use nltk with a positive corpus (one without any gibberish) and a negative corpus (only gibberish text) and train a model on them. But how do I implement that? Please help!!
Answer 0 (score: 0)
You can do the following with nltk:
import nltk
# You may need a one-time nltk.download('words') for this corpus.
english_words = set(w.lower() for w in nltk.corpus.words.words())
'a' in english_words          # True
'dog' in english_words        # True
'asdasdase' in english_words  # False
How to get the individual words of a string with nltk:
individual_words_front_string = nltk.word_tokenize('This is my text from text column')
# ['This', 'is', 'my', 'text', 'from', 'text', 'column']
For each row's text column, test the individual words to see whether they are in the English dictionary. If they all are, you know that the row's text column is not gibberish.
If your definition of gibberish vs. non-gibberish differs from the English words in nltk, you can use the same procedure as above, just with a different list of acceptable words.
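As a minimal sketch of that row-level check (the function name is just for illustration; it assumes the english_words set built above):

def is_all_english(text, vocabulary):
    # True if every alphabetic token appears in the vocabulary.
    # A one-time nltk.download('punkt') may be needed for the tokenizer.
    tokens = nltk.word_tokenize(text.lower())
    return all(t in vocabulary for t in tokens if t.isalpha())

# Keep only the non-gibberish rows, e.g.:
# rows = [(ts, text) for ts, text in rows if is_all_english(text, english_words)]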
How do I accept numbers and street addresses?
A simple way to determine whether something is a number:
word = '32423432'
word.isdigit()    # True
word = '32423432ds'
word.isdigit()    # False
Addresses are harder. You can find relevant information here: Parsing Addresses, and probably in many other places. If you have access to lists of cities, states, roads and so on, you can of course apply the same logic as above.
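If you do have such lists, a sketch of a token-level test that also accepts numbers and known place names might look like this (the place_names set is a made-up stand-in for whatever list you have):

place_names = {'pflugerville', 'tx', 'lane'}  # hypothetical: your cities/states/roads

def is_acceptable_token(token, vocabulary, places):
    # A token passes if it is a dictionary word, a plain number, or a known place.
    t = token.lower()
    return t in vocabulary or t.isdigit() or t in places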
Will it fail if any single word is False?
That is for your code to decide. Perhaps you could flag a text as gibberish if x% of its words are not valid? A sketch of that idea follows below.
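Here is that thresholding idea, with an arbitrary 50% cutoff that you would tune on your own data:

def is_gibberish(text, vocabulary, threshold=0.5):
    # Flag the text if fewer than `threshold` of its alphabetic tokens are known words.
    tokens = [t for t in nltk.word_tokenize(text.lower()) if t.isalpha()]
    if not tokens:
        return True  # nothing recognizable at all
    known = sum(1 for t in tokens if t in vocabulary)
    return known / len(tokens) < threshold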
How can I tell whether the grammar is correct?
That is a much bigger topic; a more in-depth explanation can be found at the following link: Checking Grammar. The answer above only checks whether the words are in the nltk corpus, not whether the sentence is grammatically correct.
Answer 1 (score: 0)
Separating the good texts from the 'gibber' is not a trivial task, especially if you are dealing with short messages/chats (which is what this looks like to me).
A misspelled word does not make a sample unusable, and even a syntactically broken sentence should not disqualify the whole text. That is a standard you could apply to newspaper texts, but not to raw user-generated content.
I would annotate a corpus in which you separate the good samples from the bad ones and train a simple classifier on it. The annotation does not have to be a big effort, since the gibberish texts are shorter than the good ones and should be easy to recognize (at least some of them). Also, you could start with a corpus size of roughly 100 data points (50 good / 50 bad) and expand it once the first model more or less works.
Here is some sample code that I have always used for text classification. You need to have scikit-learn and numpy installed:
import re
import random
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Prepare data
def prepare_data(data):
    """
    data is expected to be a list of tuples of category and texts.
    Returns a tuple of a list of labels and a list of texts
    """
    random.shuffle(data)
    return zip(*data)

# Format training data
training_data = [
    ("good", "rain a lot the packs maybe damage."),
    ("good", "15107 Lane Pflugerville, TX customer called me and his phone number and my phone numbers were not masked. thank you customer has had a stroke and items were missing from his delivery the cleaning supplies for his wet vacuum steam cleaner. he needs a call back from customer support "),
    ("gibber", "wh. screen"),
    ("gibber", "How will I know if I")
]
training_labels, training_texts = prepare_data(training_data)

# Format test set
test_data = [
    ("gibber", "an quality"),
    ("good", "<datapoint with valid text>"),
    # ...
]
test_labels, test_texts = prepare_data(test_data)

# Create feature vectors
#
# Convert a collection of text documents to a matrix of token counts.
# See: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(training_texts)
y = training_labels

# Train the classifier
clf = LogisticRegression()
clf.fit(X, y)

# Test performance
X_test = vectorizer.transform(test_texts)
y_test = test_labels
# Generates a list of labels corresponding to the samples
test_predictions = clf.predict(X_test)
# Convert back to the usual format
annotated_test_data = list(zip(test_predictions, test_texts))

# Evaluate predictions
y_test = np.array(test_labels)
print(metrics.classification_report(y_test, test_predictions))
print("Accuracy: %0.4f" % metrics.accuracy_score(y_test, test_predictions))

# Predict labels for unknown texts
data = ["text1", "text2"]
# Important: use the same vectorizer you used for the training.
# When saving the model (e.g. via pickle), always serialize
# classifier & vectorizer.
X = vectorizer.transform(data)
# Now predict the labels for the texts in 'data'
labels = clf.predict(X)
# And put them back together
result = list(zip(labels, data))
# result = [("good", "text1"), ("gibber", "text2")]
A few words on how this works: the count vectorizer tokenizes the text and creates vectors that contain the counts of all the words in the corpus. Based on these vectors, the classifier tries to recognize patterns that distinguish the two categories. A text with only a few, uncommon (because misspelled) words is more likely to end up in the 'gibber' category, while a text with many words that are typical of common sentences (think of all the stop words here: 'I', 'you', 'are' ...) is more likely to be a good text.
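To make that concrete, here is a small self-contained illustration of what the vectorizer produces (the two example texts are taken from the question):

from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer()
counts = demo.fit_transform(["wh. screen",
                             "he needs a call back from customer support"])
print(sorted(demo.vocabulary_))  # the learned vocabulary, in column order
print(counts.toarray())          # one row of token counts per input text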
If this approach works for you, you should also try other classifiers, and use the first model to semi-automatically annotate a larger training corpus.
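One possible sketch of that semi-automatic annotation, reusing clf and vectorizer from above (the texts and the 0.9 confidence cutoff are assumptions to adapt):

unlabeled = ["some new text", "another new text"]  # texts you have not annotated yet
probabilities = clf.predict_proba(vectorizer.transform(unlabeled))
for text, probs in zip(unlabeled, probabilities):
    label = clf.classes_[probs.argmax()]
    if probs.max() >= 0.9:
        print("auto-label as %r: %s" % (label, text))  # confident: add to corpus
    else:
        print("review manually: %s" % text)            # uncertain: annotate by hand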