Question

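Here is what I am trying to do. I have a CSV file. Column 1 contains people's names (e.g. "Michael Jordan", "Anderson Silva", "Muhammad Ali") and column 2 contains each person's ethnicity (e.g. English, French, Chinese).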

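In my code, I load all of the data into a pandas data frame. I then create two more data frames: one with only the Chinese names and one with only the non-Chinese names. From those I build two separate lists.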

The three_split function extracts features from each name by splitting it into three-character substrings. For example, "Katy Perry" becomes "kat", "aty", "ty ", "y p", and so on.
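For illustration, the splitting described above can be sketched on its own. This mirrors the three_split function in the code further down, which also lowercases the name and replaces spaces with underscores, so the space in the example shows up as "_". The name `char_trigrams` is a hypothetical helper used only here.

```python
def char_trigrams(name):
    """Return the overlapping three-character substrings of a name."""
    # Lowercase and replace spaces with underscores, as three_split does.
    text = str(name).lower().replace(" ", "_")
    # Slide a window of width 3 across the string.
    return [text[i:i + 3] for i in range(len(text) - 2)]


print(char_trigrams("Katy Perry"))
# ['kat', 'aty', 'ty_', 'y_p', '_pe', 'per', 'err', 'rry']
```

Names shorter than three characters produce no substrings at all, so they end up with an empty feature set.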

I then train a Naive Bayes classifier on these features and finally test the results.

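My code runs without any errors, but when I test a non-Chinese name taken directly from the dataset and expect the program to return False (not Chinese), it returns True (Chinese) for every name I try. Any ideas?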

import pandas as pd
from pandas import DataFrame, Series
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
from nltk.classify import PositiveNaiveBayesClassifier

# Get csv file into data frame
# Raw string so the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r"C:\Users\KubiK\Dropbox\Python exercises_KW\_Scraping\BeautifulSoup\FamilySearch.org\FamSearch_Analysis\OddNames_sampleData3.csv",
    encoding="utf-8")
df = DataFrame(data)
df.columns = ["name", "ethnicity"]

# Recategorize different ethnicities into 1) Chinese or 2) non-Chinese; and then create separate lists
df_chinese = df[(df["ethnicity"] == "chinese") | (df["ethnicity"] == "Chinese")]
chinese_names = list(df_chinese["name"])

df_nonchinese = df[(df["ethnicity"] != "chinese") & (df["ethnicity"] != "Chinese") & df["ethnicity"].notnull()]
nonchinese_names = list(df_nonchinese["name"])

# Function to split word string into three-character substrings
def three_split(word):
    word = str(word).lower().replace(" ", "_")
    split = 3
    return dict(("contains(%s)" % word[start:start+split], True) 
        for start in range(0, len(word)-2))

# Training naive bayes machine learning algorithm
positive_featuresets = list(map(three_split, chinese_names))
unlabeled_featuresets = list(map(three_split, nonchinese_names))
classifier = PositiveNaiveBayesClassifier.train(positive_featuresets, unlabeled_featuresets)

# Testing results
name = "Hubert Gillies" # A non-Chinese name from the dataset
print(classifier.classify(three_split(name)))
# Output: True (wrong; False was expected)

Answer 1

There are many possible reasons why you are not getting the desired results. Most often it is one of these:

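The features are not strong enough
There is not enough training data
The classifier is the wrong one for the task
There is a bug in the NLTK classifier code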

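The first three cannot be verified or resolved unless you post a link to your dataset so that we can look at how to fix it. As for the last one, the basic NaiveBayes and PositiveNaiveBayes classifiers should not have such a bug.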

So the questions to ask are:

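How many training instances (i.e. rows) do you have?
Why didn't you normalize the labels (e.g. chinese|Chinese -> chinese) after reading the dataset and before extracting the features?
What other features could be considered?
Have you considered using NaiveBayes instead of PositiveNaiveBayes?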
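As a rough sketch of what the questions above point at, the snippet below counts the rows per class, normalizes the labels once, and trains NLTK's plain NaiveBayesClassifier on explicitly labeled examples. It is not a verified fix for the original problem: the real CSV is not available, so the rows are invented toy data, and the `label` column and the `trigram_features` helper are hypothetical names introduced only for this example.

```python
import pandas as pd
from nltk.classify import NaiveBayesClassifier

# Toy rows invented for illustration; the real data would come from the CSV.
df = pd.DataFrame(
    {
        "name": ["Li Wei", "Wang Fang", "Zhang Min",
                 "Hubert Gillies", "Katy Perry", "John Smith"],
        "ethnicity": ["Chinese", "chinese", " Chinese ",
                      "English", "english", None],
    }
)

# Drop rows without a label, then normalize case and whitespace once,
# so "Chinese", "chinese" and " Chinese " all become "chinese".
df = df.dropna(subset=["ethnicity"]).copy()
df["ethnicity"] = df["ethnicity"].str.strip().str.lower()
df["label"] = df["ethnicity"].eq("chinese").map(
    {True: "chinese", False: "non-chinese"}
)

# How many training instances are there per class?
label_counts = df["label"].value_counts()
print(label_counts)


def trigram_features(name):
    """Same idea as three_split: one boolean feature per character trigram."""
    text = str(name).lower().replace(" ", "_")
    return {"contains(%s)" % text[i:i + 3]: True for i in range(len(text) - 2)}


# The plain NaiveBayesClassifier expects (featureset, label) pairs for
# both classes, rather than a positive set plus an unlabeled set.
labeled_featuresets = [
    (trigram_features(name), label)
    for name, label in zip(df["name"], df["label"])
]
classifier = NaiveBayesClassifier.train(labeled_featuresets)

prediction = classifier.classify(trigram_features("Hubert Gillies"))
print(prediction)
```

With real data it would make sense to hold out some rows for testing instead of classifying a name that was also used for training.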

Unable to train Naive Bayes (machine learning) in Python using Pandas and NLTK
