Question

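TL;DR at the end.

For months I have been working on feature extraction from documents using several frameworks, and recently the project has hit a dead end.

My goal is to find any kind of identification string in a document.

Thanks in advance.

Let's say the project is organized into separate units called "modules", and the latest module in development is meant to find identification numbers in documents, as I said.

For example:

"A / 364" is a valid identifier.
"137.23" is a valid identifier.
"05.08.2019" is not (it is a date).

To avoid applying the module to the whole document, and to improve accuracy, I look for labels and extract the text located near them, to the right or below (for now I follow the Western reading order: left to right, top to bottom). This part works.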

Just for the sake of further explanation, imagine that we're going to extract dates, we'll find a Label corresponding to "date" or something along those lines and then we'd apply either regex or some other solution to find a date.
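
As a minimal sketch of that date case: the pattern, the day-first order and the function name `find_dates` are assumptions made up for illustration, not the code the module actually uses.

```python
import re
from datetime import datetime

# Hypothetical pattern: day, month, 4-digit year separated by ".", "/" or "-".
DATE_PATTERN = re.compile(r"\b(\d{1,2})[./-](\d{1,2})[./-](\d{4})\b")

def find_dates(text):
    """Return the calendar-valid day-first dates found in text."""
    dates = []
    for day, month, year in DATE_PATTERN.findall(text):
        try:
            # datetime() raises ValueError for impossible dates such as 99.99.2019
            dates.append(datetime(int(year), int(month), int(day)).date())
        except ValueError:
            continue
    return dates

# Text extracted next to a "date" label
print(find_dates("Date: 05.08.2019"))
# Identifier-like strings should produce no dates
print(find_dates("A / 364"), find_dates("137.23"))
```

This works for dates because their format is constrained; identifiers have no such constraint.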

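The problem is that there is no standardized way of assigning an identifier to something, so you can expect any kind of string.

After this long introduction:

What I have tried:
* Applying one or more regular expressions to the extracted text.

And the solution I'm currently trying (the title gives away the result):
* Converting the extracted text into a single string, replacing each kind of character with a generic character, and then applying n-grams with scikit-learn.

With the following function, any given string is rewritten so that lowercase letters become "a", uppercase letters "b", digits "c", spaces "d", and so on.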

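import re

def st_to_chars(in_string):
    # Map each class of character to one generic symbol (a-j).
    in_string = re.sub(r"[a-z]", "a", in_string)    # lowercase letters
    in_string = re.sub(r"[A-Z]", "b", in_string)    # uppercase letters
    in_string = re.sub(r"[0-9]", "c", in_string)    # digits
    in_string = re.sub(r" ", "d", in_string)        # spaces
    in_string = re.sub(r"[\\/|]", "e", in_string)   # backslash, slash, pipe
    in_string = re.sub(r"[-_]", "f", in_string)     # hyphen, underscore
    in_string = re.sub(r"[.;]", "g", in_string)     # period, semicolon
    in_string = re.sub(r",", "h", in_string)        # comma
    in_string = re.sub(r"[€$¥]", "i", in_string)    # currency symbols
    # Anything not mapped above (i.e. not a lowercase symbol yet) becomes "j"
    in_string = re.sub(r"[^a-z]", "j", in_string)

    return in_string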
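
To make the n-gram step concrete, here is a minimal pure-Python sketch of the character n-grams (lengths 2 to 4) for the mapped forms of "123/2111/0gg" and "27/04/1997". The helper `char_ngrams` is made up for illustration, and it ignores the word-boundary padding that CountVectorizer's `char_wb` analyzer adds, so it only approximates the real features.

```python
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

identifier_grams = char_ngrams("ccceccccecaa")  # mapped "123/2111/0gg"
date_grams = char_ngrams("cceccecccc")          # mapped "27/04/1997"

# N-grams that both strings contain, and those unique to the date
shared = set(identifier_grams) & set(date_grams)
only_date = set(date_grams) - shared
print("shared n-grams:", sorted(shared))
print("n-grams only in the date:", sorted(only_date))
```

In this sketch only one of the date's 12 distinct n-grams ("ecce") is missing from the identifier, which may be part of why the two kinds of string are hard to tell apart with these features.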

At first this looked like a good approach, since we got an accuracy of 0.931280.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Create classifiers and vectorizer
clf = MultinomialNB(alpha=0.1)
clf2 = LinearSVC(random_state=0, tol=1e-5)
vec2 = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), min_df=1)

# Load the labeled strings, drop rows without a code, and shuffle
df = pd.read_csv("dataset.csv", delimiter=",", quotechar='"')
df = df[pd.notnull(df['code'])]
df = df.sample(frac=1).reset_index(drop=True)
df['new_code'] = df['code'].apply(st_to_chars)

# First 45,000 rows for training
y_train = df['label'][0:45000].tolist()
data_train = df['new_code'][0:45000].tolist()
x_train = vec2.fit_transform(data_train)
clf.fit(x_train, y_train)  # the classifier must be fitted before predicting

# Remaining rows for testing
y_true = df['label'][45000:].tolist()
data_test = df['new_code'][45000:].tolist()
x_test = vec2.transform(data_test)

y_pred = clf.predict(x_test)
accuracy_score(y_true, y_pred)

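The dataset contains over 75,000 rows of strings manually labeled True or False. About 55% are 1 (identifiers) and the rest are False.

With the setup above, the results are as follows: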

original strings: ["123/2111/0gg" , "644160949" , "B2921113", "27/04/1997", "foobar"]
replaced strings: ['ccceccccecaa', 'ccccccccc', 'bccccccc', 'cceccecccc', 'aaaaaa']
Expected result: ['1', '0', '0', '0', '0']
Actual result:   ['1', '1', '1', '1', '0']
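
To put numbers on the false positives, here is a small sketch that counts the outcomes for the five examples above, using only the expected and actual lists shown. The helper `confusion_counts` is made up for illustration; five strings are far too few to say anything about the full dataset.

```python
expected = ['1', '0', '0', '0', '0']
actual = ['1', '1', '1', '1', '0']

def confusion_counts(y_true, y_pred, positive='1'):
    """Return (tp, fp, tn, fn) for a binary labeling."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # predicted identifier, but it is not one
        elif truth == positive:
            fn += 1  # missed a real identifier
        else:
            tn += 1
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts(expected, actual)
accuracy = (tp + tn) / len(expected)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"tp={tp} fp={fp} tn={tn} fn={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On these five strings three of the four positive predictions are false positives, so precision is much lower than recall.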

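Since this approach doesn't seem to work, I'm a bit at a loss. Given that this is ultimately binary text classification, which approach should I try next? I assume strings like these are hard to classify, but I would be fine with even 65-75% accuracy.

I have also tried Multinomial Naive Bayes and SVC.

TL;DR

I'm working on a binary text classifier (True or False) whose purpose is to tell whether a string is an identifier or not.

Using Naive Bayes and CountVectorizer I get 93% accuracy, but with a lot of false positives. I have also tried SVC. The dataset contains about 40k rows of True strings and about 35k rows of False cases, all manually labeled.

Hardware is not a problem.

Can you give me any advice? Is there any way to approach this?

Thank you very much.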

Short string classification with a high false-positive rate. Are we on the right path?

0 answers: