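I am doing binary classification of headline sentences from news articles (deciding whether a news item is politically biased). I am using the Bert embedding from https://pypi.org/project/bert-embedding/ to embed the training sentences (one row, one title sentence) held in a DataFrame, and then feeding the vectorized data into logistic regression, but the shape of the Bert embedding output is not accepted by the logistic regression model. How can I reshape it so that it fits the logistic regression model?
Before this I used TfidfVectorizer, which worked perfectly, and the output was a numpy array like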
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
Each row is the vectorized data for one sentence, an array of size 1903. I have 516 titles in the training data. The output shape is:
train_x.shape: (516, 1903) test_x.shape (129, 1903)
train_y.shape: (516,) test_y.shape (129,)
But after I switched to BertEmbedding, the output vector for one row is a list of numpy arrays, for example
[list([array([ 9.79349554e-01, -7.06475616e-01 ...... ]dtype=float32),
array([ ........ ],dtype=float32), ......................
array([ ........ ],dtype=float32)]
The output shape is:
train_x.shape: (516, 1) test_x.shape (129, 1)
train_y.shape: (516,) test_y.shape (129,)
import pandas as pd
from bert_embedding import BertEmbedding
from sklearn.model_selection import train_test_split

def transform_to_Bert(articles_file: str, classified_articles_file: str):
    # get_df_from_articles_file is a helper defined elsewhere (not shown)
    df = get_df_from_articles_file(articles_file, classified_articles_file)
    df_train, df_test, _, _ = train_test_split(df, df.label, stratify=df.label, test_size=0.2)
    bert_embedding = BertEmbedding()
    df_titles_values = df_train.title.values.tolist()
    result_train = bert_embedding(df_titles_values)
    result_test = bert_embedding(df_test.title.values.tolist())
    # Each result row has two parts: column 'A' and the 'Vector' column
    train_x = pd.DataFrame(result_train, columns=['A', 'Vector'])
    train_x = train_x.drop(columns=['A'])
    test_x = pd.DataFrame(result_test, columns=['A', 'Vector'])
    test_x = test_x.drop(columns=['A'])
    # .values on a one-column DataFrame gives an (n, 1) object array
    test_x = test_x.values
    train_x = train_x.values
    print(test_x)
    print(train_x)
    train_y = df_train.label.values
    test_y = df_test.label.values
    return {'train_x': train_x, 'test_x': test_x, 'train_y': train_y, 'test_y': test_y, 'input_length': train_x.shape[1], 'vocab_size': train_x.shape[1]}
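Column A is the original title string in the result, so I just drop it.
Below is my code using the Tfidf vectorizer, which does work with the logistic model.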
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def transform_to_tfid(articles_file: str, classified_articles_file: str):
    # get_df_from_articles_file is a helper defined elsewhere (not shown)
    df = get_df_from_articles_file(articles_file, classified_articles_file)
    df_train, df_test, _, _ = train_test_split(df, df.label, stratify=df.label, test_size=0.2)
    vectorizer = TfidfVectorizer(stop_words='english')
    # Fit the vocabulary on the training titles only
    vectorizer.fit(df_train.title)
    train_x = vectorizer.transform(df_train.title)
    train_x = train_x.toarray()  # sparse matrix -> dense (n_titles, n_terms) array
    print(type(train_x))
    print(train_x)
    test_x = vectorizer.transform(df_test.title)
    test_x = test_x.toarray()
    print(test_x)
    train_y = df_train.label.values
    test_y = df_test.label.values
    return {'train_x': train_x, 'test_x': test_x, 'train_y': train_y, 'test_y': test_y, 'input_length': train_x.shape[1], 'vocab_size': train_x.shape[1]}
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs')
model.fit(train_x, train_y)
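The error is ValueError: setting an array element with a sequence.
I expected the output shape for Bert, train_x.shape: (516, 1) test_x.shape (129, 1), to resemble the output shape for Tfidf, train_x.shape: (516, 1903) test_x.shape (129, 1903), so that it fits the logistic model.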
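The shape mismatch described above can be reproduced without BERT. The following is a minimal sketch on synthetic data, assuming each cell of the Vector column holds a list of per-token arrays as in the printed output; the 4-dimensional vectors and the token counts are made up for illustration and are not real embedding output.

```python
import numpy as np

# Synthetic stand-in for the 'Vector' column: every cell is a list of
# per-token vectors, and sentences have different numbers of tokens.
# (Assumed sizes: 3 sentences, 4-dimensional vectors.)
rng = np.random.default_rng(0)
token_counts = [3, 5, 2]
cells = [[rng.random(4, dtype=np.float32) for _ in range(n)] for n in token_counts]

# Same layout as DataFrame.values on a single object column: shape (n, 1)
train_x = np.empty((len(cells), 1), dtype=object)
for i, cell in enumerate(cells):
    train_x[i, 0] = cell

print(train_x.shape)  # one column, because each row is a single Python list

# A numeric model needs a 2-D float array, but a cell holding a list
# cannot be converted to a single float.
try:
    train_x.astype(np.float64)
    conversion_error = None
except (ValueError, TypeError) as exc:
    conversion_error = exc

print(type(conversion_error).__name__, conversion_error)
```

The array has one column because every row is a single Python object, so a numeric estimator has no fixed-length feature vector to read from it.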
Answer 0: (score: 0)
Well, either this was my mistake or the library author's convention is a poor one:
[list([array([ 9.79349554e-01, -7.06475616e-01 ...... ]dtype=float32),
array([ ........ ],dtype=float32), ......................
array([ ........ ],dtype=float32)]
is actually:
[[list([array([ 9.79349554e-01, -7.06475616e-01 ...... ]dtype=float32),
array([ ........ ],dtype=float32), ......................
array([ ........ ],dtype=float32)]]
So you have to take the element at index 0 to get at the list of arrays.
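Taking index 0 gives the list of per-token arrays for a sentence, but sentences still differ in token count, so the rows need to be reduced to a fixed length before logistic regression can use them. The sketch below is one possible approach, not something stated in the answer: it averages the token vectors of each sentence (mean pooling). The helper name `pool_sentence_vectors` is hypothetical, and the data is synthetic with made-up sizes.

```python
import numpy as np

def pool_sentence_vectors(x):
    """Turn an (n, 1) object array of per-token vector lists into (n, dim).

    Hypothetical helper: row[0] is the list of token vectors for one
    sentence; averaging over tokens is an assumed pooling choice.
    """
    return np.vstack([np.mean(np.stack(row[0]), axis=0) for row in x])

# Synthetic stand-in for the BERT output (assumed: 3 sentences, 4-dim vectors)
rng = np.random.default_rng(0)
token_counts = [3, 5, 2]
cells = [[rng.random(4, dtype=np.float32) for _ in range(n)] for n in token_counts]
train_x = np.empty((len(cells), 1), dtype=object)
for i, cell in enumerate(cells):
    train_x[i, 0] = cell

features = pool_sentence_vectors(train_x)
print(features.shape)  # (number of sentences, vector width)
```

With real embeddings the same call would give one row per title and one column per embedding dimension, which is the 2-D numeric layout that `model.fit(train_x, train_y)` expects. A title that produces no tokens would make `np.stack` fail, so such rows would need separate handling.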