How to group Wikipedia categories in Python?

Asked: 2019-02-11 07:10:14

Tags: python mediawiki wikipedia wikipedia-api mediawiki-api

For each concept of my dataset I have stored the corresponding Wikipedia categories. For example, consider the following five concepts and their corresponding Wikipedia categories.

  • hypertriglyceridemia: ['Category:Lipid metabolism disorders', 'Category:Medical conditions related to obesity']
  • enzyme inhibitor: ['Category:Enzyme inhibitors', 'Category:Medicinal chemistry', 'Category:Metabolism']
  • bypass surgery: ['Category:Surgery stubs', 'Category:Surgical procedures and techniques']
  • perth: ['Category:1829 establishments in Australia', 'Category:Australian capital cities', 'Category:Metropolitan areas of Australia', 'Category:Perth, Western Australia', 'Category:Populated places established in 1829']
  • climate: ['Category:Climate', 'Category:Climatology', 'Category:Meteorological concepts']

As you can see, the first three concepts belong to the medical domain (whereas the remaining two terms are not medical terms).

More precisely, I want to divide my concepts into medical and non-medical. However, it is very difficult to divide the concepts using the categories alone. For instance, even though the two concepts enzyme inhibitor and bypass surgery are both in the medical domain, their categories are very different from each other.

Hence, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery both belong to the medical parent category).

I am currently using pymediawiki and pywikibot. However, I am not restricted to those two libraries and am happy to have solutions that use other libraries as well.

EDIT

As suggested by @IlmariKaronen, I also used the categories of categories, and the results I obtained look as follows (the small font near each category shows the categories of that category). (screenshot of the category hierarchy omitted)

However, I still could not find a way to use these category details to decide whether a given term is medical or non-medical.

Moreover, as pointed out by @IlmariKaronen, using WikiProject details could be a possibility. However, the Medicine WikiProject does not seem to contain all the medical terms, so other WikiProjects would need to be checked as well.

EDIT: My current code for extracting the categories of a Wikipedia concept is as follows. This can be done using pymediawiki or pywikibot as shown below.

  1. Using the library pymediawiki

    from mediawiki import MediaWiki

    wikipedia = MediaWiki()
    p = wikipedia.page('enzyme inhibitor')
    print(p.categories)

  2. Using the library pywikibot

    import pywikibot as pw

    site = pw.Site('en', 'wikipedia')
    print([cat.title() for cat in pw.Page(site, 'enzyme inhibitor').categories()])

The categories of the categories can also be obtained in the same manner, as shown in @IlmariKaronen's answer.

In case you are looking for a longer list of test concepts, please check the link below. https://docs.google.com/document/d/1BYllMyDlw-Rb4uMh89VjLml2Bl9Y7oUlopM-Z4F6pN0/edit?usp=sharing

Note: I do not expect the solution to work 100% of the time (it is enough for me if the proposed algorithm detects many of the medical concepts).

I am happy to provide more details if needed.

6 answers:

Answer 0 (score: 6)

  

"Hence, I would like to know if there is a way to obtain the parent category of the categories (for example, the categories of enzyme inhibitor and bypass surgery belong to the medical parent category)"

MediaWiki categories are themselves wiki pages. A "parent category" is just a category that the "child" category page belongs to. So you can get the parent categories of a category in exactly the same way as you would obtain the categories of any other wiki page.

For example, using pymediawiki:

    c = wikipedia.page('Category:Enzyme inhibitors')
    print(c.categories)
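The same idea can be illustrated offline with a toy category graph (all names below are made up for the example; real data would come from repeated category lookups as described above): walking upward from a concept's categories tells us whether a target parent such as Category:Medicine is an ancestor.

```python
from collections import deque

# Toy parent-category graph (illustrative, not real Wikipedia data):
# each page or category maps to the categories it belongs to.
PARENTS = {
    'enzyme inhibitor': ['Category:Enzyme inhibitors'],
    'Category:Enzyme inhibitors': ['Category:Medicinal chemistry'],
    'Category:Medicinal chemistry': ['Category:Medicine'],
    'Perth': ['Category:Australian capital cities'],
    'Category:Australian capital cities': ['Category:Cities in Australia'],
}

def has_ancestor(page, target, max_depth=10):
    """Breadth-first walk up the category graph; True if `target` is reachable."""
    queue = deque([(page, 0)])
    seen = {page}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return True
        if depth < max_depth:
            for parent in PARENTS.get(node, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
    return False

print(has_ancestor('enzyme inhibitor', 'Category:Medicine'))  # True
print(has_ancestor('Perth', 'Category:Medicine'))             # False
```

In practice each `PARENTS` lookup would be one more page request, so you would want to cache results and cap the depth, as done here with `max_depth`.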

Answer 1 (score: 5)

Solution overview

Okay, I would approach this problem from multiple directions. There are some great suggestions here, and if I were you I would use an ensemble of these approaches (majority voting: predict the label that more than 50% of the classifiers agree on, in your binary case).

I am considering the following approaches:

  • Active learning (an example approach I provide below)
  • MediaWiki backlinks, provided as an answer by @TavoGC
  • SPARQL ancestor categories provided as a comment to your question by @Stanislav Kralin, and/or parent categories provided by @Meena Nagarajan (those two could be an ensemble on their own based on their differences, but for that you would have to contact both creators and compare their results).

This way, two out of three would have to agree that a certain concept is a medical one, which minimizes the chance of an error further.
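The two-out-of-three voting described above amounts to a few lines; a minimal sketch (the classifier names in the comment stand for the three approaches listed):

```python
def majority_vote(predictions):
    """Return the label that more than half of the binary classifiers agree on."""
    return sum(predictions) > len(predictions) / 2

# e.g. active learning says True, backlinks says True, SPARQL ancestors says False
print(majority_vote([True, True, False]))   # True: classified as medical
print(majority_vote([True, False, False]))  # False: classified as non-medical
```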

While we're at it, I would argue against the approach presented by @anand_v.singh in this answer, because:

  • the distance metric should not be Euclidean; cosine similarity is a much better metric (used, e.g., by spaCy), as it does not take the magnitude of the vectors into account (and it shouldn't, since that is how word2vec or GloVe were trained)
  • many artificial clusters would be created if I understood correctly, while we only need two: medicine and non-medicine. Furthermore, the centroid of medicine is not centered on the medicine itself. This poses additional problems, e.g. the centroid being moved far away from medicine, so that other words like, say, computer or human (or any other word not fitting, in your opinion, into medicine) might get into the cluster.
  • it is really hard to evaluate the results; even more so, the matter is strictly subjective. Furthermore, word vectors are hard to visualize and understand (casting them into lower dimensions [2D/3D] using PCA/t-SNE/similar for so many words would give us totally non-sensical results [yes, I have tried it; PCA gets around 5% explained variance for your longer dataset, which is really, really low]).
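A quick numpy check (with made-up vectors) of the first point: cosine similarity ignores vector magnitude, while Euclidean distance does not.

```python
import numpy as np

def cosine_similarity(u, v):
    # Normalized dot product; the magnitude of the inputs cancels out
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a  # same direction, ten times the magnitude

print(cosine_similarity(a, b))  # ~1.0: direction is all that matters
print(np.linalg.norm(a - b))    # large: Euclidean distance punishes magnitude
```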

Based on the problems highlighted above, I have come up with a solution using active learning, which is a pretty forgotten approach to such problems.

Active learning approach

In this subset of machine learning, when we have a hard time coming up with an exact algorithm (like what it means for a term to be part of the medical category), we ask a human "expert" (who doesn't actually have to be an expert) to provide some answers.

Knowledge encoding

As @anand_v.singh pointed out, word vectors are one of the most promising approaches, and I will use them here as well (differently though, and IMO in a far cleaner and easier fashion).

I am not going to repeat his points in my answer, so I will just add my two cents:

  • do not use contextualized word embeddings as the currently available state of the art (e.g. BERT)
  • check how many of your concepts have no representation (e.g., are represented as a vector of zeros). It should be checked (and it is checked in my code; there will be further discussion when the time comes), and you may use the embedding which has most of them present.
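The second point (counting concepts without a representation) can be sketched like this; the embedding table is a made-up stand-in for a real model's lookups:

```python
import numpy as np

# Made-up embedding table; an all-zero vector marks a missing representation
embeddings = {
    "enzyme": np.array([0.5, 0.1]),
    "perth": np.array([0.3, 0.9]),
    "xyzzy": np.zeros(2),  # concept unknown to the embedding model
}

missing = [word for word, vec in embeddings.items() if not vec.any()]
print(missing)  # ['xyzzy']
```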

Measuring similarity using spaCy

This class measures the similarity between medicine (encoded as spaCy's GloVe word vectors) and every other concept.

class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)

This code returns a number for each concept measuring how similar it is to the centroid. Furthermore, it records the indices of concepts missing their representation. It might be called like this:

import json
import typing

import numpy as np
import spacy

nlp = spacy.load("en_vectors_web_lg")

centroid = nlp("medicine")

concepts = json.load(open("concepts_new.txt"))
concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
    concepts
)

You can substitute new_concepts.json with your data.

Look at spacy.load and notice that I have used en_vectors_web_lg. It consists of 685,000 unique word vectors (which is a lot) and may work out of the box for your case. You have to download it separately after installing spaCy; more info is provided in the links above.

Additionally, you may want to use multiple centroid words, e.g., add words like disease or health and average their word vectors. I am not sure whether that would affect your case positively though.
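Averaging several centroid words could look like this minimal numpy sketch; the vectors are made up and would really come from nlp("word").vector:

```python
import numpy as np

# Made-up word vectors standing in for nlp("word").vector
vectors = {
    "medicine": np.array([0.9, 0.1, 0.0]),
    "disease":  np.array([0.8, 0.3, 0.1]),
    "health":   np.array([0.7, 0.2, 0.2]),
}

# Element-wise mean of the three vectors becomes the new centroid
centroid = np.mean(list(vectors.values()), axis=0)
print(centroid)  # [0.8 0.2 0.1]
```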

Another possibility would be to use multiple centroids and calculate the similarity between each concept and each of those centroids. We might have a few thresholds in such a case; this is likely to remove some false positives, but it may miss some terms which one could consider similar to medicine. Furthermore, it would complicate the case much more; if your results are unsatisfactory, you should consider the two options above first (and only then; don't jump into this approach without some forethought).

Now we have a rough measure of a concept's similarity. But what does it mean that a certain concept has 0.1 positive similarity to medicine? Is it a concept one should classify as medical? Or maybe that's too far away already?

Asking the expert

To get a threshold (below which terms will be considered non-medical), the easiest way is to ask a human to classify some of the concepts for us (and that is what active learning is about). Yes, I know it is a really simple form of active learning, but I would consider it such anyway.

I have written a class with an sklearn-like interface that asks a human to classify concepts until the optimal threshold (or the maximum number of iterations) is reached.

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        max_steps: int,
        samples: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.max_steps: int = max_steps
        self.samples: int = samples
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1
  • the samples argument describes how many examples will be shown to the expert during each iteration (this is a maximum; fewer are returned if samples were already requested or there are not enough of them to show).
  • step represents the threshold drop during each iteration (we start at 1, meaning perfect similarity).
  • change_multiplier: if the expert answers that the concepts are not related (or mostly not related, as multiple of them are returned), the step is multiplied by this floating-point number. It is used to pinpoint the exact threshold between step changes at each iteration.
  • concepts are sorted based on their similarity (the more similar a concept is, the higher it ranks).
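To make the interplay of step and change_multiplier concrete, here is a standalone sketch of just the threshold updates (the expert loop and the min/max bound checks of the real class are omitted): a "y" answer lowers the threshold by step, an "n" answer shrinks step and backs the threshold off.

```python
def simulate_threshold(answers, step=0.05, change_multiplier=0.7):
    """Replay a sequence of expert answers and return the final threshold."""
    threshold = 1.0
    for answer in answers:
        if answer == "y":      # still medical: keep lowering the threshold
            threshold -= step
        else:                  # crossed the boundary: shrink step, back off
            step *= change_multiplier
            threshold += step
    return round(threshold, 4)

print(simulate_threshold(["y", "y", "y", "n"]))  # 0.885
```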

The function below asks the expert for an opinion and finds the optimal threshold based on the expert's answers.

def _ask_expert(self, available_concepts_indices):
    # Get random concepts (the ones above the threshold)
    concepts_to_show = set(
        np.random.choice(
            available_concepts_indices, len(available_concepts_indices)
        ).tolist()
    )
    # Remove those already presented to an expert
    concepts_to_show = concepts_to_show - self._checked_concepts
    self._checked_concepts.update(concepts_to_show)
    # Print message for an expert and concepts to be classified
    if concepts_to_show:
        print("\nAre those concepts related to medicine?\n")
        print(
            "\n".join(
                f"{i}. {concept}"
                for i, concept in enumerate(
                    self.concepts[list(concepts_to_show)[: self.samples]]
                )
            ),
            "\n",
        )
        return input("[y]es / [n]o / [any]quit ")
    return "y"

An example question looks like this:

Are those concepts related to medicine?                                                      

0. anesthetic drug                                                                                                                                                                         
1. child and adolescent psychiatry                                                                                                                                                         
2. tertiary care center                                                     
3. sex therapy                           
4. drug design                                                                                                                                                                             
5. pain disorder                                                      
6. psychiatric rehabilitation                                                                                                                                                              
7. combined oral contraceptive                                
8. family practitioner committee                           
9. cancer family syndrome                          
10. social psychology                                                                                                                                                                      
11. drug sale                                                                                                           
12. blood system                                                                        

[y]es / [n]o / [any]quit y

... and parsing the expert's answer:

# True - keep asking, False - stop the algorithm
def _parse_expert_decision(self, decision) -> bool:
    if decision.lower() == "y":
        # You can't go higher as current threshold is related to medicine
        self._max_threshold = self.threshold_
        if self.threshold_ - self.step < self._min_threshold:
            return False
        # Lower the threshold
        self.threshold_ -= self.step
        return True
    if decision.lower() == "n":
        # You can't go lower than this, as the current threshold is already not related to medicine
        self._min_threshold = self.threshold_
        # Multiply the step to pinpoint the exact spot
        self.step *= self.change_multiplier
        if self.threshold_ + self.step < self._max_threshold:
            return False
        # Raise the threshold back up
        self.threshold_ += self.step
        return True
    return False

Finally, the whole code of ActiveLearner, which finds the optimal similarity threshold according to the expert:

class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Multiply the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold back up
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self

All in all, you will have to answer some questions manually, but this approach is way more accurate in my opinion.

Furthermore, you don't have to go through all of the samples, just a small subset of them. You may decide how many samples constitute a medical term (should 40 medical samples and 10 non-medical samples shown still be considered medical?), which lets you fine-tune this approach to your preferences. If there is an outlier (say, 1 out of 50 samples is non-medical), I would consider the threshold to still be valid.

Once again: this approach should be mixed with the others in order to minimize the chance of wrong classification.

Classifier

When we obtain the threshold from the expert, classification is instantaneous. Here is a simple classification class:

class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions

And for brevity, here is the final source code:

import json
import typing

import numpy as np
import spacy


class Similarity:
    def __init__(self, centroid, nlp, n_threads: int, batch_size: int):
        # In our case it will be medicine
        self.centroid = centroid

        # spaCy's Language model (english), which will be used to return similarity to
        # centroid of each concept
        self.nlp = nlp
        self.n_threads: int = n_threads
        self.batch_size: int = batch_size

        self.missing: typing.List[int] = []

    def __call__(self, concepts):
        concepts_similarity = []
        # nlp.pipe is faster for many documents and can work in parallel (not blocked by GIL)
        for i, concept in enumerate(
            self.nlp.pipe(
                concepts, n_threads=self.n_threads, batch_size=self.batch_size
            )
        ):
            if concept.has_vector:
                concepts_similarity.append(self.centroid.similarity(concept))
            else:
                # If document has no vector, it's assumed to be totally dissimilar to centroid
                concepts_similarity.append(-1)
                self.missing.append(i)

        return np.array(concepts_similarity)


class ActiveLearner:
    def __init__(
        self,
        concepts,
        concepts_similarity,
        samples: int,
        max_steps: int,
        step: float = 0.05,
        change_multiplier: float = 0.7,
    ):
        sorting_indices = np.argsort(-concepts_similarity)
        self.concepts = concepts[sorting_indices]
        self.concepts_similarity = concepts_similarity[sorting_indices]

        self.samples: int = samples
        self.max_steps: int = max_steps
        self.step: float = step
        self.change_multiplier: float = change_multiplier

        # We don't have to ask experts for the same concepts
        self._checked_concepts: typing.Set[int] = set()
        # Minimum similarity between vectors is -1
        self._min_threshold: float = -1
        # Maximum similarity between vectors is 1
        self._max_threshold: float = 1

        # Let's start from the highest similarity to ensure minimum amount of steps
        self.threshold_: float = 1

    def _ask_expert(self, available_concepts_indices):
        # Get random concepts (the ones above the threshold)
        concepts_to_show = set(
            np.random.choice(
                available_concepts_indices, len(available_concepts_indices)
            ).tolist()
        )
        # Remove those already presented to an expert
        concepts_to_show = concepts_to_show - self._checked_concepts
        self._checked_concepts.update(concepts_to_show)
        # Print message for an expert and concepts to be classified
        if concepts_to_show:
            print("\nAre those concepts related to medicine?\n")
            print(
                "\n".join(
                    f"{i}. {concept}"
                    for i, concept in enumerate(
                        self.concepts[list(concepts_to_show)[: self.samples]]
                    )
                ),
                "\n",
            )
            return input("[y]es / [n]o / [any]quit ")
        return "y"

    # True - keep asking, False - stop the algorithm
    def _parse_expert_decision(self, decision) -> bool:
        if decision.lower() == "y":
            # You can't go higher as current threshold is related to medicine
            self._max_threshold = self.threshold_
            if self.threshold_ - self.step < self._min_threshold:
                return False
            # Lower the threshold
            self.threshold_ -= self.step
            return True
        if decision.lower() == "n":
            # You can't go lower than this, as the current threshold is already not related to medicine
            self._min_threshold = self.threshold_
            # Multiply the step to pinpoint the exact spot
            self.step *= self.change_multiplier
            if self.threshold_ + self.step < self._max_threshold:
                return False
            # Raise the threshold back up
            self.threshold_ += self.step
            return True
        return False

    def fit(self):
        for _ in range(self.max_steps):
            available_concepts_indices = np.nonzero(
                self.concepts_similarity >= self.threshold_
            )[0]
            if available_concepts_indices.size != 0:
                decision = self._ask_expert(available_concepts_indices)
                if not self._parse_expert_decision(decision):
                    break
            else:
                self.threshold_ -= self.step
        return self


class Classifier:
    def __init__(self, centroid, threshold: float):
        self.centroid = centroid
        self.threshold: float = threshold

    def predict(self, concepts_pipe):
        predictions = []
        for concept in concepts_pipe:
            predictions.append(self.centroid.similarity(concept) > self.threshold)
        return predictions


if __name__ == "__main__":
    nlp = spacy.load("en_vectors_web_lg")

    centroid = nlp("medicine")

    concepts = json.load(open("concepts_new.txt"))
    concepts_similarity = Similarity(centroid, nlp, n_threads=-1, batch_size=4096)(
        concepts
    )

    learner = ActiveLearner(
        np.array(concepts), concepts_similarity, samples=20, max_steps=50
    ).fit()
    print(f"Found threshold {learner.threshold_}\n")

    classifier = Classifier(centroid, learner.threshold_)
    pipe = nlp.pipe(concepts, n_threads=-1, batch_size=4096)
    predictions = classifier.predict(pipe)
    print(
        "\n".join(
            f"{concept}: {label}"
            for concept, label in zip(concepts[20:40], predictions[20:40])
        )
    )

After answering some questions, with a threshold of 0.1 (everything in [-1, 0.1) is considered non-medical, while everything in [0.1, 1] is considered medical), I got the following results:

kartagener s syndrome: True
summer season: True
taq: False
atypical neuroleptic: True
anterior cingulate: False
acute respiratory distress syndrome: True
circularity: False
mutase: False
adrenergic blocking drug: True
systematic desensitization: True
the turning point: True
9l: False
pyridazine: False
bisoprolol: False
trq: False
propylhexedrine: False
type 18: True
darpp 32: False
rickettsia conorii: False
sport shoe: True

As you can see, this approach is far from perfect, so the last section describes possible improvements:

Possible improvements

As mentioned at the beginning, mixing my approach with the other answers would probably rule out ideas like sport shoe belonging to medicine, and the active learning approach would act as a decisive vote in case of a draw between the two heuristics mentioned above.

We could create an active learning ensemble as well. Instead of one threshold, say 0.1, we would use multiple of them (either increasing or decreasing); let's say those are 0.1, 0.2, 0.3, 0.4, 0.5.

Let's say sport shoe gets, for each threshold, the respective True/False answers like this:

True True False False False

Making a majority vote, we would mark it non-medical by 3 out of 5 votes. Besides, a too strict threshold would also be mitigated if the thresholds below it outvote it (in case the True/False answers looked like this: True True True False False).
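The threshold ensemble with majority voting can be sketched as follows (the numbers mirror the example above):

```python
def ensemble_vote(similarity, thresholds):
    """One True/False vote per threshold; the majority wins."""
    votes = [similarity >= t for t in thresholds]
    return sum(votes) > len(votes) / 2

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
# A similarity of 0.25 yields the votes True True False False False
print(ensemble_vote(0.25, thresholds))  # False: marked non-medical
# A similarity of 0.45 yields the votes True True True True False
print(ensemble_vote(0.45, thresholds))  # True: marked medical
```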

The final possible improvement I came up with: in the code above I am using the Doc vector, which is the mean of the word vectors creating the concept. Say one word is missing (a vector consisting of zeros); in that case it would be pushed further away from the medicine centroid. You may not want that (as some niche medical terms, such as abbreviations, might be missing their representation), and in such a case you could average only those vectors which are different from zero.
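Averaging only the non-zero vectors could be sketched like this (toy 2-D vectors; real ones would come from the embedding model):

```python
import numpy as np

def mean_of_present(vectors):
    """Average only the rows that are not all-zero (i.e. have a representation)."""
    vectors = np.asarray(vectors, dtype=float)
    present = np.any(vectors != 0, axis=1)
    if not present.any():
        return np.zeros(vectors.shape[1])
    return vectors[present].mean(axis=0)

words = [[1.0, 1.0], [3.0, 5.0], [0.0, 0.0]]  # last word lacks an embedding
print(mean_of_present(words))  # [2. 3.]: the zero vector no longer drags the mean
print(np.mean(words, axis=0))  # naive mean is pulled toward zero instead
```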

I know this post is quite lengthy, so if you have any questions, post them below.

Answer 2 (score: 4)

In NLP there is a concept of word vectors. What it basically does is, by looking through huge amounts of text, it tries to convert words to multi-dimensional vectors: the smaller the distance between those vectors, the greater the similarity between the words. The good thing is that many people have already generated these word vectors and made them available under very permissive licenses. In your case you are working with Wikipedia, and word vectors trained on its text exist here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Now these would be best suited for this task, since they contain most of the words of Wikipedia's corpus, but if they are not suitable for you, or are removed in the future, you can use one of the alternatives I list further below. That being said, there is a better way to do this: passing your text to TensorFlow's universal language model embed module, where you don't have to do most of the heavy lifting; you can read more about it here. The reason I put it after the Wikipedia text dump is that I have heard people say it is a bit hard to work with when dealing with medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.

Now, using the word embeddings from TensorFlow is simple; just do:

import tensorflow as tf, tensorflow_hub as hub
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as", " List of strings"])
with tf.Session() as session:  # TF1-style session; this hub module predates TF2
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(session.run(embeddings))

Since you might not be familiar with TensorFlow, trying to run just this piece of code might get you into some trouble. Follow this link, where they fully explain how to use it, and from there you should be able to easily modify it to your needs.

That being said, I would recommend first checking out TensorFlow's embed module and its pre-trained word embeddings. If they don't work for you, check out the Wikimedia link; if that doesn't work either, proceed to the concepts of the paper I have linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.

GloVe vectors: https://nlp.stanford.edu/projects/glove/

Facebook's fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Or this one: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz

If you run into problems implementing this after following the colab tutorial, add your question and comments below and we can proceed from there.

编辑为集群主题添加的代码

Brief explanation: rather than using word vectors, I am encoding their summary sentences.

File content.py

def AllTopics():
    topics = []  # list all your topics here; not added for space restrictions
    # iterate over every topic (range(len(topics)-1) would skip the last one)
    for topic in topics:
        yield topic

File summaryGenerator.py

import wikipedia
import pickle
from content import AllTopics  # name must match the function defined in content.py
summary = []
failed = []
for topic in AllTopics():
    try:
        # store (topic, summary) pairs so the labels can be recovered later
        summary.append(tuple((topic, wikipedia.summary(topic))))
    except Exception as e:
        failed.append(tuple((topic, e)))
with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open('failed.txt', 'wb') as fp:
    pickle.dump(failed, fp)  # dump the list itself, not the string 'failed'

File SimilarityCalculator.py

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
import re
import pickle
import sys
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix


try:
    with open("summary.txt", "rb") as fp:   # Unpickling
        summary = pickle.load(fp)
except Exception as e:
    print ('Cannot load the summary file, Please make sure that it exists, if not run Summary Generator first', e)
    sys.exit('Read the error message')

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]
labels = [x[0] for x in summary]
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    message_embeddings = session.run(embed(messages)) # In message embeddings each vector is a second (1,512 vector) and is numpy.ndarray (noOfElemnts, 512)

X = message_embeddings
agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func='deprecated')
agl.fit(X)
dist_matrix = distance_matrix(X,X)
Z = hierarchy.linkage(dist_matrix, 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_

It is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (clone the repository directly and run SummaryGenerator.py), and to upload the similarity.txt via a pull request in case you don't get the expected results. And if possible, also upload the message_embeddings in a csv file as topics and their embeddings.

Edit 2: Switched the similarity generator to hierarchy-based (agglomerative) clustering. I would suggest you keep the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified by viewing some samples, and the results look quite good. You can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here; all you have to do is try a few values of n_clusters, see in which one all the medical terms are grouped together, and then find the cluster_label for that cluster, and you are done. Since we group by summaries here, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.

Answer 3 (score: 4)

You could try classifying the Wikipedia categories by the mediawiki links and backlinks returned for each category.

import re
from mediawiki import MediaWiki

#TermFind will search through a list for a given term
def TermFind(term,termList):
    response=False
    for val in termList:
        if re.match('(.*)'+term+'(.*)',val):
            response=True
            break
    return response

#Find if the links and backlinks lists contain a given term
def BoundedTerm(wikiPage,term):
    aList=wikiPage.links
    bList=wikiPage.backlinks
    response=False
    if TermFind(term,aList)==True and TermFind(term,bList)==True:
         response=True
    return response

container=[]
wikipedia = MediaWiki()
termlist=['hypertriglyceridemia','perth'] # your list of concepts goes here
for val in termlist:
    cpage=wikipedia.page(val)
    if BoundedTerm(cpage,'medicine')==True: # probe word; 'biology' or 'disease' also worked
        container.append('medical')
    else:
        container.append('nonmedical')

The idea is to try to guess a term that is shared by most of the categories; I tried biology, medicine and disease with good results. Maybe you can try using multiple calls of BoundedTerm to create the classification, or a single call for multiple terms, and combine the results. Hope it helps.
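Combining several probe terms, as suggested above, might look like this minimal sketch; the lists are toy stand-ins for wikiPage.links and wikiPage.backlinks, and exact membership replaces the regex match for brevity:

```python
def classify_by_probes(links, backlinks, probes=('medicine', 'biology', 'disease')):
    """'medical' if any probe term appears in both the links and the backlinks."""
    if any(p in links and p in backlinks for p in probes):
        return 'medical'
    return 'nonmedical'

links = ['medicine', 'protein', 'enzyme']       # stand-in for wikiPage.links
backlinks = ['medicine', 'chemistry']           # stand-in for wikiPage.backlinks
print(classify_by_probes(links, backlinks))     # medical
print(classify_by_probes(['city'], ['river']))  # nonmedical
```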

Answer 4 (score: 4)

The wikipedia library is also a good choice to extract the categories from a given page, as wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages if they all share the same title.

In medicine there seem to be a lot of key roots and suffixes, so the approach of finding keywords may be a good way to identify medical terms.

import wikipedia

def categorySorter(targetCats, pagesToCheck, mainCategory):
    targetList = []
    nonTargetList = []
    targetCats = [i.lower() for i in targetCats]

    print('Sorting pages...')
    print('Sorted:', end=' ', flush=True)
    for page in pagesToCheck:

        e = openPage(page)

        def deepList(l):
            for item in l:
                if item[1] == 'SUBPAGE_ID':
                    deepList(item[2])
                else:
                    catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

        if e[1] == 'SUBPAGE_ID':
            deepList(e[2])
        else:
            catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print()
    print()
    print('Results:')
    print(mainCategory, ': ', targetList, sep='')
    print()
    print('Non-', mainCategory, ': ', nonTargetList, sep='')

def openPage(page):
    try:
        pageList = [page, wikipedia.WikipediaPage(page).categories]
    except wikipedia.exceptions.PageError:
        # Page does not exist
        pageList = [page, 'NONEXIST_ID']
    except wikipedia.exceptions.DisambiguationError as e:
        # Recurse into each non-disambiguation option of the disambiguation page
        pageCategories = []
        for i in e.options:
            if '(disambiguation)' not in i:
                pageCategories.append(openPage(i))
        pageList = [page, 'SUBPAGE_ID', pageCategories]
    return pageList

def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):

    # unhash to view the categories of each page
    #print(pageCategories)
    pageCategories = [i.lower() for i in pageCategories]

    any_in = False
    for i in targetCats:
        if i in pageTitle:
            any_in = True
    if any_in:
        print('', end = '', flush=True)
    elif compareLists(targetCats, pageCategories):
        any_in = True

    if any_in:
        targetList.append(pageTitle)
    else:
        nonTargetList.append(pageTitle)

    # Just prints a pretty list, you can comment out until next hash if desired
    if any_in:
        print(pageTitle, '(T)', end='', flush=True)
    else:
        print(pageTitle, '(F)',end='', flush=True)

    if pageTitle != lastPage:
        print(',', end=' ')
    # No more commenting

    return any_in

def compareLists (a, b):
    for i in a:
        for j in b:
            if i in j:
                return True
    return False

The code really just compares lists of keywords and suffixes to each page's title and its categories to determine whether a page is medically related. It also looks at related pages/sub-pages for the bigger topics and determines whether those are related as well. I am not well versed in medicine, so forgive the categories, but here is an example to tag onto the bottom:

medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')

This example list got at least 70% right, as far as I can tell.

Answer 5 (score: 3)

The question appears a little unclear to me and does not seem like a simple problem to solve; it may require some NLP model. Also, the words concept and category are used interchangeably. What I understand is that concepts such as enzyme inhibitor, bypass surgery and hypertriglyceridemia need to be grouped together as medical, and the rest as non-medical. This problem will require more data than just the category names. A corpus is required to train an LDA model (for instance), where the entire text information is fed to the algorithm and it returns the most likely topics for each concept.
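To make the LDA idea concrete, here is a hedged sketch with scikit-learn on a tiny made-up corpus; real usage would feed much more text per concept (e.g. the full Wikipedia article):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny made-up corpus: one document of text per concept
docs = [
    "enzyme inhibitor drug metabolism medicine",
    "bypass surgery surgical procedure medicine hospital",
    "perth australia city river climate",
]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

doc_topics = lda.transform(X)  # one probability distribution over topics per concept
print(doc_topics.shape)        # (3, 2)
```

Each row of doc_topics sums to 1; the concepts whose mass concentrates on the same topic would then be grouped together.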

https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/