Text filtering algorithms in NLP

Date: 2014-02-09 11:37:05

Tags: java sql nlp yql text-mining

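I am building a question/answering system in which my program reads data from the internet and replies. For this, I wrote a Java program that fetches data through YQL (Yahoo Query Language):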
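    // Needs java.net.URLEncoder; readJsonFromUrl is a helper defined elsewhere in the program.
    String baseUrl = "http://query.yahooapis.com/v1/public/yql?q=";

    // YQL query against the Yahoo Answers search table; the inner quotes are escaped.
    String query = "select ChosenAnswer from answers.search where query=\"what is benzene ring\"";

    // URL-encode the query and ask for a JSON response.
    String fullUrlStr = baseUrl + URLEncoder.encode(query, "UTF-8") + "&format=json";
    JSONObject json = readJsonFromUrl(fullUrlStr);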

Here is a small part of the results I get:

     "ChosenAnswer": "A benzene ring is a hexagonal (6 sided) ring of 6 carbon atoms.  Each carbon has one single bond and one double bond and one hydrogen so that each carbon has 4 bonds total.  The double bonds can alternate with the single bonds so that the result is a pi electron cloud in a ring shape above and below the plane of the carbon ring.  Benzene rings are very common and stable.  Because they have double bonds, they are not saturated.  If you saturated benzene, you would get cyclohexane."
    }
     "ChosenAnswer": "The IUPAC name for Benzene Ring is Benzene. It forms the basis for other IUPAC-named benzene derivatives like 1,2-dimethylbenzene etc. \n\nBenzene as a substituent group is called the phenyl group. (e.g. phenylethylamine\n\nBenzene is the IUPAC name for an aromatic hydrocarbon with the formula C6H6. It is also called benzol, or cyclohexa-1,3,5-triene. \n  \n\nUses of Benzene - As an industrial solvent for fats and oils, rubber, resins etc. As a starting material for dyes, drugs, perfumes and explosives and polymers For dry-cleaning of woollen cl.."
    },
    {
     "ChosenAnswer": "Benzene rings aren't metallic bonds.  Metallic bonds have the special property of having a \"sea of electrons\" which basically means that the electrons don't really belong to any single atom and just kind of flow around the element.  That is pretty much what makes metals conduct electricity well.  Other materials that don't form metallic bonds can conduct electricity as well, but usually these consists of certain ions so they have positive or negative charges.  A benzene rings doesn't have any charge with it so it's definitely not going to conduct electricity in this way."
    },
    {
     "ChosenAnswer": "You can draw the benzene ring in any orientation you would like, point up or side up, although a point up is more common.  You can also draw the alternating single and double bonds any way you want.  Although you should keep in mind that there really aren't alternating single and double bonds.  Every C-C bond in benzene is exactly alike and the bonds have characteristics that are half-way between a single bond and a double bond.  That is why you will also see benzene with a circle in the middle.  Benzene exhibits delocalized pi bonding that accounts for the many interesting properties of C6H6.\n\n."
    },
    {
     "ChosenAnswer": "Benzene and phenyl seems look the same because they are both aromatic and all aromatic compounds are based on benzene C6H6.  Phenyl or phenyl functional group is a hydrocarbon derived from benzene by removing 1 H, making it a C6H5 then attaching it to something else."
    },
    {
     "ChosenAnswer": "since benzene is an aromatic compound it is highly stable which means it has to be activated.you can activate benzene by adding electron donor groups to his ring(electron donor groups: -OH,-CH3).the electron donor groups stabilize the ring(they help to maintain resonance of the ring by delocalizing electrons into it) while the reaction occurs.benzene undergoes the reactions called electrophilic  aromatic substitution so check that out it will be more clear to you then."
    },
    {
     "ChosenAnswer": "C6H6 is benzene.  It is a 6 membered carbon ring each carbon has one H bonded to it and ONE resonance structure is with double bonds alternating between single bonds.  Ary is a radical meaning it is used to describe the benzene ring portion of a molecule that has some other group attached where one of the H's was.  C6H12 if you are discusing a single ring structure is cyclohexane."
    },
    {
     "ChosenAnswer": "benzene ring is mostly a compound of carbon and hydrogen in a hexagon shape structure"
    },

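Most of the answers here are opinion-based (as we all know, anyone can answer on Yahoo Answers), so I have to find a way to filter them. Either I can make my query more effective, for example by using some joins, or I can use an algorithm (perhaps cosine similarity) to filter the answers and keep the top 3 valid ones. Please suggest some algorithms that I can implement in Java to get relevant answers. For example, in the case above the first 'ChosenAnswer' is the most appropriate one. (I know this is a big topic; I only want to know some good algorithms I could use here.)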

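2 answers:

Answer 0: (score: 0)

It depends on what you mean by valid.

You have many options, and the right one depends heavily on the type of query. For example, if valid means more scientific, one possible approach is to match the query to a Wikipedia concept and then use TF-IDF to compute the distance between each Yahoo answer and the wiki document. The answer with the higher similarity gets the higher importance, and so on.

link: http://en.wikipedia.org/wiki/Vector_space_model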
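The TF-IDF comparison described above could be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name `TfIdfRanker`, the naive tokenizer, the smoothed IDF formula, and the hard-coded reference sentence (a stand-in for the text of the matching Wikipedia article) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: ranks answers by TF-IDF cosine similarity to a reference text. */
class TfIdfRanker {

    /** Lowercases and splits on non-alphanumeric characters (a deliberately naive tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Raw term counts for one document. */
    static Map<String, Double> termFrequencies(List<String> tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1.0, Double::sum);
        return tf;
    }

    /** Cosine similarity between two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the indices of the topK answers most similar to the reference, best first. */
    static List<Integer> rank(String reference, List<String> answers, int topK) {
        // The corpus is the reference document (index 0) plus every answer.
        List<Map<String, Double>> tfs = new ArrayList<>();
        tfs.add(termFrequencies(tokenize(reference)));
        for (String a : answers) tfs.add(termFrequencies(tokenize(a)));

        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Double> tf : tfs) {
            for (String term : tf.keySet()) df.merge(term, 1, Integer::sum);
        }

        // Weight each term count by a smoothed IDF (an assumed variant, always positive).
        int n = tfs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Double> tf : tfs) {
            Map<String, Double> v = new HashMap<>();
            for (Map.Entry<String, Double> e : tf.entrySet()) {
                double idf = Math.log((1.0 + n) / (1.0 + df.get(e.getKey()))) + 1.0;
                v.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(v);
        }

        // Score every answer against the reference and sort by descending score.
        double[] scores = new double[answers.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < answers.size(); i++) {
            scores[i] = cosine(vectors.get(0), vectors.get(i + 1));
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(scores[y], scores[x]));
        return new ArrayList<>(order.subList(0, Math.min(topK, order.size())));
    }

    public static void main(String[] args) {
        // Placeholder reference text; in practice this would be the fetched wiki document.
        String reference = "Benzene is a hydrocarbon whose molecule is a ring of six carbon atoms, "
                + "each bonded to one hydrogen atom.";
        List<String> answers = Arrays.asList(
                "A benzene ring is a hexagonal ring of 6 carbon atoms, each with one hydrogen.",
                "Metallic bonds have a sea of electrons that flow around the element.",
                "You can draw the benzene ring in any orientation you would like.");
        System.out.println(rank(reference, answers, 3));
    }
}
```

Calling `rank(reference, answers, 3)` would give the indices of the three answers closest to the reference. Because the IDF weights are computed over such a small corpus, they are only a rough signal; a larger background collection would likely make them more meaningful.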

You can also implement additional weighting schemes and then combine them to obtain the final weight of each answer.
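One simple way to combine several weighting schemes is a weighted average of their scores. The sketch below assumes each scheme maps an answer to a value in [0, 1]; the class name `CombinedScorer`, the two example schemes, and the weights are purely illustrative and not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: combines several per-answer scores into one final weight. */
class CombinedScorer {
    private final List<ToDoubleFunction<String>> schemes = new ArrayList<>();
    private final List<Double> weights = new ArrayList<>();

    /** Registers a scoring scheme (assumed to return values in [0, 1]) with a relative weight. */
    CombinedScorer add(ToDoubleFunction<String> scheme, double weight) {
        schemes.add(scheme);
        weights.add(weight);
        return this;
    }

    /** Weighted average of all registered schemes; 0 if none are registered. */
    double score(String answer) {
        double total = 0, weightSum = 0;
        for (int i = 0; i < schemes.size(); i++) {
            total += weights.get(i) * schemes.get(i).applyAsDouble(answer);
            weightSum += weights.get(i);
        }
        return weightSum == 0 ? 0 : total / weightSum;
    }

    public static void main(String[] args) {
        CombinedScorer scorer = new CombinedScorer()
                // Illustrative scheme 1: longer answers score higher, capped at 300 characters.
                .add(a -> Math.min(a.length(), 300) / 300.0, 1.0)
                // Illustrative scheme 2: does the answer mention the query term at all?
                .add(a -> a.toLowerCase().contains("benzene") ? 1.0 : 0.0, 2.0);
        System.out.println(scorer.score("benzene ring is mostly a compound of carbon and hydrogen"));
    }
}
```

A TF-IDF similarity score could be registered as one more scheme in the same way, and the weights adjusted by trial against answers you have judged by hand.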

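Answer 1: (score: 0)

I don't know how you obtain the list of ChosenAnswers, but I think the problem lies in how you compare the candidate answers: since the query contains only a few words, most answers will end up with similar scores.

I would take the important words from the query (that is, everything except stop words) and fetch their synonyms and definitions from WordNet. That gives you more words with which to build the frequency table for tf-idf, construct a vector space model, or train a naive Bayes classifier. Whichever algorithm you decide to use, this should improve your results.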
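The query expansion step could look roughly like this. Note the assumptions: the stop-word list is a tiny hand-written sample, and the `related` map is a stand-in for a real WordNet lookup (which would need a WordNet library and database); its entries in `main` are hand-written examples, not actual WordNet output. The class name `QueryExpander` is hypothetical as well.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: drops stop words from a query and appends related words. */
class QueryExpander {
    // A tiny illustrative stop-word list; a real one would be much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("what", "is", "a", "an", "the", "of", "in", "to"));

    /**
     * @param query   the raw user query
     * @param related stand-in for a WordNet lookup: word -> synonyms and definition words
     * @return the content words of the query, each followed by its related words, no duplicates
     */
    static Set<String> expand(String query, Map<String, List<String>> related) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String token : query.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            expanded.add(token);
            expanded.addAll(related.getOrDefault(token, Collections.emptyList()));
        }
        return expanded;
    }

    public static void main(String[] args) {
        // Hand-written example entries, NOT real WordNet data.
        Map<String, List<String>> related = new HashMap<>();
        related.put("benzene", Arrays.asList("benzol", "hydrocarbon"));
        related.put("ring", Arrays.asList("cycle"));
        System.out.println(expand("what is benzene ring", related));
    }
}
```

The expanded word set can then be used in place of the bare query when building the tf-idf vectors, so that answers using a synonym of a query term still receive some credit.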