Parsing locations, person names, and dates from strings with NLTK

Time: 2014-02-04 09:29:51

Tags: python nlp nltk corpus

I have many strings like the following:

  1. ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
  2. KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
  3. ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
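
How can I use NLTK to strip the dateline part and identify the date, location, and person names?

With POS tagging I can find the parts of speech, but I need to identify the location, the date, and the person names. How can I do that?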

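Update:

Note: I don't want to make another HTTP request. I need to parse the text with my own code. If there is a library for this, I can use it.

Update:

I tried ne_chunk, but no luck.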

    import nltk

    def pchunk(t):
        # Tokenize, POS-tag, then run NLTK's named-entity chunker.
        w_tokens = nltk.word_tokenize(t)
        pt = nltk.pos_tag(w_tokens)
        ne = nltk.ne_chunk(pt)
        print(ne)

    # txts is a list of those 3 sentences.
    for t in txts:
        print(t)
        pchunk(t)
    

The output is as follows:

    ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
    
    (S
      ISLAMABAD/NNP
      :/:
      Chief/NNP
      Justice/NNP
      (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
      said/VBD
      that/IN
      (ORGANIZATION National/NNP Accountab/NNP))
    
    KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
    
    (S
      (GPE KARACHI/NNP)
      ,/,
      July/NNP
      24/CD
      --/:
      Police/NNP
      claimed/VBD
      to/TO
      have/VB
      arrested/VBN
      several/JJ
      suspects/NNS
      in/IN
      separate/JJ)
    
    ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
    
    (S
      (GPE ALUM/NN)
      (ORGANIZATION KULAM/NN)
      ,/,
      (PERSON Sri/NNP Lanka/NNP)
      --/:
      As/IN
      gray-bellied/JJ
      clouds/NNS
      started/VBN
      to/TO
      blot/VB
      out/RP
      the/DT
      scorchin/NN)
    

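Look closely. KARACHI is recognized correctly as a GPE, but Sri Lanka is recognized as a PERSON, and ISLAMABAD is only tagged NNP instead of being recognized as a GPE.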

2 answers:

Answer 0: (score: 1)

If using an API instead of your own code is compatible with your requirements, this is something the Wit API can easily do for you.

Wit will also resolve date/time tokens into normalized dates.
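
For readers who have to stay local, as the question requires, here is a minimal sketch of what normalizing a date token could look like with only the Python standard library. It is not Wit's API. The function name `normalize_date` and the `year` argument are hypothetical: the dateline "July 24" carries no year, so the caller has to supply one. The month name is parsed with `%B`, which assumes an English locale.

```python
from datetime import datetime

def normalize_date(token, year):
    """Parse a 'Month day' token such as 'July 24' into a date.

    Returns None when the token is not a month name followed by a day.
    The year is an assumption passed in by the caller.
    """
    try:
        return datetime.strptime(f"{token} {year}", "%B %d %Y").date()
    except ValueError:
        return None
```

Calling `normalize_date("July 24", 2013)` gives a `datetime.date` for 24 July 2013, while a non-date token such as "Sri Lanka" gives `None`.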

To get started, you just have to provide a few examples.

Answer 1: (score: 1)

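Yahoo has a placefinder API that can help identify places. In your examples the place always appears at the start of the string, so it might be worth taking the first couple of words and throwing them at the API until you hit its limit: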

http://developer.yahoo.com/boss/geo/
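
The "take the first couple of words" step can be done locally before any lookup. The sketch below assumes that a dateline ends at the first colon or double hyphen followed by whitespace, which holds for the three sample strings but not for arbitrary text. The names `split_dateline` and `place_candidates` are hypothetical, and no API call is made.

```python
import re

# Assumption: the dateline ends at the first ":" or "--" followed by whitespace.
DATELINE_SEP = re.compile(r"\s*(?::|--)\s+")

def split_dateline(text):
    """Return (dateline, body); dateline is None if no separator is found."""
    parts = DATELINE_SEP.split(text, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, text

def place_candidates(dateline, max_words=2):
    """Return leading word prefixes of the dateline, longest first."""
    words = dateline.replace(",", " ").split()[:max_words]
    return [" ".join(words[:n]) for n in range(len(words), 0, -1)]
```

For "KARACHI, July 24 -- Police claimed" the dateline is "KARACHI, July 24" and the candidates are "KARACHI July", then "KARACHI". Each candidate could then be checked against a gazetteer or a place lookup of your choice.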

It is also worth looking at the dreaded REGEX to identify the capitalized words: Regular expression for checking if capital letters are found consecutively in a string?

Good luck!
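
As a rough illustration of the regex idea, the pattern below matches a run of all-uppercase words at the very start of a string. It is a sketch rather than the pattern from the linked question, it assumes ASCII letters only, and the name `leading_caps` is hypothetical.

```python
import re

# One or more all-uppercase words (two or more letters each) at the start,
# separated by single spaces. ASCII letters only.
LEADING_CAPS = re.compile(r"^[A-Z]{2,}(?: [A-Z]{2,})*\b")

def leading_caps(text):
    """Return the leading run of uppercase words, or None if there is none."""
    match = LEADING_CAPS.match(text)
    return match.group(0) if match else None
```

On the sample strings this returns "ISLAMABAD", "KARACHI", and "ALUM KULAM". It does not capture "Sri Lanka", which is written in mixed case.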