I have many strings like the following:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
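How can I use NLTK to strip the dateline part and recognize the date, the location, and person names?
With POS tagging I can find the parts of speech, but I need to identify locations, dates, and person names. How can I do that?
Update:
Note: I don't want to make another HTTP request. I need to parse this with my own code. If there is a library for it, I can use that.
Update
I tried ne_chunk, but no luck.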
import nltk

def pchunk(t):
    # Tokenize, POS-tag, then run the named-entity chunker.
    w_tokens = nltk.word_tokenize(t)
    pt = nltk.pos_tag(w_tokens)
    ne = nltk.ne_chunk(pt)
    print(ne)

# txts is a list of those 3 sentences.
for t in txts:
    print(t)
    pchunk(t)
The output is as follows:
ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that National Accountab
(S
  ISLAMABAD/NNP
  :/:
  Chief/NNP
  Justice/NNP
  (PERSON Iftikhar/NNP Muhammad/NNP Chaudhry/NNP)
  said/VBD
  that/IN
  (ORGANIZATION National/NNP Accountab/NNP))
KARACHI, July 24 -- Police claimed to have arrested several suspects in separate
(S
  (GPE KARACHI/NNP)
  ,/,
  July/NNP
  24/CD
  --/:
  Police/NNP
  claimed/VBD
  to/TO
  have/VB
  arrested/VBN
  several/JJ
  suspects/NNS
  in/IN
  separate/JJ)
ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to blot out the scorchin
(S
  (GPE ALUM/NN)
  (ORGANIZATION KULAM/NN)
  ,/,
  (PERSON Sri/NNP Lanka/NNP)
  --/:
  As/IN
  gray-bellied/JJ
  clouds/NNS
  started/VBN
  to/TO
  blot/VB
  out/RP
  the/DT
  scorchin/NN)
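Look closely. KARACHI is recognized correctly as a GPE, but Sri Lanka is recognized as a PERSON, ALUM KULAM is split into a GPE and an ORGANIZATION, and ISLAMABAD is only tagged NNP instead of being recognized as a GPE.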
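Answer 0 (score: 1)
Answer 1 (score: 1)
Yahoo has a PlaceFinder API that can help identify places. It looks like the place names always come at the start of the string, so it may be worth taking the first couple of words and throwing them at the API until you hit its rate limit: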
http://developer.yahoo.com/boss/geo/
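Since the question rules out another HTTP request, the same "place comes first" observation can also be used locally. Here is a minimal sketch, not a definitive solution. It assumes that the dateline ends at the first " -- " or ": " and that dates look like "July 24". The names split_dateline, DATELINE_END and MONTH_DAY are made up for this example.

```python
import re

# Assumption: the dateline ends at the first " -- " or ": " in the string.
DATELINE_END = re.compile(r"\s--\s|:\s")

# Assumption: dates are an English month name followed by a day number.
MONTH_DAY = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2}$"
)


def split_dateline(text):
    """Return (places, date, body); places is a list, date may be None."""
    match = DATELINE_END.search(text)
    if match is None:
        # No separator found: treat the whole string as body text.
        return [], None, text
    dateline, body = text[:match.start()], text[match.end():]
    places, date = [], None
    # Comma-separated dateline parts are either a date or a place name.
    for part in (p.strip() for p in dateline.split(",")):
        if MONTH_DAY.match(part):
            date = part
        elif part:
            places.append(part)
    return places, date, body


if __name__ == "__main__":
    for s in [
        "ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry said that",
        "KARACHI, July 24 -- Police claimed to have arrested several",
        "ALUM KULAM, Sri Lanka -- As gray-bellied clouds started to",
    ]:
        print(split_dateline(s))
```

This only splits on punctuation, so a string that has no dateline but contains ": " somewhere in its body would be split in the wrong place. Restricting the search to the first few dozen characters is one possible guard.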
It may also be worth using the dreaded regex to pick out runs of capital letters: Regular expression for checking if capital letters are found consecutively in a string?
Good luck!
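As a rough illustration of the regex idea, the sketch below grabs the run of all-caps words at the very start of a string. The pattern and the name leading_caps are assumptions made up for this example, not taken from the linked question, and the pattern only covers ASCII capitals.

```python
import re

# One or more all-caps words (two letters or longer) at the start of the
# string, separated by single spaces.
# Assumption: the city in the dateline is written entirely in ASCII capitals.
LEADING_CAPS = re.compile(r"^[A-Z]{2,}(?: [A-Z]{2,})*\b")


def leading_caps(text):
    """Return the leading run of all-caps words, or None if there is none."""
    match = LEADING_CAPS.match(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(leading_caps("ISLAMABAD: Chief Justice Iftikhar Muhammad Chaudhry"))
    print(leading_caps("ALUM KULAM, Sri Lanka -- As gray-bellied clouds"))
    print(leading_caps("As gray-bellied clouds started to blot out"))
```

Mixed-case names such as "Sri Lanka" are not matched by this pattern, so it is best treated as a first pass for the city name only.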