Python - Group sequential array members

Time: 2016-04-18 07:50:01

Tags: python nlp nltk stanford-nlp opennlp

I want to edit my text like this:

arr = [] 
# arr is full of tokenized words from my text

For example:

"Abraham Lincoln Hotel is very beautiful place and i want to go there with
 Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

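Edit: Basically, I want to detect proper names and group them using istitle() and isalpha() in a for statement, like:

for word in arr:
    if word.istitle() and word.isalpha():
        pass  # group this word with the neighbouring capitalized words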

In the example, words keep being appended until the next word does not start with an uppercase letter.

arr[0:3] = [" ".join(arr[0:3])]
# arr[0] is now 'Abraham Lincoln Hotel'

This is the new arr I want:

['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with ['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'], ['Reebok'].

"Also" is not a problem for me; it will be useful when I try to match against my data set.

2 answers:

Answer 0: (score: 1)

You can do something like this:

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100
proper_nouns = []
for idx, word in enumerate(all_words):
    if(word.istitle() and word.isalpha()):
        if(last_word_index == idx-1):
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)

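This code will:

  • split the sentence into a list of words
  • iterate over all the words and, for each capitalized alphabetic word:
    • if the previous word was also capitalized, join it onto the last entry in the list
    • otherwise store the word as a new entry in the list
    • record the index at which the capitalized word was found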
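
One caveat: word.isalpha() is False for tokens with punctuation attached, such as "Palvin." or "Adidas,", so the snippet above skips them. Below is a minimal sketch of the same steps that strips surrounding punctuation first. The helper name group_capitalized is hypothetical (not from the question), and it assumes that punctuation before or after a word should break a group.

```python
import string

def group_capitalized(sentence):
    """Group runs of adjacent capitalized words into single entries."""
    groups = []
    extend_last = False  # True while the previous token may be extended
    for token in sentence.split():
        word = token.strip(string.punctuation)
        if word.istitle() and word.isalpha():
            # Assumption: leading punctuation (",Nike") starts a new group.
            if extend_last and token[0] not in string.punctuation:
                groups[-1] += " " + word
            else:
                groups.append(word)
            # Assumption: trailing punctuation ("Palvin.") ends the group.
            extend_last = token[-1] not in string.punctuation
        else:
            extend_last = False
    return groups

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go "
            "there with Barbara Palvin. Also there are stores like "
            "Adidas, Nike, Reebok.")
print(group_capitalized(sentence))
```

For the example sentence this keeps "Barbara Palvin" together and stores the three brand names as separate entries, because the commas between them break the run.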

Answer 1: (score: 0)

Is this what you are asking for?

sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."

chars = ".!?,"                                   # Characters you want to remove from the words in the array

table = str.maketrans(chars, " " * len(chars))   # Create a table for replacing characters
sentence = sentence.translate(table)             # Replace characters with spaces

arr = sentence.split()                           # Split the string into an array wherever whitespace occurs

print(arr)

The output is:

['Abraham',
 'Lincoln',
 'Hotel',
 'is',
 'very',
 'beautiful',
 'place',
 'and',
 'i',
 'want',
 'to',
 'go',
 'there',
 'with',
 'Barbara',
 'Palvin',
 'Also',
 'there',
 'are',
 'stores',
 'like',
 'Adidas',
 'Nike',
 'Reebok']

Note about this code: any character in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
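
As a side note, the table above replaces each character with a space rather than deleting it. str.maketrans also accepts a third argument listing characters to delete outright. The sketch below compares the two forms on a made-up string (the variable names are mine), to show why replacing with a space is the safer choice before calling split():

```python
text = "Adidas,Nike , Reebok."
chars = ".!?,"

# Two-argument form: map every character in chars to a space.
to_space = str.maketrans(chars, " " * len(chars))
# Three-argument form: characters in the third argument are deleted.
to_nothing = str.maketrans("", "", chars)

replaced = text.translate(to_space).split()
deleted = text.translate(to_nothing).split()

print(replaced)  # 'Adidas' and 'Nike' stay separate words
print(deleted)   # 'Adidas' and 'Nike' are fused into one token
```

Deleting the comma glues "Adidas,Nike" into a single token, whereas replacing it with a space keeps the two words apart.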

To remove the non-names, do this:

import string
new_arr = []

for i in arr:
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)

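This code will include all words that start with a capital letter, including words such as "Also" that are capitalized only because they begin a sentence.

To fix this, you need to change chars so that sentence-ending punctuation stays attached to the words: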

chars = ","

and change the code above to:

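import string
new_arr = []
end = ".!?"

for idx, word in enumerate(arr):
    # A word starts a sentence if the previous word ends with . ! or ?
    starts_sentence = idx > 0 and arr[idx - 1][-1] in end
    if word[0] in string.ascii_uppercase and not starts_sentence:
        new_arr.append(word)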

This will output:

['Abraham', 
'Lincoln', 
'Hotel', 
'Barbara', 
'Palvin.', 
'Adidas', 
'Nike',
'Reebok.']
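
The entries 'Palvin.' and 'Reebok.' still carry their sentence-ending periods. If those are unwanted, one option is to strip them afterwards with str.rstrip. A short sketch, with the list hard-coded so it runs on its own (the name cleaned is mine):

```python
new_arr = ['Abraham', 'Lincoln', 'Hotel', 'Barbara',
           'Palvin.', 'Adidas', 'Nike', 'Reebok.']
end = ".!?"

# Remove any trailing sentence-ending punctuation from each word.
cleaned = [word.rstrip(end) for word in new_arr]
print(cleaned)
```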