I want to edit my text like this:
arr = []
# arr is full of tokenized words from my text
For example:
"Abraham Lincoln Hotel is very beautiful place and i want to go there with
Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
Edit: Basically, I want to detect proper names and group them inside a for loop using istitle() and isalpha(), like this:
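for i in range(len(arr)):
    if arr[i].istitle() and arr[i].isalpha():
        ...  # merge arr[i] with the capitalized words that follow
In the example arr, the grouping should continue until the next word does not start with a capital letter:
arr[0] = arr[0] + " " + arr[1] + " " + arr[2]
# 'Abraham Lincoln Hotel'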
This is the new arr I want:
['Abraham Lincoln Hotel'] is very beautiful place and i want to go there with
['Barbara Palvin']. ['Also'] there are stores like ['Adidas'], ['Nike'],
['Reebok'].
"Also" is not a problem for me; it will be useful when I try to match against my dataset.
Answer 0 (score: 1)
You can do it like this:
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas, Nike, Reebok."
all_words = sentence.split()
last_word_index = -100  # sentinel: no proper noun seen yet
proper_nouns = []
for idx, word in enumerate(all_words):
    if word.istitle() and word.isalpha():
        if last_word_index == idx - 1:
            # The previous word was also a proper noun: extend the current group
            proper_nouns[-1] = proper_nouns[-1] + " " + word
        else:
            proper_nouns.append(word)
        last_word_index = idx
print(proper_nouns)
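This code will collect the title-cased, purely alphabetic words and join consecutive ones into a single entry.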
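One caveat with the loop above: str.split() leaves punctuation attached to tokens, so "Palvin." or "Adidas," fail isalpha() and are skipped. A possible variant, offered only as a sketch, strips surrounding punctuation before testing each word and treats any punctuation as the end of the current group. The function name group_proper_nouns is hypothetical, and the rule "punctuation closes a group" is an assumption about what the question wants, not something the answer states.

```python
import string


def group_proper_nouns(sentence):
    """Group consecutive capitalized words; punctuation ends a group."""
    groups = []
    current = []  # words of the group currently being built

    def flush():
        # Store the open group (if any) and start a new one
        if current:
            groups.append(" ".join(current))
            current.clear()

    for token in sentence.split():
        word = token.strip(string.punctuation)
        is_name = word.istitle() and word.isalpha()
        # A non-name word, or punctuation in front of the token, closes the group
        if not is_name or token[0] in string.punctuation:
            flush()
        if is_name:
            current.append(word)
        # Punctuation after the token also closes the group
        if token[-1] in string.punctuation:
            flush()
    flush()
    return groups


text = ("Abraham Lincoln Hotel is very beautiful place and i want to go there with "
        "Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok.")
print(group_proper_nouns(text))
# ['Abraham Lincoln Hotel', 'Barbara Palvin', 'Also', 'Adidas', 'Nike', 'Reebok']
```

Names that are not title-cased in the str.istitle() sense (for example "McDonald") would still be missed by this check.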
Answer 1 (score: 0)
Is this what you are asking for?
sentence = "Abraham Lincoln Hotel is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok."
chars = ".!?,"  # Characters you want to remove from the words in the array
table = str.maketrans(chars, " " * len(chars))  # Create a table for replacing characters
sentence = sentence.translate(table)  # Replace characters with spaces
arr = sentence.split()  # Split the string into an array wherever whitespace occurs
print(arr)
The output is:
['Abraham',
'Lincoln',
'Hotel',
'is',
'very',
'beautiful',
'place',
'and',
'i',
'want',
'to',
'go',
'there',
'with',
'Barbara',
'Palvin',
'Also',
'there',
'are',
'stores',
'like',
'Adidas',
'Nike',
'Reebok']
Note about this code: any character in the chars variable will be removed from the strings in the array. The explanation is in the code comments.
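As an alternative to building a translation table, the same word list can probably be obtained with a regular expression. This is a different technique from the one used in this answer, and the sketch assumes that only ASCII letters count as part of a word:

```python
import re

sentence = ("Abraham Lincoln Hotel is very beautiful place and i want to go there with "
            "Barbara Palvin. Also there are stores like Adidas ,Nike , Reebok.")

# Every maximal run of ASCII letters becomes one token; punctuation is dropped
tokens = re.findall(r"[A-Za-z]+", sentence)
print(tokens[:3])   # ['Abraham', 'Lincoln', 'Hotel']
print(tokens[-3:])  # ['Adidas', 'Nike', 'Reebok']
```

Unlike the chars approach, this drops every non-letter character, so there is no list of punctuation to maintain, but it also splits words that contain digits or apostrophes.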
To remove the non-names, do the following:
import string

new_arr = []
for i in arr:
    # Keep only the words whose first letter is uppercase
    if i[0] in string.ascii_uppercase:
        new_arr.append(i)
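This code will include every word that starts with a capital letter, including sentence-initial words such as "Also" that are not names.
To fix this, you need to change chars to: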
chars = ","
and change the code above to:
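import string

new_arr = []
end = ".!?"
for idx, word in enumerate(arr):
    # A word that directly follows sentence-ending punctuation is capitalized
    # only because it starts a sentence, so it is skipped
    follows_sentence_end = idx > 0 and arr[idx - 1][-1] in end
    if word[0] in string.ascii_uppercase and not follows_sentence_end:
        new_arr.append(word)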
This will output:
['Abraham',
'Lincoln',
'Hotel',
'Barbara',
'Palvin.',
'Adidas',
'Nike',
'Reebok.']
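The names in this result still carry the sentence-ending punctuation ('Palvin.', 'Reebok.'). One way to clean them up afterwards, sketched here with the list copied from the output above and a hypothetical variable name cleaned, is str.rstrip():

```python
# Result of the filtering step, copied from the output above
new_arr = ['Abraham', 'Lincoln', 'Hotel', 'Barbara', 'Palvin.', 'Adidas', 'Nike', 'Reebok.']
end = ".!?"

# rstrip(end) removes any trailing '.', '!' or '?' and leaves other words unchanged
cleaned = [word.rstrip(end) for word in new_arr]
print(cleaned)
# ['Abraham', 'Lincoln', 'Hotel', 'Barbara', 'Palvin', 'Adidas', 'Nike', 'Reebok']
```

The punctuation has to stay in arr until the filtering is done, because the check on the previous word relies on it.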