I currently have a file containing a list like

example = ['Mary had a little lamb' ,
'Jack went up the hill' ,
'Jill followed suit' ,
'i woke up suddenly' ,
'it was a really bad dream...']

"example" is a list of sentences like these, and I want the output to look like:

mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill' ....]

and so on. I need each word of each sentence tokenized separately, so that I can compare every word from a sentence of mod_example (one at a time, using a for loop) with a reference sentence.

I tried this:

for sentence in example:
    text3 = sentence.split()
    print text3

and got the following output:

['it', 'was', 'a', 'really', 'bad', 'dream...']

How do I get this for all the sentences? It keeps overwriting. And yes, also mention whether my approach is right? This should stay a list of sentences with the words tokenized... thanks
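A minimal sketch of the fix for the overwriting, assuming plain str.split is enough: collect each sentence's tokens into a new list instead of rebinding the same variable on every loop iteration:

```python
example = ['Mary had a little lamb',
           'Jack went up the hill',
           'Jill followed suit',
           'i woke up suddenly',
           'it was a really bad dream...']

# Build a new list of token lists instead of rebinding text3 each time.
mod_example = [sentence.split() for sentence in example]
print(mod_example[0])  # ['Mary', 'had', 'a', 'little', 'lamb']
```

Each element of mod_example is then the token list of one sentence, so a for loop can compare it word by word against a reference sentence.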
Answer 0 (score: 19)

You can use the word tokenizer in NLTK (http://nltk.org/api/nltk.tokenize.html) together with a list comprehension, see http://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
>>> from nltk.tokenize import word_tokenize
>>> example = ['Mary had a little lamb' ,
... 'Jack went up the hill' ,
... 'Jill followed suit' ,
... 'i woke up suddenly' ,
... 'it was a really bad dream...']
>>> tokenized_sents = [word_tokenize(i) for i in example]
>>> for i in tokenized_sents:
... print i
...
['Mary', 'had', 'a', 'little', 'lamb']
['Jack', 'went', 'up', 'the', 'hill']
['Jill', 'followed', 'suit']
['i', 'woke', 'up', 'suddenly']
['it', 'was', 'a', 'really', 'bad', 'dream', '...']
Answer 1 (score: 1)

I made this script so that everyone can understand how tokenization works and can build their own natural language processing engine.
import re
from contextlib import redirect_stdout
from io import StringIO
example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
def token_to_sentence(str):
    f = StringIO()
    with redirect_stdout(f):
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', str)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
        first_step_to_sentence = (f.getvalue()).split('\n')
    g = StringIO()
    with redirect_stdout(g):
        for i in first_step_to_sentence:
            try:
                regex_to_clear_sentence = re.search(r'\s([\w\s]{0,})', i)
                print(regex_to_clear_sentence.group(1))
            except:
                print(i)
        sentence = (g.getvalue()).split('\n')
    return sentence

def token_to_words(str):
    f = StringIO()
    with redirect_stdout(f):
        for i in str:
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
    words = (f.getvalue()).split('\n')
    return words
I took a different approach and started the process over from a paragraph, so that everyone gets an idea of how text processing works. The paragraph to process is:
example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
Tokenize the paragraph into sentences:

sentence = token_to_sentence(example)

will give:

['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']

Tokenize into words:

words = token_to_words(sentence)

will give:

['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']
I will explain how it works.

First, I used a regex to capture all the words and the spaces separating the words, until it reaches a punctuation mark; the regex is:

([\w\s]{0,})[^\w\s]{0,}

So the computation takes the words and spaces in brackets:

'(Mary had a little lamb),( Jack went up the hill),( Jill followed suit),( i woke up suddenly),( it was a really bad dream)...'

The result is still not clean and contains some empty-string matches, so I removed them with this script:

[x for x in regex_of_sentence if x != '']

With that, the paragraph is tokenized into sentences, but the sentences are not clean; the result is:

['Mary had a little lamb', ' Jack went up the hill', ' Jill followed suit', ' i woke up suddenly', ' it was a really bad dream']

As you can see, some sentences start with a space. So, to get clean sentences without a leading space, I wrote this regex:

\s([\w\s]{0,})

It yields clean sentences like:

['Mary had a little lamb', 'Jack went up the hill', 'Jill followed suit', 'i woke up suddenly', 'it was a really bad dream']

So we have to run two passes to get a good result.
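The two regex passes above can also be sketched as a single pass — an alternative sketch, not the original author's code, using one findall plus str.strip instead of the second regex:

```python
import re

example = ('Mary had a little lamb, Jack went up the hill, Jill followed suit, '
           'i woke up suddenly, it was a really bad dream...')

# One pass: grab runs of word characters and spaces, then strip the
# leading/trailing whitespace that the second regex pass removed above.
sentences = [m.strip() for m in re.findall(r'[\w\s]+', example) if m.strip()]
print(sentences)
```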
The answer to your question starts from here...

To tokenize the sentences into words, I iterated over the sentences and used a regex to capture just the words while iterating:

([\w]{0,})

and cleared the empty strings again with:

[x for x in regex_of_word if x != '']

So the result is really just a clean list of words:

['Mary', 'had', 'a', 'little', 'lamb', 'Jack', 'went', 'up', 'the', 'hill', 'Jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream']
In the future, to build a good NLP you will need your own phrase database: search whether the sentence contains one of the phrases, and after listing the phrases, the remaining words are plainly just single words.

With this approach I was able to build my own NLP in my language (Indonesian), which really lacks modules.
EDIT:

I didn't see your question about wanting to compare the words. So you have another sentence to compare against... I give you not only a bonus but also how to count the matches.

mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill']

In that case, the steps you have to follow are:

1. modify mod_example
2. compare the first sentence with the words in mod_example
3. do some counting
So the script will be:

import re
from contextlib import redirect_stdout
from io import StringIO

example = 'Mary had a little lamb, Jack went up the hill, Jill followed suit, i woke up suddenly, it was a really bad dream...'
mod_example = ["'Mary' 'had' 'a' 'little' 'lamb'" , 'Jack' 'went' 'up' 'the' 'hill']

def token_to_sentence(str):
    f = StringIO()
    with redirect_stdout(f):
        regex_of_sentence = re.findall(r'([\w\s]{0,})[^\w\s]', str)
        regex_of_sentence = [x for x in regex_of_sentence if x != '']
        for i in regex_of_sentence:
            print(i)
        first_step_to_sentence = (f.getvalue()).split('\n')
    g = StringIO()
    with redirect_stdout(g):
        for i in first_step_to_sentence:
            try:
                regex_to_clear_sentence = re.search(r'\s([\w\s]{0,})', i)
                print(regex_to_clear_sentence.group(1))
            except:
                print(i)
        sentence = (g.getvalue()).split('\n')
    return sentence

def token_to_words(str):
    f = StringIO()
    with redirect_stdout(f):
        for i in str:
            regex_of_word = re.findall(r'([\w]{0,})', i)
            regex_of_word = [x for x in regex_of_word if x != '']
            for word in regex_of_word:
                print(word)
    words = (f.getvalue()).split('\n')
    return words

def convert_to_words(str):
    sentences = token_to_sentence(str)
    for i in sentences:
        word = token_to_words(i)
    return word

def compare_list_of_words__to_another_list_of_words(from_strA, to_strB):
    fromA = list(set(from_strA))
    for word_to_match in fromA:
        totalB = len(to_strB)
        number_of_match = (to_strB).count(word_to_match)
        data = str((number_of_match / totalB) * 100)
        print('words: -- ' + word_to_match + ' --' + '\n'
              '    number of match : ' + str(number_of_match) + ' from ' + str(totalB) + '\n'
              '    percent of match : ' + data + ' percent')

#preparation already made, now we will use it. The process starts with the script below:
if __name__ == '__main__':
    #tokenize the paragraph in example into sentences:
    getsentences = token_to_sentence(example)
    #tokenize the sentences into words (sentences in getsentences)
    getwords = token_to_words(getsentences)
    #compare the list of words in getwords with the list of words in mod_example
    compare_list_of_words__to_another_list_of_words(getwords, mod_example)
Answer 2 (score: 1)
first_split = []
for i in example:
    first_split.append(i.split())

second_split = []
for j in first_split:
    for k in j:
        second_split.append(k.split())

final_list = []
for m in second_split:
    for n in m:
        if(n not in final_list):
            final_list.append(n)
print(final_list)
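The three loops above deduplicate while keeping the order in which words first appear; a shorter sketch of the same idea (not part of the original answer) using dict.fromkeys:

```python
example = ['Mary had a little lamb',
           'Jack went up the hill',
           'Jill followed suit',
           'i woke up suddenly',
           'it was a really bad dream...']

# dict keys keep insertion order (Python 3.7+), so dict.fromkeys yields
# the unique words in the order they first appear.
final_list = list(dict.fromkeys(w for s in example for w in s.split()))
print(final_list)
```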
Answer 3 (score: 0)

It is hard for me to tell what you are trying to do.

How about this
exclude = set(['Mary', 'Jack', 'Jill', 'i', 'it'])

mod_example = []
for sentence in example:
    words = sentence.split()
    # Optionally sort out some words
    for word in words:
        if word in exclude:
            words.remove(word)
    mod_example.append('\'' + '\' \''.join(words) + '\'')

print mod_example
Which outputs
["'had' 'a' 'little' 'lamb'", "'went' 'up' 'the' 'hill'", "'followed' 'suit'",
"'woke' 'up' 'suddenly'", "'was' 'a' 'really' 'bad' 'dream...'"]
>>>
EDIT: Another suggestion based on further information given by the OP
example = ['Area1 Area1 street one, 4454 hikoland' ,
           'Area2 street 2, 52432 hikoland, area2' ,
           'Area3 ave three, 0534 hikoland' ]

mod_example = []
for sentence in example:
    words = sentence.split()
    # Sort out some words
    col1 = words[0]
    col2 = words[1:]
    if col1 in col2:
        col2.remove(col1)
    elif col1.lower() in col2:
        col2.remove(col1.lower())
    mod_example.append(col1 + ': ' + ' '.join(col2))
Output

>>> print mod_example
['Area1: street one, 4454 hikoland', 'Area2: street 2, 52432 hikoland,',
 'Area3: ave three, 0534 hikoland']
>>>
Answer 4 (score: 0)

You can use nltk (as @alvas suggests) and a recursive function which takes any object and tokenizes each str in it:
from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):  # basestring in python 2.7
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj  # Or throw an exception, or parse a dict...
Usage:
data = [["Lorem ipsum dolor. Sit amet?", "Hello World!", None], ["a"], "Hi!", None, ""]
print(tokenize(data))
Output:
[[['Lorem', 'ipsum', 'dolor', '.', 'Sit', 'amet', '?'], ['Hello', 'World', '!'], None], [['a']], ['Hi', '!'], None, []]
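The same recursive pattern extends to dicts, as the final comment hints. A dependency-free sketch, with word_tokenize swapped out for str.split as a stand-in tokenizer so it runs without NLTK:

```python
def tokenize(obj, tok=str.split):
    # Recurse through None, strings, lists and dicts, tokenizing each str.
    if obj is None:
        return None
    elif isinstance(obj, str):
        return tok(obj)
    elif isinstance(obj, list):
        return [tokenize(i, tok) for i in obj]
    elif isinstance(obj, dict):
        return {k: tokenize(v, tok) for k, v in obj.items()}
    else:
        return obj

data = {"title": "Hello World", "tags": [["a b"], None]}
print(tokenize(data))  # {'title': ['Hello', 'World'], 'tags': [[['a', 'b']], None]}
```

Passing nltk.tokenize.word_tokenize as tok would recover the behavior of the answer above.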
Answer 5 (score: 0)

In spaCy it is very simple:
import spacy

example = ['Mary had a little lamb' ,
           'Jack went up the hill' ,
           'Jill followed suit' ,
           'i woke up suddenly' ,
           'it was a really bad dream...']

nlp = spacy.load("en_core_web_sm")

result = []
for line in example:
    sent = nlp(line)
    token_result = []
    for token in sent:
        token_result.append(token)
    result.append(token_result)

print(result)
The output will be:
[[Mary, had, a, little, lamb], [Jack, went, up, the, hill], [Jill, followed, suit], [i, woke, up, suddenly], [it, was, a, really, bad, dream, ...]]
Answer 6 (score: 0)

This can also be done via PyTorch's torchtext as
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer('basic_english')

example = ['Mary had a little lamb' ,
           'Jack went up the hill' ,
           'Jill followed suit' ,
           'i woke up suddenly' ,
           'it was a really bad dream...']

tokens = []
for s in example:
    tokens += tokenizer(s)

# ['mary', 'had', 'a', 'little', 'lamb', 'jack', 'went', 'up', 'the', 'hill', 'jill', 'followed', 'suit', 'i', 'woke', 'up', 'suddenly', 'it', 'was', 'a', 'really', 'bad', 'dream', '.', '.', '.']