I want to tokenize an input file in Python.
Please advise me; I am a new Python user.
I have read a little about regular expressions but am still somewhat confused, so please suggest any links or a code outline for this.
Answer 0: (score: 9)
Try something like this:
import nltk

# read the whole file into one string and split it into word tokens
with open("myfile.txt") as f:
    file_content = f.read()
tokens = nltk.word_tokenize(file_content)
print(tokens)
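If word_tokenize() raises a LookupError the first time you call it, the tokenizer models need a one-time download via NLTK's built-in downloader:

import nltk
nltk.download('punkt')  # one-time download of the Punkt tokenizer models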
The NLTK tutorial also has plenty of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
Answer 1: (score: 0)
NLTK

If your file is small: open it inside a with open(...) as x block, read the whole contents at once with .read(), and pass the result to word_tokenize().
[Code]:
from nltk.tokenize import word_tokenize

with open('myfile.txt') as fin:
    tokens = word_tokenize(fin.read())
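For example, if myfile.txt contains "Hello, world!", tokens comes out as ['Hello', ',', 'world', '!'].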
If your file is larger: iterate over it inside a with open(...) as x block and call word_tokenize() one line at a time, so the whole file never has to sit in memory.
[Code]:
from __future__ import print_function
from nltk.tokenize import word_tokenize

# tokenize line by line and write the tokens to a separate output file
with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        tokens = word_tokenize(line)
        print(' '.join(tokens), end='\n', file=fout)
Otherwise, you can do the same with spaCy (here a blank English pipeline supplies the shared vocabulary):

from __future__ import print_function
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank('en')           # blank English pipeline, used only for its vocab
tokenizer = Tokenizer(nlp.vocab)

with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
    for line in fin:
        doc = tokenizer(line)     # a spaCy Tokenizer is called directly and returns a Doc
        print(' '.join(token.text for token in doc), end='\n', file=fout)
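Note that a Tokenizer built from just the vocab has no punctuation rules, so it effectively splits on whitespace only; if you have a full model installed (e.g. en_core_web_sm), loading it with spacy.load() and using its nlp.tokenizer gives spaCy's complete rule-based tokenization instead.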
Answer 2: (score: 0)
with open ("file.txt", "r") as f1:
data=str(f1.readlines())
sent_tokenize(data)
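Since the question mentions regular expressions: here is a minimal sketch using only the standard-library re module, assuming word and punctuation tokens are wanted (the pattern is one common choice, not the only one):

import re

with open("myfile.txt") as f:
    text = f.read()

# \w+ matches runs of letters/digits/underscores; [^\w\s] matches single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)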