I'm new to Python. I wrote a simple script that does the following: it writes tokenization results into two files. One holds the Latin-character words (English, Spanish, etc.) and the other holds everything else (Greek, etc.).
The problem is that when I open a Greek URL and extract the Greek text, I get it as a sequence of characters rather than as words (unlike the Latin case).
I expect to get a list of words (μαρια, γιωργος, παιδι) with 3 items, but what I actually get is ('μ', 'α', 'ρ', 'ι', 'α', ...), with as many items as there are letters.
What should I do? (The encoding is UTF-8.)
The code follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Importing useful libraries
# NOTE: Nltk should be installed first!!!
import nltk
import urllib  # could also be urllib
import re
import lxml.html.clean
import unicodedata
from urllib import urlopen

http = "http://"
www = "www."
#pattern = r'[^\a-z0-9]'

# Demand url from the user
url = str(raw_input("Please, give a url and then press ENTER: \n"))
# Construct a valid url syntax
if (url.startswith("http://")) == False:
    if (url.startswith("www")) == False:
        msg = str(raw_input("Does it need 'www'? Y/N \n"))
        if (msg == 'Y') | (msg == 'y'):
            url = http + www + url
        elif (msg == 'N') | (msg == 'n'):
            url = http + url
        else:
            print "You should type 'y' or 'n'"
    else:
        url = http + url

latin_file = open("Latin_words.txt", "w")
greek_file = open("Other_chars.txt", "w")
latin_file.write(url + '\n')
latin_file.write("The latin words of the above url are the following:" + '\n')
greek_file.write("Οι ελληνικές λέξεις καθώς και απροσδιόριστοι χαρακτήρες")

# Reading the given url
raw = urllib.urlopen(url).read()
# Retrieve the html body from the url. Clean it from html special characters
pure = nltk.clean_html(raw)
text = pure
# Retrieve the words (tokens) of the html body in a list
tokens = nltk.word_tokenize(text)
counter = 0
greeks = 0
for i in tokens:
    if re.search('[^a-zA-Z]', i):
        #greeks += 1
        greek_file.write(i)
    else:
        if len(i) >= 4:
            print i
            counter += 1
            latin_file.write(i + '\n')
        else:
            del i

# Print the number of words that I shall take as a result
print "The number of latin tokens is: %d" % counter
latin_file.write("The number of latin tokens is: %d and the number of other characters is: %d" % (counter, greeks))
latin_file.close()
greek_file.close()
I have checked it in many ways, and as far as I can tell, the program only recognizes Greek characters but not Greek words; that is, it does not recognize the spaces which separate the words! If I type a Greek sentence with spaces into the terminal, it works correctly. The problem appears when I read something, e.g. the body of an HTML page.
Also, in text_file.write(i) for a Greek i: if I write text_file.write(i + '\n'), the result is unrecognized characters; in other words, I lose my encoding!
Any ideas about the above?
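A minimal sketch of the symptom described above (with the sample words from the question, not the real page text): iterating over a Python string, or feeding it to something that processes it character by character, yields single characters, whereas whitespace tokenization yields whole words.

```python
# -*- coding: utf-8 -*-
# Sample text standing in for the asker's real page content.
text = u"μαρια γιωργος παιδι"

# Iterating a string yields one character at a time -> as many items as letters.
chars = list(text)

# Whitespace tokenization yields whole words -> 3 items.
words = text.split()

print(len(words))
print(words)
```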
Answer 0 (score: 0)
I think the problem is here: if re.search('[^a-zA-Z]', i) searches for a matching substring rather than testing the whole string. You can check this by looping over the list of tokens.
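A short sketch of the point above, with made-up sample tokens: re.search() succeeds if any single character of the token falls outside a-zA-Z, while an anchored re.match() tests the token as a whole.

```python
import re

tokens = [u"cat", u"παιδι", u"niño"]  # made-up sample tokens

for t in tokens:
    # search() succeeds if ANY character of the token is outside a-zA-Z,
    # so a single accented letter is enough to trigger it.
    contains_non_ascii = re.search(u'[^a-zA-Z]', t) is not None
    # match() anchored with $ tests whether the WHOLE token is ASCII letters.
    is_pure_ascii = re.match(u'[a-zA-Z]+$', t) is not None
    print(t, contains_non_ascii, is_pure_ascii)
```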
Answer 1 (score: 0)
Python's re module is notorious for its weak Unicode support. For serious Unicode work, consider the alternative regex module, which fully supports Unicode scripts and properties. For example:
text = u"""
Some latin words, for example: cat niño määh fuß
Οι ελληνικές λέξεις καθώς και απροσδιόριστοι χαρακτήρες
"""
import regex
latin_words = regex.findall(ur'\p{Latin}+', text)
greek_words = regex.findall(ur'\p{Greek}+', text)
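If installing the third-party regex module is not an option, the standard re module can approximate the same split with explicit Unicode code-point ranges; this is only a sketch, since hard-coded ranges (here the Greek and Coptic block U+0370–U+03FF, and a rough Latin range) cover less than the \p{Script} properties do.

```python
import re

text = u"cat niño Οι ελληνικές λέξεις"

# Rough Latin range: ASCII letters plus Latin-1 Supplement / Latin Extended.
# (This is an approximation; it also admits a few non-letter code points.)
latin_words = re.findall(u'[a-zA-Z\u00c0-\u024f]+', text)

# Greek and Coptic block; add \u1f00-\u1fff to cover polytonic Greek.
greek_words = re.findall(u'[\u0370-\u03ff]+', text)

print(latin_words)
print(greek_words)
```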
Answer 2 (score: 0)
Here is a simplified version of your code. It uses the excellent requests library to fetch the URL, a with statement to close the files automatically, and io to help with UTF-8:
import io
import nltk
import requests
import string
url = raw_input("Please, give a url and then press ENTER: \n")
if not url.startswith('http://'):
    url = 'http://' + url

page_text = requests.get(url).text
tokens = nltk.word_tokenize(page_text)
latin_words = [w for w in tokens if w.isalpha()]
greek_words = [w for w in tokens if w not in latin_words]

print 'The number of latin tokens is {0}'.format(len(latin_words))

with io.open('latin_words.txt', 'w', encoding='utf8') as latin_file, \
     io.open('greek_words.txt', 'w', encoding='utf8') as greek_file:
    greek_file.writelines(greek_words)
    latin_file.writelines(latin_words)
    latin_file.write('The number of latin words is {0} and the number of others {1}\n'.format(len(latin_words), len(greek_words)))
I simplified the part that checks the URL; as written, it cannot read an invalid URL.
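On the encoding complaint in the question: writing Unicode tokens through a file opened with io.open(..., encoding='utf8') keeps the Greek intact, including the '\n' separators. A minimal sketch, using the sample words from the question:

```python
# -*- coding: utf-8 -*-
import io

greek_words = [u"μαρια", u"γιωργος", u"παιδι"]  # sample tokens

# io.open encodes Unicode text to UTF-8 on write, so Greek words joined
# with '\n' round-trip without turning into garbage.
with io.open("greek_words.txt", "w", encoding="utf8") as f:
    f.write(u"\n".join(greek_words) + u"\n")

with io.open("greek_words.txt", "r", encoding="utf8") as f:
    print(f.read().split())
```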