Setting the locale encoding in Python

Asked: 2012-06-26 10:41:18

Tags: java python encoding locale subprocess

I call a Java program from my Python code as follows:

subprocess.check_output(["java", "-classpath", "/Users/feralvam/Programas/semanticvectors-3.4/semanticvectors-3.4.jar:/Users/feralvam/Programas/lucene-3.5.0/lucene-core-3.5.0.jar:/Users/feralvam/Programas/lucene-3.5.0/contrib/demo/lucene-demo-3.5.0.jar:", "pitt.search.semanticvectors.CompareTerms", "-queryvectorfile","/Users/feralvam/termvectors.bin",term1,term2])

"term1" and "term2" are strings read from a UTF-8 encoded text file.

When I run this from PyDev (version 2.5, in Eclipse 3.7.2), I get the following output (here, "term1" = "Eles" and "term2" = "é"):

Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: /Users/feralvam/termvectors.bin
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at 
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms will have same weight.
Didn't find vector for 'Eles'
No vector for 'Eles'
Didn't find vector for '??'
No vector for '??'
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Eles" with "??" ...

But if I run the same command from the terminal, I get:

Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: /Users/feralvam/termvectors.bin
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at 
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms will have same weight.
Didn't find vector for 'Eles'
No vector for 'Eles'
Found vector for 'é'
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Eles" with "é" ...

Regardless of how SemanticVectors works, the problem is that in the second case "term2" is passed with the correct encoding, while in the first case it is not.

Now, using this command:

print locale.getpreferredencoding(), sys.getdefaultencoding()

I get the following: US-ASCII utf-8 (in PyDev) and UTF-8 ascii (in the terminal).
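To compare the two environments more fully, it may help to print the locale-related environment variables alongside the encodings. The following is only a sketch: `locale_report` is a hypothetical helper name, not something from the original code, and the set of values collected is an assumption about what is relevant here.

```python
import locale
import os
import sys


def locale_report(environ=None):
    """Collect settings that influence how text is encoded (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    report = {
        "preferred_encoding": locale.getpreferredencoding(),
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
    }
    # These variables are consulted when the locale is resolved.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        report[name] = environ.get(name)
    return report


for key, value in sorted(locale_report().items()):
    print("%s = %s" % (key, value))
```

Running it once from PyDev and once from the terminal should show which of these values differ between the two.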

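So I think what is happening is that the US-ASCII encoding is being used to pass the arguments, and the result is wrong because the words are not encoded correctly. By the way, I am using Python 2.7.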

Is there any way to change this?

Thanks for any help you can provide.

1 Answer:

Answer 0: (score: 2)

When you start the process, you can pass the locale name in the LANG environment variable. Do something like:

import os
import subprocess

env = os.environ.copy()  # copy, so the parent's own environment is left untouched
env['LANG'] = 'en_US.UTF-8'
subprocess.check_output(..., env=env)
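A self-contained version of the same idea might look like the sketch below. `utf8_env` is a hypothetical helper name, and the child process is a stand-in (the Python interpreter itself) rather than the Java command from the question. Removing `LC_ALL` and `LC_CTYPE` goes beyond what the answer does; it is added on the assumption that one of them may be set in the parent environment, since on POSIX systems both take precedence over `LANG`.

```python
import os
import subprocess
import sys


def utf8_env(base=None, locale_name="en_US.UTF-8"):
    """Return a copy of the environment with LANG set to a UTF-8 locale."""
    env = dict(os.environ if base is None else base)
    env["LANG"] = locale_name
    # LC_ALL and LC_CTYPE override LANG, so drop them if present (assumption:
    # nothing else in the child depends on them).
    env.pop("LC_ALL", None)
    env.pop("LC_CTYPE", None)
    return env


# Stand-in child process: report the LANG value it actually received.
child = [sys.executable, "-c", "import os; print(os.environ.get('LANG'))"]
output = subprocess.check_output(child, env=utf8_env())
print(output.decode("ascii").strip())
```

In the original call, the same `env=utf8_env()` argument would be passed to `subprocess.check_output` together with the Java command list. Whether the JVM then decodes its arguments as UTF-8 depends on that locale being available on the machine.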