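While exploring precisely which characters are permitted in Java identifiers, I stumbled upon something so curious that it seems almost certain to be a bug.
I had expected Java identifiers to conform to the following rule: they start with a character that has the Unicode property ID_Start, followed by characters that have the property ID_Continue, with an exception granted for leading underscores and for dollar signs. That did not prove to be the case, and what I found is at extreme variance with that or with any other notion of a normal identifier I have heard of.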
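To make that expectation concrete, here is a rough Java sketch of the rule I had in mind. It is only an approximation: approxIdStart and approxIdContinue are hypothetical helpers that stand in for ID_Start and ID_Continue by general category (letters and letter numbers, then additionally marks, decimal digits, and connector punctuation), and they ignore the Other_ID_* and Pattern_* adjustments in the real property definitions.

```java
final class ExpectedIdentifierRule {
    // Approximation of ID_Start (an assumption): general categories L and Nl.
    static boolean approxIdStart(int cp) {
        switch (Character.getType(cp)) {
            case Character.UPPERCASE_LETTER:
            case Character.LOWERCASE_LETTER:
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.LETTER_NUMBER:
                return true;
            default:
                return false;
        }
    }

    // Approximation of ID_Continue (an assumption): ID_Start plus Mn, Mc, Nd, Pc.
    static boolean approxIdContinue(int cp) {
        if (approxIdStart(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // The rule I expected: ID_Start (or a leading '_' or '$'),
    // then any number of ID_Continue characters or '$'.
    static boolean matchesExpectedRule(String name) {
        if (name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!(approxIdStart(first) || first == '_' || first == '$')) {
            return false;
        }
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!(approxIdContinue(cp) || cp == '$')) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String withEscape = "var_" + (char) 0x1B;
        System.out.println("var_1 fits the expected rule: " + matchesExpectedRule("var_1"));
        System.out.println("var_<ESC> fits the expected rule: " + matchesExpectedRule(withEscape));
        System.out.println("Java accepts ESC as an identifier part: "
                + Character.isJavaIdentifierPart(0x1B));
    }
}
```

Under this sketch a name ending in ESC fails the expected rule, while java.lang.Character reports ESC as a valid identifier part.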
Consider the following demo, which proves that the ASCII ESC character (octal 033) is permitted in Java identifiers:
$ perl -le 'print qq(public class escape { public static void main(String argv[]) { String var_\033 = "i am escape: \033"; System.out.println(var_\033); }})' > escape.java
$ javac escape.java
$ java escape | cat -v
i am escape: ^[
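It is even worse than that, though. Almost infinitely worse, in fact. Even NULL is permitted! So are thousands of other code points that are not even identifier characters. I have tested this on Solaris, Linux, and a Mac running Darwin, and all give the same results.
Here is a test program that shows all these unexpected code points that Java, quite outrageously, allows as part of a legal identifier name.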
#!/usr/bin/env perl
#
# test-java-idchars - find which bogus code points Java allows in its identifiers
#
# usage: test-java-idchars [low high]
# e.g.: test-java-idchars 0 255
#
# Without arguments, tests Unicode code points
# from 0 .. 0x1000. You may go further with a
# higher explicit argument.
#
# Produces a report at the end.
#
# You can ^C it prematurely to end the program then
# and get a report of its progress up to that point.
#
# Tom Christiansen
# tchrist@perl.com
# Sat Jan 29 10:41:09 MST 2011
use strict;
use warnings;
use encoding "Latin1";
use open IO => ":utf8";
use charnames ();
$| = 1;
my @legal;
my ($start, $stop) = (0, 0x1000);
if (@ARGV != 0) {
    if (@ARGV == 1) {
        for (($stop) = @ARGV) {
            $_ = oct if /^0/;   # support 0OCTAL, 0xHEX, 0bBINARY
        }
    }
    elsif (@ARGV == 2) {
        for (($start, $stop) = @ARGV) {
            $_ = oct if /^0/;
        }
    }
    else {
        die "usage: $0 [ [start] stop ]\n";
    }
}

for my $cp ( $start .. $stop ) {
    my $char = chr($cp);
    next if $char =~ /[\s\w]/;

    my $type = "?";
    for ($char) {
        $type = "Letter"      if /\pL/;
        $type = "Mark"        if /\pM/;
        $type = "Number"      if /\pN/;
        $type = "Punctuation" if /\pP/;
        $type = "Symbol"      if /\pS/;
        $type = "Separator"   if /\pZ/;
        $type = "Control"     if /\pC/;
    }

    my $name = $cp ? (charnames::viacode($cp) || "<missing>") : "NULL";
    next if $name eq "<missing>" && $cp > 0xFF;

    my $msg = sprintf("U+%04X %s", $cp, $name);
    print "testing \\p{$type} $msg...";

    open(TESTPROGRAM, ">:utf8", "testchar.java") || die $!;
    print TESTPROGRAM <<"End_of_Java_Program";
public class testchar {
    public static void main(String argv[]) {
        String var_$char = "variable name ends in $msg";
        System.out.println(var_$char);
    }
}
End_of_Java_Program
    close(TESTPROGRAM) || die $!;

    system q{
        ( javac -encoding UTF-8 testchar.java \
            && \
          java -Dfile.encoding=UTF-8 testchar | grep variable \
        ) >/dev/null 2>&1
    };

    push @legal, sprintf("U+%04X", $cp) if $? == 0;

    if ($? && $? < 128) {
        print "<interrupted>\n";
        exit;   # from a ^C
    }

    printf "is %s in Java identifiers.\n",
        ($? == 0) ? uc "legal" : "forbidden";
}

END {
    print "Legal but evil code points: @legal\n";
}
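Here is an example of running that program on just code points 0 through 0x20, reporting only those that are neither whitespace nor ordinary identifier characters: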
$ perl test-java-idchars 0 0x20
testing \p{Control} U+0000 NULL...is LEGAL in Java identifiers.
testing \p{Control} U+0001 START OF HEADING...is LEGAL in Java identifiers.
testing \p{Control} U+0002 START OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0003 END OF TEXT...is LEGAL in Java identifiers.
testing \p{Control} U+0004 END OF TRANSMISSION...is LEGAL in Java identifiers.
testing \p{Control} U+0005 ENQUIRY...is LEGAL in Java identifiers.
testing \p{Control} U+0006 ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0007 BELL...is LEGAL in Java identifiers.
testing \p{Control} U+0008 BACKSPACE...is LEGAL in Java identifiers.
testing \p{Control} U+000B LINE TABULATION...is forbidden in Java identifiers.
testing \p{Control} U+000E SHIFT OUT...is LEGAL in Java identifiers.
testing \p{Control} U+000F SHIFT IN...is LEGAL in Java identifiers.
testing \p{Control} U+0010 DATA LINK ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+0011 DEVICE CONTROL ONE...is LEGAL in Java identifiers.
testing \p{Control} U+0012 DEVICE CONTROL TWO...is LEGAL in Java identifiers.
testing \p{Control} U+0013 DEVICE CONTROL THREE...is LEGAL in Java identifiers.
testing \p{Control} U+0014 DEVICE CONTROL FOUR...is LEGAL in Java identifiers.
testing \p{Control} U+0015 NEGATIVE ACKNOWLEDGE...is LEGAL in Java identifiers.
testing \p{Control} U+0016 SYNCHRONOUS IDLE...is LEGAL in Java identifiers.
testing \p{Control} U+0017 END OF TRANSMISSION BLOCK...is LEGAL in Java identifiers.
testing \p{Control} U+0018 CANCEL...is LEGAL in Java identifiers.
testing \p{Control} U+0019 END OF MEDIUM...is LEGAL in Java identifiers.
testing \p{Control} U+001A SUBSTITUTE...is LEGAL in Java identifiers.
testing \p{Control} U+001B ESCAPE...is LEGAL in Java identifiers.
testing \p{Control} U+001C INFORMATION SEPARATOR FOUR...is forbidden in Java identifiers.
testing \p{Control} U+001D INFORMATION SEPARATOR THREE...is forbidden in Java identifiers.
testing \p{Control} U+001E INFORMATION SEPARATOR TWO...is forbidden in Java identifiers.
testing \p{Control} U+001F INFORMATION SEPARATOR ONE...is forbidden in Java identifiers.
Legal but evil code points: U+0000 U+0001 U+0002 U+0003 U+0004 U+0005 U+0006 U+0007 U+0008 U+000E U+000F U+0010 U+0011 U+0012 U+0013 U+0014 U+0015 U+0016 U+0017 U+0018 U+0019 U+001A U+001B
Here is another demo:
$ perl test-java-idchars 0x600 0x700 | grep -i legal
testing \p{Control} U+0600 ARABIC NUMBER SIGN...is LEGAL in Java identifiers.
testing \p{Control} U+0601 ARABIC SIGN SANAH...is LEGAL in Java identifiers.
testing \p{Control} U+0602 ARABIC FOOTNOTE MARKER...is LEGAL in Java identifiers.
testing \p{Control} U+0603 ARABIC SIGN SAFHA...is LEGAL in Java identifiers.
testing \p{Control} U+06DD ARABIC END OF AYAH...is LEGAL in Java identifiers.
Legal but evil code points: U+0600 U+0601 U+0602 U+0603 U+06DD
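Can anyone please explain this seemingly insane behavior? There are lots and lots of other inexplicably permitted code points all over the place, starting with U+0000, which is perhaps the strangest of all. If you run the program on the first 0x1000 code points, you do see certain patterns emerge, such as allowing any and all code points with the property Currency_Symbol. But too much else is wholly inexplicable, at least to me.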
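Such patterns can also be tallied in-process rather than by compiling one file per code point. The sketch below is not what the Perl script does: it assumes that javac accepts exactly the code points for which Character.isJavaIdentifierPart returns true, which is consistent with the results above but is an assumption of the sketch. The class name, the helper names, and the category labels are made up for illustration, and isOrdinaryWordChar is only a rough equivalent of Perl's \w.

```java
import java.util.Map;
import java.util.TreeMap;

final class IdentifierPartTally {
    // Rough equivalent of the Perl script's \w filter (an approximation):
    // letters, digits, marks, letter numbers, and connector punctuation.
    static boolean isOrdinaryWordChar(int cp) {
        if (Character.isLetterOrDigit(cp)) {
            return true;
        }
        switch (Character.getType(cp)) {
            case Character.NON_SPACING_MARK:
            case Character.COMBINING_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.LETTER_NUMBER:
            case Character.CONNECTOR_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // Illustrative labels for the general categories that turn up most often.
    static String categoryName(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONTROL:
                return "Cc (control)";
            case Character.FORMAT:
                return "Cf (format)";
            case Character.CURRENCY_SYMBOL:
                return "Sc (currency symbol)";
            default:
                return "other";
        }
    }

    // Count, per category, the unexpected code points in [low, high]
    // that Character.isJavaIdentifierPart nevertheless accepts.
    static Map<String, Integer> tally(int low, int high) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int cp = low; cp <= high; cp++) {
            if (Character.isWhitespace(cp) || isOrdinaryWordChar(cp)) {
                continue;
            }
            if (Character.isJavaIdentifierPart(cp)) {
                counts.merge(categoryName(cp), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        tally(0, 0x1000).forEach((category, count) ->
                System.out.println(category + ": " + count));
    }
}
```

The totals for a wide range depend on the Unicode version bundled with the JDK, so they may differ between Java releases.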
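Answer 0 (score: 15)
The Java Language Specification, section 3.8, defers to Character.isJavaIdentifierStart() and Character.isJavaIdentifierPart(). The latter includes, among its other conditions, Character.isIdentifierIgnorable(), which admits non-whitespace control characters (including the whole C1 range; see that method's documentation for the list).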
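As a quick check of that explanation, a small sketch along these lines asks java.lang.Character directly about a few of the code points from the question. The class name IgnorableProbe, the classify helper, and its labels are made up for illustration; only the Character methods come from the JDK.

```java
final class IgnorableProbe {
    // Label a code point according to the two methods named above.
    static String classify(int cp) {
        if (!Character.isJavaIdentifierPart(cp)) {
            return "forbidden";
        }
        return Character.isIdentifierIgnorable(cp) ? "legal (ignorable)" : "legal";
    }

    public static void main(String[] args) {
        // NULL, VT, ESC, FS, a C1 control, two Arabic format characters, '$', 'a'
        int[] samples = {0x00, 0x0B, 0x1B, 0x1C, 0x85, 0x0600, 0x06DD, '$', 'a'};
        for (int cp : samples) {
            System.out.printf("U+%04X %-18s whitespace=%b%n",
                    cp, classify(cp), Character.isWhitespace(cp));
        }
    }
}
```

The forbidden control characters in the question's output (U+000B and U+001C through U+001F) are the ones Java treats as whitespace, which is why they fall outside the ignorable set.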
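Answer 1 (score: 8)
Another question might be: why shouldn't Java allow control characters in its identifiers?
A good principle when designing a language or any other system is not to forbid anything without good reason, because you never know how it might be used, and the fewer rules implementers and users have to contend with, the better.
True, you certainly shouldn't take advantage of this by actually embedding escapes in your variable names, and you won't see any public classes with null characters in their names.
Of course this can be abused, but it is not the language designer's job to protect programmers from themselves in this way, any more than by enforcing proper indentation or well-chosen variable names.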
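Answer 2 (score: -2)
I don't see what the big deal is. How does it affect you anyway?
If a developer wants to obfuscate his code, he can do that with ASCII.
If a developer wants to make his code understandable, he will use the lingua franca of the industry: English. Not only will the identifiers be ASCII-only, they will also be ordinary English words. Otherwise nobody is going to use or read his code, and he may use whatever crazy characters he likes.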