Question

I have a simple hello world Objective-C lib:

hello.m:

#import <Foundation/Foundation.h>
#import "hello.h"

void sayHello()
{
    #ifdef FRENCH
    NSString *helloWorld = @"Bonjour Monde!\n";
    #else
    NSString *helloWorld = @"Hello World!\n";
    #endif
    NSFileHandle *stdout = [NSFileHandle fileHandleWithStandardOutput];
    NSData *strData = [helloWorld dataUsingEncoding: NSASCIIStringEncoding];
    [stdout writeData: strData];
}

The file hello.h looks like this:

int main (int argc, const char * argv[]);
int sum(int a, int b);
void sayHello();

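It compiles just fine with clang and gcc on OS X and Linux.

Now my question:

When I run a clean compile of hello.m several times with clang on Ubuntu, the generated hello.o can differ between runs. This does not seem to be related to a timestamp, because even after a second or more the generated .o file can have the same checksum again. From my naive point of view, this looks like completely random/unpredictable behavior.

I ran the compilation with -S to inspect the generated assembly code. The assembly code also differs (as expected). A diff comparing the assembly output can be found here: http://pastebin.com/uY1LERGX

At first glance it just looks like the ordering in the assembly code is different.

This does not happen when compiling with gcc.

Is there a way to tell clang to generate exactly the same .o file on every run, like gcc does?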

clang --version: 
Ubuntu clang version 3.0-6ubuntu3 (tags/RELEASE_30/final) (based on LLVM 3.0)

Answer 1

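The property of a compiler always producing the same code from the same input is called Reproducible Builds, or deterministic compilation.

One possible source of instability in a compiler's output is ASLR (Address space layout randomization). Sometimes the compiler, or a library it uses, reads object addresses and uses them, for example as keys of hashes or maps, or to sort objects by address. When the compiler iterates over such a hash, it visits the objects in an order that depends on their addresses, and ASLR places the objects at different addresses on each run. The effect can look like your reordered symbols (the .quads in your diffs).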
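To make the mechanism concrete, here is a minimal C++ sketch. It is not LLVM code: the `Symbol` type and both function names are made up for illustration. Ordering by address depends on where the allocator happened to place each object, which ASLR can change from run to run, whereas ordering by a stable property gives the same result every time.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a compiler-internal object such as a symbol.
struct Symbol {
    std::string name;
};

// Emission order derived from object addresses. The result depends on where
// the allocator placed each Symbol, so it can differ between runs under ASLR.
std::vector<std::string> emitByAddress(const std::vector<std::unique_ptr<Symbol>>& syms) {
    std::vector<const Symbol*> order;
    for (const auto& s : syms) order.push_back(s.get());
    std::sort(order.begin(), order.end(), std::less<const Symbol*>());
    std::vector<std::string> out;
    for (const Symbol* s : order) out.push_back(s->name);
    return out;
}

// Emission order derived from a stable property (the name). The result is the
// same on every run, whatever the addresses happen to be.
std::vector<std::string> emitByName(const std::vector<std::unique_ptr<Symbol>>& syms) {
    std::vector<std::string> out;
    for (const auto& s : syms) out.push_back(s->name);
    std::sort(out.begin(), out.end());
    return out;
}
```

Whether `emitByAddress` actually varies on a given machine depends on the allocator and on ASLR being enabled; the point is only that nothing guarantees its order, while `emitByName` is fixed by the input.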

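You can disable ASLR on Linux globally with echo 0 | sudo tee /proc/sys/kernel/randomize_va_space. A local way to disable ASLR on Linux, affecting only the shell it starts and that shell's children, is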

 setarch `uname -m` -R /bin/bash

The setarch man page says: -R, "--addr-no-randomize" Disables randomization of the virtual address space (turns on ADDR_NO_RANDOMIZE).

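For OS X 10.6 there is the DYLD_NO_PIE environment variable (check man dyld; in bash, possibly export DYLD_NO_PIE=1). In 10.7 and newer there are several options: the --no_pie build flag, used when building LLVM itself; setting _POSIX_SPAWN_DISABLE_ASLR through posix_spawnattr_setflags before starting llvm; or, on 10.7+, running the script http://src.chromium.org/viewvc/chrome/trunk/src/build/mac/change_mach_o_flags.py with the --no-pie option to clear the PIE flag from the llvm binaries (thanks to the asan people).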

There have been bugs in clang and llvm that prevent (or prevented) them from being completely deterministic, for example:

[cfe-dev] clang: not deterministic anymore? - Nov 3, 2009; non-determinism was detected on the code from LLVM bug 5355. The author says the indeterminism was present only with the -g option enabled
[LLVMdev] Deterministic code generation and llvm::Iterators (2010)
[llvm-commits] Fix some TableGen non-deterministic behavior. (Sep 2012)
r196520 - Fix non-deterministic behavior. - SLPVectorizer was made deterministic only on Dec 5, 2013 (SmallSet replaced with SetVector)
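190793 - TableGen: give asm match classes deterministic order. "TableGen was sorting the entries in some of its internal data structures by pointer." - Sep 16, 2013
LLVM bug 14901 is one such case.

The patch for 14901 contains comments about non-deterministic iteration over llvm::DenseMap: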

-  typedef llvm::DenseMap<const VarDecl *, std::pair<UsesVec*, bool> > UsesMap;
+  typedef std::pair<UsesVec*, bool> MappedType;
+  // Prefer using MapVector to DenseMap, so that iteration order will be
+  // the same as insertion order. This is needed to obtain a deterministic
+  // order of diagnostics when calling flushDiagnostics().
+  typedef llvm::MapVector<const VarDecl *, MappedType> UsesMap;
...
-    // FIXME: This iteration order, and thus the resulting diagnostic order,
-    //        is nondeterministic.

LLVM's documentation says that several internal containers come in non-deterministic and deterministic variants, such as Map vs MapVector. From trunk/docs/ProgrammersManual.rst:

    The difference between SetVector and other sets is that the order of iteration
    is guaranteed to match the order of insertion into the SetVector.  This property
    is really important for things like sets of pointers.  Because pointer values
    are non-deterministic (e.g. vary across runs of the program on different
    machines), iterating over the pointers in the set will not be in a well-defined
    order.

    The drawback of SetVector is that it requires twice as much space as a normal
    set and has the sum of constant factors from the set-like container and the
    sequential container that it uses.  Use it **only** if you need to iterate over
    the elements in a deterministic order.

...

    StringMap iteration order, however, is not guaranteed to be deterministic, so
    any uses which require that should instead use a std::map.
...

    ``MapVector<KeyT,ValueT>`` provides a subset of the DenseMap interface.  The
    main difference is that the iteration order is guaranteed to be the insertion
    order, making it an easy (but somewhat expensive) solution for non-deterministic
    iteration over maps of pointers.
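The idea behind an insertion-ordered map can be sketched with standard containers. The class below is a simplified, hypothetical analogue written for illustration, not llvm::MapVector itself: a vector keeps the entries in insertion order, and a hash map from key to index provides lookup. Each key is stored twice, which mirrors the extra space cost the documentation mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified, hypothetical analogue of an insertion-ordered map: a vector
// holds the entries in insertion order, and a hash map gives key -> index.
template <typename K, typename V>
class InsertionOrderedMap {
public:
    // Inserts (key, value) if the key is new; returns true when inserted.
    bool insert(const K& key, const V& value) {
        auto result = index_.emplace(key, entries_.size());
        if (!result.second) return false;  // key already present
        entries_.emplace_back(key, value);
        return true;
    }

    // Returns a pointer to the value for key, or nullptr if absent.
    const V* find(const K& key) const {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &entries_[it->second].second;
    }

    // Iteration walks the vector, so the order is the insertion order and
    // never depends on the hash of a pointer value.
    typename std::vector<std::pair<K, V>>::const_iterator begin() const { return entries_.begin(); }
    typename std::vector<std::pair<K, V>>::const_iterator end() const { return entries_.end(); }
    std::size_t size() const { return entries_.size(); }

private:
    std::vector<std::pair<K, V>> entries_;
    std::unordered_map<K, std::size_t> index_;
};
```

With pointer keys, for example `InsertionOrderedMap<const int*, char>`, a range-based for loop over the map visits the entries in the order they were inserted, wherever the pointed-to objects live in memory.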

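Some LLVM authors may think there is no need to preserve a deterministic iteration order in their code. For example, there is a comment in ARMTargetStreamer about the use of MapVector for ConstantPools (ARMTargetStreamer.cpp - class AssemblerConstantPools). But how can we be sure that every use of a non-deterministic container like DenseMap leaves the compiler's output unaffected? There are dozens of loops iterating over a DenseMap: "DenseMap.*const_iterator" regex in codesearch.debian.net

Your version of LLVM and clang (3.0, from 2011-11-30) is clearly too old: it predates the determinism fixes from 2012 and 2013 (some of which are listed in my answer). You should update your LLVM and Clang and re-check whether your program compiles deterministically. If it does not, narrow the non-determinism down to a shorter, easier-to-reproduce example (e.g. by saving the bc, bitcode, from the intermediate stages), and then you can file a bug in the LLVM bugzilla.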
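For the re-check step, comparing checksums works, and so does a direct byte-for-byte comparison of two build outputs. The following is a small sketch of the latter in standard C++; the function names are made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file as raw bytes; returns false if it cannot be opened.
bool readFile(const std::string& path, std::string& out) {
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in) return false;
    out.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}

// True only when both files can be read and have identical contents.
bool sameBytes(const std::string& pathA, const std::string& pathB) {
    std::string a, b;
    return readFile(pathA, a) && readFile(pathB, b) && a == b;
}
```

A possible use: compile hello.m twice into two differently named object files (say hello1.o and hello2.o, names chosen here only as an example) and call `sameBytes("hello1.o", "hello2.o")`. Repeating this many times matters, since the question notes that two runs can match by chance.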

Answer 2

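Try using the -S option of clang and gcc when compiling the source. This generates a .s file in which you can read the assembly code, which gives you insight into the lower-level differences. You may find that the output is in fact the same, in which case your problem moves further along from the compiler to the linker.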

Answer 3

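You should report this as a bug; a compiler should certainly be deterministic.

In my experience, your guess about sort order is quite likely correct. Most likely the compiler makes an arbitrary decision when two items compare equal (by whatever measure matters; they need not actually be identical), and that decision can vary with environmental factors. I have seen this in GCC before, where the same compiler built for different host operating systems produced different results; in that case it turned out that the Windows qsort function behaves slightly differently from the Linux (glibc) implementation.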
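The equal-items problem can be illustrated with a short C++ sketch (the `Item` type and function names are invented for this example; this is not taken from GCC or clang). A comparison that looks only at a priority leaves the relative order of equal-priority items up to the sort routine. Two common remedies are a stable sort, which keeps input order among equals, or a tie-breaker that makes the ordering total.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// (priority, name) pairs; several items may share the same priority.
using Item = std::pair<int, std::string>;

// Compares by priority only, so items with equal priority are "equal" to the
// sort even though they are different items.
bool byPriority(const Item& a, const Item& b) { return a.first < b.first; }

// std::stable_sort keeps equal-priority items in their input order.
std::vector<Item> sortStable(std::vector<Item> items) {
    std::stable_sort(items.begin(), items.end(), byPriority);
    return items;
}

// Alternative: break ties on the name so the ordering is total, and any
// correct sort routine yields the same result.
std::vector<Item> sortWithTieBreak(std::vector<Item> items) {
    std::sort(items.begin(), items.end(), [](const Item& a, const Item& b) {
        if (a.first != b.first) return a.first < b.first;
        return a.second < b.second;
    });
    return items;
}
```

By contrast, calling `std::sort` or C `qsort` with `byPriority` alone gives no guarantee about the order of equal-priority items, so two library implementations can legitimately disagree.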

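That said, it could be something else; compilers ought not to make random decisions, but there are plenty of opportunities for unstable arbitrary decisions to creep in (address space randomization, in some way?).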

clang compiler produces different object files from the same source
