Does a compiler always produce an assembly code?

TL:DR: handling different object file formats, and historically easier portability to new Unix platforms, is one of the main reasons gcc keeps the assembler separate from the compiler, I think. Outside of gcc, the mainstream x86 C and C++ compilers (clang/LLVM, MSVC, ICC) go straight to machine code, with the option of printing asm text if you ask them to.

LLVM and MSVC are, or come with, complete toolchains, not just compilers: they also include an assembler and a linker. LLVM already has object-file handling available as a library, so it can use that instead of writing out asm text to feed to a separate program.

Smaller projects often choose to leave object-file format details to the assembler. e.g. FreePascal can go straight to an object file on a few of its target platforms, but otherwise only to asm. There are many claims (1, 2, 3, 4) that almost all compilers go through asm text, but that’s not true for many of the biggest most-widely-used compilers (except GCC) that have lots of developers working on them.

C compilers tend to either target a single platform only (like a vendor’s compiler for a microcontroller) and were written as “the/a C implementation for this platform”, or be very large projects like LLVM where including machine code generation isn’t a big fraction of the compiler’s own code size. Compilers for less widely used languages are more usually portable, but without wanting to write their own machine-code / object-file handling. (Many compilers these days are front-ends for LLVM, so get .o output for free, like rustc, but older compilers didn’t have that option.)

Out of all compilers ever, most do go to asm. But if you weight by how often each one is used every day, builds that go straight to a relocatable object file (.o / .obj) are a significant fraction of the total builds done on any given day worldwide. i.e. the compiler you care about if you’re reading this might well work this way.

Also, compilers like javac that target a portable bytecode format have less reason to use asm; the same output file and bytecode format work across every platform they have to run on.

https://retrocomputing.stackexchange.com/questions/14927/when-and-why-did-high-level-language-compilers-start-targeting-assembly-language on retrocomputing has some other answers about the advantages of keeping the assembler (as) separate.
What is the need to generate ASM code in gcc, g++
What do C and Assembler actually compile to? – even compilers that go straight to machine code don’t produce linked executables directly, they produce relocatable object files (.o or .obj). Except for tcc, the Tiny C Compiler, intended for use on the fly for one-file C programs.
Semi-related: Why do we even need assembler when we have compiler? asm is useful for humans to look at machine code, not as a necessary part of C -> machine code.
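
For ELF targets, the difference between a relocatable object file and a linked executable is visible in the file header. The following is a minimal sketch in C, assuming only the ELF layout (a 16-byte e_ident block followed by a 16-bit e_type field); elf_kind is a hypothetical helper name, and PE/COFF or Mach-O files would need different checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Classify an ELF file from its first bytes.
 * hdr: start of the file, len: number of bytes available.
 * Hypothetical helper for illustration, not part of any toolchain. */
const char *elf_kind(const unsigned char *hdr, size_t len)
{
    assert(hdr != NULL);

    /* Need e_ident (16 bytes) plus the 2-byte e_type field. */
    if (len < 18 || memcmp(hdr, "\177ELF", 4) != 0)
        return "not ELF";

    /* e_ident[5] (EI_DATA) gives the byte order of multi-byte fields:
     * 1 = little-endian, 2 = big-endian. */
    unsigned type;
    if (hdr[5] == 1)
        type = hdr[16] | (hdr[17] << 8);
    else if (hdr[5] == 2)
        type = (hdr[16] << 8) | hdr[17];
    else
        return "unknown byte order";

    switch (type) {
    case 1:  return "relocatable object (ET_REL)";   /* what a compiler or assembler emits */
    case 2:  return "executable (ET_EXEC)";          /* what the linker emits */
    case 3:  return "shared object or PIE (ET_DYN)"; /* also linker output */
    default: return "other";
    }
}
```

Feeding it the first 18 bytes of a .o file produced on an ELF system should give the ET_REL answer, while a linked program should give ET_EXEC or ET_DYN.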

Why GCC does what it does

Yes, the assembler (as) is a separate program; the gcc front-end runs it separately from cc1 (the C preprocessor+compiler that produces text asm).

This makes gcc slightly more modular, making the compiler itself a text -> text program.
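
To make the text -> text idea concrete, here is a minimal sketch in C of a toy back end that writes asm text for one trivial function. It is not how cc1 is implemented: emit_return_const is a hypothetical helper, and the output assumes x86-64, AT&T (GAS) syntax, and an ELF-style symbol name with no leading underscore.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy "compiler" back end: emits asm text equivalent to
 *     int name(void) { return value; }
 * into buf. Returns what snprintf returns: the number of characters
 * the full output needs, or a negative value on error.
 * Hypothetical helper for illustration, not part of GCC. */
int emit_return_const(char *buf, size_t size, const char *name, int value)
{
    assert(buf != NULL && name != NULL);

    return snprintf(buf, size,
                    "\t.text\n"              /* switch to the code section  */
                    "\t.globl\t%s\n"         /* export the symbol           */
                    "%s:\n"                  /* label = function entry      */
                    "\tmovl\t$%d, %%eax\n"   /* return value goes in eax    */
                    "\tret\n",
                    name, name, value);
}
```

Writing the resulting buffer to a .s file and turning it into a .o is then entirely the job of a separate assembler, which is the split the gcc driver relies on.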

GCC internally uses some binary data structures for GIMPLE and RTL internal representations, but it doesn’t write (text representations of) those IR formats to files unless you use a special option for debugging.

So why stop at assembly? This means GCC doesn’t need to know about different object file formats for the same target. For example, different x86-64 OSes use ELF, PE/COFF, or MachO64 object files, and historically a.out. The assembler (as) turns the same text asm into the same machine code, surrounded by different object file metadata on different targets. (There are minor differences gcc has to know about, like whether to prepend an _ to symbol names, whether 32-bit absolute addresses can be used, and whether code has to be PIC.)
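
One of those minor differences, the leading underscore, can be sketched in C as follows. The enum and function names are hypothetical, not GCC internals, and the sketch assumes the common convention that Mach-O targets prepend _ to C symbol names while ELF targets do not; other formats and ABIs have their own rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Object-file formats covered by this sketch (hypothetical enum). */
enum obj_format { FMT_ELF, FMT_MACHO };

/* Write the asm-level name for the C symbol c_name into buf.
 * Assumption: Mach-O prepends '_' to C names, ELF uses them unchanged.
 * Returns what snprintf returns (length needed, or negative on error). */
int asm_symbol_name(char *buf, size_t size,
                    enum obj_format fmt, const char *c_name)
{
    assert(buf != NULL && c_name != NULL);

    const char *prefix = (fmt == FMT_MACHO) ? "_" : "";
    return snprintf(buf, size, "%s%s", prefix, c_name);
}
```

A compiler that emits text asm only needs a small per-target decision like this one; everything about section headers, relocation records, and symbol-table layout stays inside the assembler.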

Any platform-specific quirks can be left to GNU binutils as (aka GAS), or gcc can use the vendor-supplied assembler that comes with a system.

Historically, there were many different Unix systems with different CPUs, or especially the same CPU but different quirks in their object file formats. More importantly, their assemblers shared a fairly compatible set of directives like .globl main, .asciiz "Hello World!\n", and similar. GAS syntax comes from Unix assemblers.

It really was possible in the past to port GCC to a new Unix platform without porting the GNU assembler, just using the assembler that comes with the OS.

Nobody has ever gotten around to integrating an assembler as a library into GCC’s cc1 compiler. That’s been done for the C preprocessor (which historically was also done in a separate process), but not the assembler.

Most other compilers do produce object files directly from the compiler, without a text asm temporary file / pipe. Often that is because the compiler was only designed for one or a couple of targets, like MSVC or ICC or various compilers that started out as x86-only, or many vendor-supplied compilers for embedded chips.

clang/LLVM was designed much more recently than GCC. It was designed to work as an optimizing JIT back-end, so it needed a built-in assembler to make it fast to generate machine code. To work as an ahead-of-time compiler, adding support for different object-file formats was presumably a minor thing since the internal software architecture was there to go straight to binary machine code.

LLVM of course uses LLVM-IR internally for target-independent optimizations before looking for back-end-specific optimizations, but again it only writes out this format as text if you ask it to.
