What every compiler writer should know about programmers (2015) [pdf] (tuwien.ac.at)
62 points by tosh 6 hours ago | 39 comments



In the end it all boils down to a very simple argument. The C programmers want the C compilers to behave one way, the C implementers want the C compilers to behave the other way. Since the power structure is what it is (the C implementers are the ones who write the C standard and the ones who actually get to implement the C compilers), the C compilers do, and will, behave the way the C implementers want them to.

In this situation the C programmers can either a) accept that they're programming in a language that exists as it exists, not as they'd like it to exist; b) angrily deny a); or c) switch to some other system-level language with defined semantics.


Given what most C compilers are written in, are C programmers also C implementers?

I suspect it also depends on who exactly the compiler writers are; the GCC and LLVM guys seem to have more theoreticians/academics and thus think of the language more abstractly, leading to UB being truly inexplicable and free of thought, while MSVC and ICC are more on the practical side and their interpretation of it is, as the standard says, "in a documented manner characteristic of the environment". IMHO the "spirit of C" and the more commonsense approach is definitely the latter, and K&R themselves have always leaned in that direction. This is very much a "letter of the law vs. spirit of the law" argument. The fact that these two different sides have produced compilers with nearly the same performance characteristics shows IMHO that the claim that exploiting UB is mandatory for performance is a debunked myth.


Another alternative is that programmers write their own C compiler and be free of these politics. Maybe I am biased since I am working on exactly such a project, but I have been seeing more and more in-progress compiler implementations for C or C-like languages over the past couple of years.

And please provide feedback to WG14. Also please give feedback and file bugs for GCC / clang. There are users of C in the committee and we need your support. Also keeping C implementable for small teams is something that is at risk.

> behave the way the C implementers want them to

If you don't please your users, you won't have any users.


It's ironic that I have to tell you of all people this, but many users of C (or at least, backends of compilers targeted by C) do actually want the compiler to aggressively optimize around UB.

If you're self hosting your compiler on C, you are your own user.

Which users?

And yet, C++.

> And yet, C++.

By any metric, C++ is one of the most successful programming languages devised by mankind, if not the most successful.

What point were you trying to make?


How about we agree on the ABI and everyone can have their own C compiler. Everyone C's the world through their own lenses.

We're not too far away from that. At the very least, Claude can provide feedback and help decide which compiler options to use, as per developer preference.

Compiler developers hijacked and twisted the term "Undefined Behavior". Everyone understood what UB was in K&R C: if you write code that the standard doesn't define a meaning for, the compiler outputs what it outputs. If you dereference a null pointer, the compiler outputs a null pointer dereference, and when you hit it at runtime you get the undefined behavior (page fault on modern systems).

Nowadays, UB means something completely different - if at any point in time, the compiler reasons out that a piece of code is only reachable via UB, it will assume that this can never happen, and will quietly delete everything downstream:

https://godbolt.org/z/EYxWqcfjx
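
A minimal sketch of the kind of pattern being described, not necessarily what the linked Godbolt example contains. The names `read_flag` and `read_flag_demo` are made up for illustration. Because `*p` is read before the check, an optimizing compiler is allowed to reason that `p` cannot be null on any defined execution and may drop the `if` entirely; whether a particular compiler does so depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: the dereference happens before the null check. */
int read_flag(const int *p)
{
    int value = *p;      /* undefined behavior if p is NULL */

    if (p == NULL) {     /* only reachable after the UB above */
        return -1;       /* so an optimizer may delete this branch */
    }
    return value;
}

/* Defined-behavior usage: a valid pointer always takes the normal path. */
void read_flag_demo(void)
{
    int x = 7;
    assert(read_flag(&x) == 7);
}
```

Moving the check above the dereference keeps the branch meaningful under either reading of UB.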


Sorry if I’m missing something as this isn’t my field, but shouldn’t the two meanings be roughly equivalent to the user?

As in, everything down from UB is only working by an accident of implementation that does not need to hold, and you should explicitly not rely on that. Whether the compiler happens to explicitly make it not ever work or just leaves it to fate should not be relevant.



Thanks! Macroexpanded:

What every compiler writer should know about programmers (2015) [pdf] - https://news.ycombinator.com/item?id=19659555 - April 2019 (62 comments)

What every compiler writer should know about programmers [pdf] - https://news.ycombinator.com/item?id=11219874 - March 2016 (106 comments)


Note that all the examples come from lack of bounds checking.

This was 2015, and we still have no -Wdeadcode, a warning about the removal of "dead code", i.e. what compilers think of as dead code. If a program writer writes code, it is never dead. It is written. It had purpose. If the compiler thinks this is wrong, it needs to warn about it.

The only dead code is generated code by macros.


Dead code is extremely common in C or C++ after inlining and other optimizations.

OP means that the code has a dual purpose: one purpose is to be compiled, the other is to communicate structure or intent to programmers.

Do we know that? I've written "dead" code. Its point was to communicate structure or intent, but it was also still dead. This pattern, in one form or another, crops up a lot IME (in multiple languages, even, with varying abilities to optimize it):

  if condition that is "always" false:
    abort with message detailing the circumstances

That `if` is "dead", in the sense that the condition is always false. But "dead" sometimes is just a proof (or, if I'm not rigorous enough, an assumption) in my head. If the compiler can prove the same proof I have in my head, then the dead code is eliminated. If it can't, well, presumably it is left in the binary, either to never be executed, or to be executed in the case that the proof in my head is wrong.
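
One possible C rendering of that pattern, as a sketch only. The `enum color` type and the `color_name` and `color_name_demo` functions are invented for illustration. The code after the `switch` is "dead" as long as every caller passes a valid enumerator; if the compiler cannot prove that, the abort path stays in the binary as a safety net.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum color { RED, GREEN, BLUE };   /* hypothetical enum for illustration */

const char *color_name(enum color c)
{
    switch (c) {
    case RED:   return "red";
    case GREEN: return "green";
    case BLUE:  return "blue";
    }
    /* "Always" unreachable: only hit if a caller passes a value that is
     * not one of the enumerators, i.e. if the proof in my head is wrong. */
    fprintf(stderr, "color_name: unexpected value %d\n", (int)c);
    abort();
}

/* Usage with valid input; the abort path is never taken here. */
void color_name_demo(void)
{
    assert(strcmp(color_name(GREEN), "green") == 0);
}
```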

Some conditions depend strictly on inputs and the compiler can't reason much about them, and the developers can't be sure about what their users will do. So that pattern is common. It's a sibling of assertions.

There are even languages with mandatory else branch.


Or stubs. I'll often flesh out a class before implementing the methods.

That's the problem

Why is that a problem? Inlining and optimization aren't minor aspects of compiling to native code, they are responsible for order-of-magnitude speedups.

My point is that it is easy to say "don't remove my code" while looking at a simple single-function example, but in actual compilation huge portions of a function are "dead" after inlining, constant propagation and other optimizations, and that is before saying anything about C-specific UB or other shenanigans. You do want the compiler to throw that dead code out.


That's not true for all code bases. Two common examples:

It's very common for inline functions in headers to be written so that inlining and constant propagation from arguments result in dead code and better generated code. There is even __builtin_constant_p() to help with such things (e.g., you can use it to have a fast folded inline variant if an argument is constant, or call big out-of-line library code if it is variable).
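
A rough sketch of that idiom, assuming GCC or Clang (`__builtin_constant_p` is a compiler extension, not standard C). The functions `popcount_slow`, `popcount_u` and `popcount_demo` are hypothetical names. Both paths compute the same result; which one is taken depends on the compiler and optimization level, and without optimization the builtin typically evaluates to 0 for a function parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Out-of-line general version (stand-in for "big library code"). */
uint32_t popcount_slow(uint32_t x)
{
    uint32_t n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}

static inline uint32_t popcount_u(uint32_t x)
{
    if (__builtin_constant_p(x)) {
        /* Branch-free expression the compiler can fold to a constant
         * when x is known at compile time after inlining. */
        uint32_t v = x;
        v = v - ((v >> 1) & 0x55555555u);
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
        return (((v + (v >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
    }
    /* When x is not a compile-time constant, the branch above is dead
     * code and only this call remains. */
    return popcount_slow(x);
}

void popcount_demo(void)
{
    volatile uint32_t runtime_value = 0xFFu;   /* not a constant */
    assert(popcount_u(0xF0u) == 4u);
    assert(popcount_u(runtime_value) == 8u);
}
```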

There are also configuration systems that end up with config options in headers that code tests with if (CONFIG_BLAH) {...} that can evaluate to zero in valid builds.
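
A small sketch of that configuration style. `CONFIG_BLAH` comes from the comment above; `blah_adjust`, `process` and `process_demo` are hypothetical. When the option is 0 the guarded block is dead code in a perfectly valid build, yet it is still parsed and type-checked, which is usually the reason for preferring `if` over `#if`.

```c
#include <assert.h>

/* Normally provided by a generated config header; defaulted here. */
#ifndef CONFIG_BLAH
#define CONFIG_BLAH 0
#endif

int blah_adjust(int v)          /* hypothetical optional feature */
{
    return v * 2;
}

int process(int v)
{
    if (CONFIG_BLAH) {          /* constant condition: dead when 0 */
        v = blah_adjust(v);
    }
    return v + 1;
}

void process_demo(void)
{
    assert(process(1) == (CONFIG_BLAH ? 3 : 2));
}
```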


Here's a cogent argument that any decision by compiler writers that they can do whatever they wish whenever they encounter an "undefined behavior" construct is rubbish:

https://www.yodaiken.com/2021/05/19/undefined-behavior-in-c-...

And here's a cautionary tale of how a compiler writer doing whatever they wish once they encounter undefined behavior makes debugging intractable:

https://www.quora.com/What-is-the-most-subtle-bug-you-have-h...


> undefined behavior makes debugging intractable:

By their own admission, the compiler warns about the UB. "-Wanal"¹, as some call it, makes it an error. Under UBSan the program aborts with:

  code.cpp:4:6: runtime error: execution reached the end of a value-returning function without returning a value

… "intractable"?

¹a humorous name for -Wextra -Wall -Werror


Wow, that's a very tortured reading of a specific line in a standard. And it doesn't really matter what Yodaiken thinks this line means, because the standard is written by C implementers for (mostly) C implementers. So if C compiler writers think this line means they can use UB for optimizing purposes, then that's what it means.

Yeah, I know it breaks the common illusion among the C programmers that they're "close to the bare metal", but illusions should be dispersed, not indulged. The C programmers program for the abstract C machine which is then mediated by the C compilers into machine code the way the implementers of C compilers publicly documented.


Yeah, this is basically Sovereign Citizen-tier argumentation: through some magic of definitions and historical readings and arguing about commas, I prove that actually everyone is incorrect. That's not how programming languages work! If everyone for 10+ years has been developing compilers with some definition of undefined behavior, and all modern compilers use undefined behavior in order to drive optimization passes which depend on those invariants, there is no possible way to argue that they're wrong and you know the One True C Programming Language interpretation instead.

Moreover, compiler authors don't just go out maliciously trying to ruin programs by finding more and more tortuous undefined behavior for fun: the vast majority of undefined behavior in C covers things that, if the compiler couldn't assume the programmer upheld them, would inhibit trivial optimizations that the programmer also expects the compiler to perform.
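
Signed integer overflow is a commonly cited case; the sketch below is an illustration with made-up function names, not something taken from the linked paper. Since `x + 1` overflowing is undefined for `int`, a compiler may simplify the signed comparison to a constant, while the unsigned version has defined wraparound and cannot be simplified that way.

```c
#include <assert.h>
#include <limits.h>

/* Signed: overflow is undefined, so a compiler may fold this to 1. */
int next_is_greater(int x)
{
    return x + 1 > x;
}

/* Unsigned: wraparound is defined, so UINT_MAX + 1u == 0 must be honored. */
int next_is_greater_u(unsigned x)
{
    return x + 1u > x;
}

void overflow_demo(void)
{
    assert(next_is_greater(41) == 1);          /* no overflow: defined */
    assert(next_is_greater_u(41u) == 1);
    assert(next_is_greater_u(UINT_MAX) == 0);  /* wraps to 0 */
    /* next_is_greater(INT_MAX) would be undefined behavior. */
}
```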


I find where the argument gets lost is when undefined behavior is assumed to be exactly that, an invariant.

That is to say, I find "could not happen" the most bizarre reading to make when optimizing around undefined behavior. "Whatever the machine does" makes sense, as does "we don't know". But "could not happen"? If it could not happen, the spec would have said "could not happen". Instead, the spec does not know what will happen and so punts on the outcome, knowing full well that it will happen all the time.

The problem is that there is no optimization to make around "whatever the hardware does" or "we have no clue", so the incentive is to choose the worst possible reading: "undefined behavior is incorrect code and therefore a correct program will never have it".


Some behaviors are left unspecified instead of undefined, which allows each implementation to choose whatever behavior is convenient, such as, as you put it, whatever the hardware does. IIRC this was the case in C before C99 for division and modulo with negative operands (strictly, implementation-defined rather than unspecified).
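
For reference, since C99 integer division truncates toward zero and the remainder takes the sign of the dividend, so these particular results are fully specified on a conforming C99 or later compiler. A quick check, with a hypothetical function name:

```c
#include <assert.h>

/* Results required by C99 and later: the quotient truncates toward
 * zero, and (a / b) * b + a % b == a. */
void check_truncating_division(void)
{
    assert(-7 / 3 == -2);
    assert(-7 % 3 == -1);
    assert(7 % -3 == 1);
    assert(-7 % -3 == -1);
    assert((-7 / 3) * 3 + (-7 % 3) == -7);
}
```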

I would imagine that the standard writers choose one or the other depending on whether the behavior is useful for optimizations. There's also the matter that if a behavior is currently undefined, it's easy to later on make it unspecified or specified, while if a behavior is unspecified it's more difficult to make it undefined, because you don't know how much code is depending on that behavior.


Aliasing being the classic example. If code generation for every pointer dereference has to assume that it’s potentially aliasing any other value in scope, things get slow in a hurry.
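
A minimal sketch of type-based aliasing, with hypothetical function names. Because an `int` object may not be accessed through a `float` lvalue, the compiler is allowed to assume the store through `f` leaves `*i` untouched and may return the constant 1 without reloading from memory. Passing pointers to the same storage would be undefined behavior, so the usage below keeps the objects distinct.

```c
#include <assert.h>

int store_both(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;   /* assumed not to modify *i (different type) */
    return *i;   /* may be compiled as "return 1" with no reload */
}

void aliasing_demo(void)
{
    int a = 0;
    float b = 0.0f;
    assert(store_both(&a, &b) == 1);   /* distinct objects: defined */
    assert(a == 1);
    assert(b == 2.0f);
}
```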

Compiler writers are free to make whatever intentional choices they want and document them. UB is especially nasty compared to other kinds of bugs because implementors can't/refuse to commit to any specific behavior, not because they've chosen the wrong behaviors.

Making C compilers better and more predictable is impossible with so many UB cases listed in the standard. A better language should be used instead, where UB and implementation-defined behavior cases are minimized.

Where there is UB in the standard it means that a C compiler is free to define the behavior. So of course, somebody could write a C implementation which does this. See also Fil-C for a perfectly memory safe version of C. So the first sentence makes no sense.

But also note that there is an ongoing effort to remove UB from the standard. We have already eliminated about 30% of UB in the core language for the upcoming version C2Y.


Have you seen Rust? I'm loving it.

Rust is not super appealing to me as a C user: too complex, slow compilation, etc.

Maybe Zig, Hare or C3 then?


