Two noob questions: What are the technical reasons that glibc cannot adhere to t...

cesarb · on Nov 14, 2018

> What are the technical reasons that glibc cannot adhere to the Linux dogma, "Don't break userspace?"

I don't know if they have something like that officially, but in practice, they do follow it. Programs linked to an older version of glibc continue working with a newer version of glibc, in a large part thanks to symbol versioning, which allows them to keep the old versions of an interface available to old binaries, while new binaries get the new functionality.

> If they screw up and make a bad interface just modify it and bump the version number.

Bumping the glibc version number would mean recompiling everything (a program can't have two versions of glibc at the same time, so all libraries a program links to would have to be recompiled); we had that in the libc5 to libc6 transition last century. And since they won't bump the version number, it means they will have to keep the bad interface forever, even if it's just visible to binaries compiled against an older glibc.

For a recent example in which they actually went ahead and removed a bad interface: https://lwn.net/Articles/673724/ and https://sourceware.org/bugzilla/show_bug.cgi?id=19473 -- and according to the later, they did it in a way which still kept existing binaries working.

jancsika · on Nov 14, 2018

> I don't know if they have something like that officially, but in practice, they do follow it.

If that's true then I don't understand ldarby's comment on the article:

> The common problem that I suspect Cyberax is actually moaning about is if software uses other calls like memcpy() which on centos 7 gets a version of GLIBC_2.14:

> readelf -a foo | grep memcpy 000000601020 000300000007 R_X86_64_JUMP_SLO 0000000000000000 memcpy@GLIBC_2.14 + 0 3: 0000000000000000 0 FUNC GLOBAL DEFAULT UND memcpy@GLIBC_2.14 (3) 55: 0000000000000000 0 FUNC GLOBAL DEFAULT UND memcpy@@GLIBC_2.14

> and this doesn't work on centos 6:

> ldd ./foo > ./foo: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./foo)

Just to be clear, my original question is why glibc technically cannot follow the same exact development model of the Linux kernel for retaining backward compatibility.

cesarb · on Nov 14, 2018

It's backwards compatible: software compiled on centos 6 will be using the older memcpy@GLIBC_2.2.5 symbol, which still exists on newer glibc together with the current memcpy@GLIBC_2.14 symbol, so it will work. What doesn't work is compiling with a newer glibc and expecting it to work on an older system.

The example above is actually a great example of bending over backwards to keep compatibility with broken userspace. Some programs incorrectly called memcpy with overlapping inputs, and an optimized version of memcpy started breaking these programs. Instead of just letting them break, the older symbol was kept with a slower implementation which accepts overlapping inputs, while new programs get the faster implementation at the memcpy@GLIBC_2.14 symbol.

rkeene2 · on Nov 14, 2018

glibc could follow the same model, but it would be wasteful for them to do so, without symbol versioning. With symbol versioning they DO. The memcpy(3) example is why. An old implementation of memcpy(3) supplied in older versions of glibc happened to not care if the memory regions overlapped. The interface for memcpy(3) said that they may not (or implementation defined behaviour occurred).

On some platforms, new instructions came out and allowed the glibc maintainers to write a faster version of memcpy(3) that still satisfied its documented interface, however it did not retain the undocumented behavior of allowing overlapping memory ranges.

So without symbol versioning we are given the options to either: 1. All be stuck with a slow memcpy(3) forever; or 2. break glibc user's code

Neither of those options were great. So the glibc maintainers decided to write a new function, let's call it "memcpy_fast()"[0]. But how do you get everyone to use it ? Symbol versioning is the answer here. At compile-time linking, there is a directive that tells the compiler that the current implementation of memcpy(3) is "memcpy_fast()", and that's the symbol that gets embedded into the executable's symbol table. If your code wasn't compiled against a glibc where this was available, you'd be using an older implementation. This gets you the best of both worlds: 1. Existing binaries continue to work (without the upgraded code path); 2. Newly produced binaries use the upgraded code path, and in theory are tested to ensure that they are working.

This does prevent executables from being compiled against a newer glibc than intended to be executed against... but, so what ? Linux ALSO doesn't guarantee that you that newer call semantics are available in older systems. The solution here is to either specifically indicate that you want the unversioned symbol, or compile against the lowest version of everything you wish to support. glibc is far from your biggest problem here given the ABI stability of many libraries.

Non-solutions: 1. Use macros: Side-effects 2. Just expose a new, unversioned symbol: Nobody will use it, you'll have to document it, it'll be a platform-specific call. If people do use it, then their binaries can't be used on older platforms (just like symbol versioning)

[0] The symbol is referred to as memcpy@@GLIBC_2.14

nwmcsween · on Nov 14, 2018

That's not how symbols work sans symbol versioning, you see c doesn't mangle symbols thus function foo on version 1 is the same as function foo on version 2 even with a completely different signature.

Annatar · on Nov 14, 2018

There are no technical reasons. The issue is that the kernel and glibc are not one coherent whole because they were always developed by two unrelated groups. glibc wasn't even designed for Linux, but for the GNU kernel.

The end result is that the users of GNU/Linux will always draw the short end of the stick.

jancsika · on Nov 14, 2018

> The issue is that the kernel and glibc are not one coherent whole because they were always developed by two unrelated groups.

But AFAICT the glibc dogma is based on the premise that it would be impossible for a large, complex project to have backward compatibility without making regular changes to the extant interfaces that it provides. Given that premise glibc devs seem to have some process for figuring out what "correctness" means for time=now and then noodle around with their interface to reflect that correctness in the next version of the lib. Thus symbol versioning is employed.

At the same time, Linux is a large, complex project with backward compatibility which does significantly less noodling around with the extant interface. AFAICT the process consists mainly of a) devs breaking the extant interface for correctness, b) a user submitting a bug, and c) the lead dev surrounding the declarative sentence "We don't break userspace" with imperative sentences containing curses and then rejecting the change.

I've read where Linus and others have tried to defend their choice and argued that the glibc dev process is worse. Regardless of the persuasiveness of that argument, I've read it and am familiar with it.

I am not familiar with the glibc argument as to why they require regular interface changes, nor an acknowledgement that a closely connected large complex project gets by without that. I don't see anything on the glibc FAQ about it-- only a question about symbol versioning where the answer assumes that the interface must change.

Annatar · on Nov 18, 2018

But AFAICT the glibc dogma is based on the premise that it would be impossible for a large, complex project to have backward compatibility without making regular changes to the extant interfaces that it provides.

Yeah well tell that to the engineers of HP-UX, IRIX, and Solaris, because all of those managed to produce libc's which were backward compatible. Sun Microsystems even legally warranted Solaris and therefore libc, they were that paranoid about backwards-compatibility.

That's not the issue. The issue is that glibc is developed by people who are not and never were system engineers and instead of learning from the masters, asking them how to do it correctly, or sticking with BSD when its situation was dire, they just decided to re-invent the wheel.

One does not simply re-invent glibc from first principles, especially so if one does not have the requisite insights and experience, which they didn't and they still don't, and most likely if they haven't by now, never will. GNU developers are a lost cause. Just look at how long it took them to "discover" versioned interfaces with linker map files, something Solaris system engineers have been using since the early '90's of the past century, and everything becomes crystal clear, if one knows the Red flags. That's one Red flag right there, "late in phase and unlikely to ever catch up".

cesarb · on Nov 14, 2018

> At the same time, Linux is a large, complex project with backward compatibility which does significantly less noodling around with the extant interface.

Take a look at the system call table:

    #define __NR_oldstat 18
    #define __NR_oldfstat 28
    #define __NR_oldolduname 59
    #define __NR_oldlstat 84
    #define __NR_olduname 109
    [...]
    #define __NR_dup3 330
    #define __NR_pipe2 331
    #define __NR_preadv2 378
    #define __NR_pwritev2 379

(And that's before the impending Y2038 changes to the API.)

The main difference is that, instead of defining a new symbol, a new system call number is defined. The effect is similar: a program using the new "stat" system call (106) won't work on an older kernel which doesn't have it, while on the opposite direction it still works (new kernels still understand the old system call).

One thing the kernel developers do nowadays to reduce the API churn is to add a flags argument to every new system call (for instance, the "dup3" above is the same as "dup2", but with a flags argument). Even then, if you try to use a flag which the current kernel doesn't know, it won't work (the kernel developers learned the hard way that you can't ignore unknown flags, since programs will pass them and then break on newer kernels).

And that's without considering the "escape hatches" of ioctl() and fcntl(), or the virtual filesystems like /proc and /sys, which are also part of the Linux kernel API. So yes, the Linux kernel does see regular interface changes.

jancsika · on Nov 14, 2018

Still trying to get my bearings, so bear with me...

Here are two different types of backward compatibility:

1. will old binary work with the new version?

2. will old code build and run correctly with the new version?

So when I talk about extant interface changes, I'm speculating that old code that leverages the public Linux interface is more likely to work and work correctly vs. old code that leverages the public glibc interface.

For example: suppose foodev built a Linux driver for a very popular piece of hardware in 2003, abandoned it, and in 2018 there are problems getting it to run correctly. Are those problems more likely due to Linux public interface churn or glibc public interface churn?

cesarb · on Nov 15, 2018

For a driver, there would be problems getting it to run correctly in 2004 already, since the internal Linux kernel API used by drivers is not stable and changes very rapidly. Unless the driver has been upstreamed, because developers updating the kernel internal APIs also update all the in-tree drivers at the same time.

About leveraging one interface or the other: when you use the glibc interface, you are also using the Linux interface behind it, so a change in either can affect your program. On the other hand, if you are using the Linux interface directly instead of going through the C library, chances are you are doing something unusual, which increases the risk of it breaking by accident. And there are some things which exist only on the glibc interface, like nameserver lookups (getaddrinfo), user database lookups (getpwnam/getpwuid), and many more.

Annatar · on Nov 15, 2018

"For a driver, there would be problems getting it to run correctly in 2004 already,"

This is unthinkable on Solaris / illumos kernels because of DDI / DDK interfaces: I can take a driver from 1993 for Solaris 2.5.1 and modload(1M) it into the latest nightly illumos build and I'm guaranteed that it will work.