bravetraveler an hour ago

Anything made because "we didn't have time"... when an educated participant wouldn't have entertained any of this to begin with

When the bugs near flaws... I'll burn it, you, and myself down

romanhn 6 hours ago

Once I had to track down an issue where, very rarely and with no discernible pattern, the web app would produce garbled PDFs. Turned out this happened when an admin account remotely connected to the app server, which caused a reset of the default screen resolution, which messed up the PDF library that relied on a specific resolution (it was HTML-to-PDF conversion). It happened rarely and randomly because there were multiple web servers and they were occasionally restarted, which would fix the problem until next time.

Another fun problem I dealt with was when I was moving my employer's codebase from Subversion to Mercurial version control ages ago. Everything looked good, except a directory named CVS (after the pharmacy, a customer) was missing. Was banging my head on the table before realizing that the default .hgignore file instructed Mercurial to ignore all contents of .*/CVS (another old version control system).
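
To see why the customer directory vanished, here is a minimal Python sketch of an unanchored regex ignore rule. The pattern is the one quoted above; the `is_ignored` helper and the example paths are hypothetical, and `re.search` is only a rough stand-in for how Mercurial applies regexp patterns.

```python
import re

# Pattern quoted in the comment above. Regex ignore rules are unanchored,
# so the pattern may match anywhere in the path.
IGNORE_PATTERN = re.compile(r".*/CVS")


def is_ignored(path: str) -> bool:
    """Return True if the ignore pattern matches anywhere in the path."""
    return IGNORE_PATTERN.search(path) is not None


# Hypothetical repository paths, for illustration only.
paths = [
    "customers/CVS/contract.txt",    # the customer's directory
    "src/module/CVS/Entries",        # genuine CVS metadata
    "customers/Acme/contract.txt",   # an unrelated customer
]
ignored = [p for p in paths if is_ignored(p)]
```

Both the real CVS metadata and the customer's files end up in `ignored`, because the rule cannot tell them apart by name.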

mikewarot 5 hours ago

1997 - Windows 98, HP 4000 printer drivers assumed you had the floating point libraries loaded into Windows, and just dynamically unloaded them whenever they felt like it. So, everything would work fine, until you did something that had to compute the page dimensions (and thus use floating point).

Forcing the loading of the floating point libraries fixed it, but it took months to track it down.

It turned out that HP hadn't set the optimizing compiler's options properly in their make setup.

  • ZevsVultAveHera 2 hours ago

    Probably the wildest story involving loading/unloading libraries I have ever read. How did you notice that? Did you have to resort to memory debugging?

Terr_ 10 hours ago

Recycling a comment, where part of the annoyance came from the feeling that they should have been asking someone else to solve it: https://news.ycombinator.com/item?id=37859771

_____

[That's like] Me, with zero C/C++ experience, being asked to figure out why the newer version of the Linux kernel is randomly crash-panicking after getting cross-compiled for a custom hardware box.

("He's familiar with the build-system scripts, so he can see what changed.")

-----

I spent weeks of testing slightly different code-versions, different compile settings, different kconfig options, knocking out particular drivers, waiting for recompiles and walking back and forth to reboot the machine, and generally puzzling over extremely obscure and shifting error traces... And guess what? The new kernel was fine.

What was not fine were some long-standing hexadecimal arguments to the hypervisor, which had been memory-corrupting a spot in all kernels we'd ever loaded. It just happened to be that the newer compiles shifted bytes around so that something very important was in the blast zone.

Anyway, that's how 3 weeks of frustrating work can turn into a 2-character change.

  • ZevsVultAveHera 2 hours ago

    Ah, the joy of kernel debugging. Even with years of C experience, it can take weeks to debug trivial mistakes. Been there, and seen others (with long careers in kernels) being there.

JoeAltmaier 10 hours ago

Combination PIT/serial interrupt issue involving microsecond-resolution system programmable interval timer and multi-port serial driver. Would crash every day or so.

Had to create a stress test to reproduce it in minutes, not days. Then traced code paths through timers and serial events to find the problematic path. Turned out there were many: the timer interrupt callback could cancel the interrupt, reschedule the timer, change the interval, or cancel then reschedule. All in the presence of other channel interrupts occurring and overlapping unpredictably. Timers got rescheduled for intervals that had already passed by the time the callback completed. And on and on.

Took a weekend alone with the code and a set of machines, desk-time getting my head around it all, then coding bullet-proof paths for all calls and callbacks for every related system call.

Once it worked, it worked for days then months under test. Nothing is too hard to resist a methodical approach.
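
One of the paths described above, a callback re-arming its timer for a deadline that has already passed by the time the callback returns, can be sketched as a polled toy model in Python. This is not the original driver code: the `Timer` class, `MIN_DELAY`, and the tick-based clock are all assumptions made for illustration, and there are no real interrupts involved.

```python
MIN_DELAY = 1  # hypothetical smallest delay the toy timer can honour, in ticks


class Timer:
    """Toy one-shot timer whose callback may cancel or reschedule it."""

    def __init__(self, callback):
        self.callback = callback
        self.deadline = None  # None means "not armed"

    def schedule(self, now, interval):
        self.deadline = now + interval

    def cancel(self):
        self.deadline = None

    def poll(self, now, callback_cost=0):
        """Fire the callback if due; callback_cost models how long it runs."""
        if self.deadline is None or now < self.deadline:
            return False
        # Disarm first, so a cancel or reschedule inside the callback is
        # the only thing that decides the timer's state afterwards.
        self.deadline = None
        self.callback(self, now)
        finished = now + callback_cost
        # If the callback re-armed the timer for a moment that has already
        # passed, clamp it forward instead of leaving a stale deadline.
        if self.deadline is not None and self.deadline <= finished:
            self.deadline = finished + MIN_DELAY
        return True


fired = []


def on_tick(timer, now):
    fired.append(now)
    timer.schedule(now, 2)  # re-arm from inside the callback


timer = Timer(on_tick)
timer.schedule(0, 5)
timer.poll(5, callback_cost=10)  # callback "runs" until tick 15
```

After the slow callback, the requested deadline (tick 7) is already in the past, so the sketch moves it to tick 16 rather than leaving a stale value behind.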

  • ZevsVultAveHera 2 hours ago

    Ah yes, one of "the funniest" problems. They teach you a lot or drive you insane. Have you ever written about this adventure in some form of an article or narrative story? Would be a great read, I'm sure.

mike_hearn 30 minutes ago

I used to work on Wine, first as a volunteer and later as a job, so spent a lot of time staring at gigantic multi-gigabyte sized logs trying to work out why an app was crashing when running on Linux. Sometimes apps would work fine for me but be reported as crashing by a user, or we wouldn't have access to the app at all, so logs were the only way to work out what was going wrong.

We got a bug report that an app would crash, and I couldn't reproduce it. So we asked the user, are you using the latest version of Wine from our website? "Yes I am". OK, that's odd, send us some logs then. The crash was some sort of memory corruption during startup of the app. Everything seemed to be running fine, the app was loading files and reading registry entries happily, and then suddenly it would segfault in a random place. No opportunity to debug directly, as everything was binary only and only crashing on this guy's machine.

I spent days working painstakingly through hundreds of millions of lines of API call traces, until eventually I found what seemed to be a difference between his logs and mine. In his logs, some registry reads were failing, and in mine they worked. But why?

It turned out that the guy had been lying to us. He hadn't actually installed the app using the Wine downloads from winehq.org, he'd installed it from the Debian repositories. The packages provided by Debian were badly broken: they had split various tools out into a separate -utils package which wasn't installed by default because that complied with Debian standards better. But that was an error because Windows doesn't care about Debian standards and those tools aren't optional there, so many programs assumed those tools were always available. One of them was regedit.exe, which this app's installer was running with some flags to add default registry entries. On Windows this would never fail, so the installer didn't check the error codes and the install failure was silent. And then the app didn't check the error codes when reading the entries either, because again, that would never fail on Windows. So the reads silently did nothing, the memory the app expected to be initialized wasn't, it tried to use it and corrupted its heap which then led to a random crash about a million API calls away. The original failure wasn't even in the logs I was looking at.
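
The silent-failure chain above comes down to a setup step that never checked whether its helper tool ran. A minimal Python sketch of the defensive version follows; `run_setup_step` is a hypothetical helper, not anything from Wine or the app in question.

```python
import subprocess
import sys


def run_setup_step(argv):
    """Run a helper tool and fail loudly if it is missing or exits non-zero."""
    try:
        subprocess.run(argv, check=True, capture_output=True)
    except FileNotFoundError as exc:
        # The tool is not installed at all (the "-utils not installed" case).
        raise RuntimeError(f"helper tool not installed: {argv[0]}") from exc
    except subprocess.CalledProcessError as exc:
        # The tool ran but reported failure through its exit status.
        raise RuntimeError(
            f"{argv[0]} exited with status {exc.returncode}"
        ) from exc


# A step that succeeds, using the current Python interpreter as a stand-in tool.
run_setup_step([sys.executable, "-c", "pass"])
```

With a check like this, the failure would surface at install time instead of as heap corruption a million calls later.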

At the time we had an explicit policy of not supporting anyone who installed Wine from their distribution packages, exactly because of bugs like this. Instead the project provided its own apt repositories. The distro-centric model Linux used was just broken because it led to packagers who weren't a part of the upstream communities "fixing" software they didn't understand as they packaged it. The notorious SSH bug was another case of that, but such stories are commonplace. Debian users in particular were hard to deal with because the large number of community-built packages was part of the distro's appeal and moat, even though upstream developers often hated it (lots of obsolete bug reports or distro-created bugs). So they had become defensive, and some had taken to deceiving upstreams when filing bugs because they thought they knew better.

Needless to say, a multi-day memory corruption debugging session that ended with "there is no bug, follow the install instructions on our website and stop lying to us about it" was by far the most annoying bug I ever had to work on.

erdaniels 11 hours ago

Upgrading from Qt 4 to Qt 5 broke the appending of QStrings to QByteArrays such that it stored only half the data from a QString (some wonkiness with UTF-8 and UTF-16, IIRC). It took a rewrite of the RTMP/AMF layer in the codebase to figure it out.
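
As a rough illustration of how "half the data" can arise from mixing encodings, here is a Python sketch. It does not reproduce the actual Qt behaviour; it only shows one generic way a character count applied to a UTF-16 buffer loses half the bytes, and the variable names are made up for the example.

```python
text = "héllo"

utf16 = text.encode("utf-16-le")  # 2 bytes per character for this string
utf8 = text.encode("utf-8")       # 1 byte per ASCII character, 2 for "é"

# Bug pattern: treat the character count as a byte count when copying
# out of a UTF-16 buffer. Only half of the buffer survives.
truncated = utf16[: len(text)]

# Safer pattern: pick the encoding explicitly and let the byte length
# follow from the encoded data, not from the character count.
payload = bytearray()
payload += text.encode("utf-8")
```

Here `truncated` holds exactly half the bytes of `utf16`, while `payload` round-trips back to the original string.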

billconan 11 hours ago

A rendering corruption or perf issue in Wayland that involved 100 processes.