Tuesday, August 23, 2016

Ghosts in the (Odin) Machine

Heh.

A tale of tears.

I use an old Win XP craptop (32-bit, 2 GB RAM) for anything dodgy - for instance running Odin or any other code of "uncertain provenance".

I leave it air-gapped and have a ghost image backup so the whole OS can be nuked and put back to square zero. (see "ntfsclone") Anything I need for it is just sneaker-netted.

So I'm trying to use it with Odin the other day, and I get irreproducible errors. With pure stock flashes, sometimes success, sometimes failures. Sometimes in the MD5 check in Odin, sometimes an "Auth Fail" on the phone.

WTF?

So I start doing MD5 checks manually. OK, bad checksums, there's the trouble. MD5s are OK on the sneaker-net USB stick, but sometimes not OK on the craptop HDD. No hardware complaints in the Event Viewer.

I temporarily conclude I have a dodgy USB port.

Use a different port, recopy all files. Check MD5s. All OK. Problem solved, right?

Run the MD5 check using Odin on 8 different stock firmwares (2.5 GB each, this is slow work). One of eight is bad. What? No event log hardware troubles evident.

Re-check MD5 on the bad one; it's correct. WHAT?

Out of frustration, I write a script that repeatedly loops over all eight blobs, computing MD5 values and comparing to past results. Let it run for 50 loops: 8 * 2.5 GB * 50 = 1 TB of data reads. No errors. WHAT?
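For the curious, a loop like that can be sketched in a few lines of Python. This is not the script I actually ran (that one leaned on the cygwin md5sum proggie); the names `md5_of` and `recheck` are made up for illustration, and it assumes the first hash of each file is the "good" one to compare against.

```python
import hashlib
import sys

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def recheck(paths, loops):
    """Hash every file `loops` times.

    The first hash seen for a file becomes its baseline (an assumption:
    if that first read was already corrupted, later good reads will be
    the ones flagged). Returns a list of
    (loop, path, expected, got) tuples for every mismatch.
    """
    baseline = {}
    mismatches = []
    for loop in range(loops):
        for path in paths:
            got = md5_of(path)
            expected = baseline.setdefault(path, got)
            if got != expected:
                mismatches.append((loop, path, expected, got))
    return mismatches

if __name__ == "__main__":
    # Pass the firmware files to check as command-line arguments.
    for loop, path, expected, got in recheck(sys.argv[1:], loops=50):
        print(f"loop {loop}: {path} expected {expected} got {got}")
```

An empty result only says the reads were consistent under that particular access pattern, which is exactly the trap I fell into.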

Now I let the script run and let Odin also do an MD5 check, making sure that both Odin and the (cygwin) md5sum proggie are simultaneously reading the same file.

They both fail their checks. SERIOUSLY? Independent **read** operations interfering with each other? WTF?

So finally I do what should have been done hours earlier: I reboot craptop lappy.

And it POSTs with a memory error at 0x00035648CE4 - approx 854 MB.

Ahhhh, it now all makes sense: the erratic nature of the problem depended on whether the file data traversed through read cache in the affected memory area. The files themselves are bigger than physical memory, but the exact pattern of memory usage depends on activity on the laptop. One checker running reads, read cache usage is one pattern; two running reads and it's a different pattern.

But that's not the end of the story, oh no!

I remember that craptop lappy has two SO-DIMMs of 1 GB each. One is under a door in the back, one is under the keyboard. Some disassembly required!

The idea is this: that memory error is in the first stick (854 out of 1024 MB). If I swap the sticks, the memory error will move to 1878 MB (854 + 1024). So long as the BIOS catches the error and "shortens" memory, I'll have a 1.8 GB craptop. If BIOS won't reliably detect the problem, I'll chuck the second SO-DIMM, and have a 1 GB WinXP craptop.
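The address arithmetic is easy to sanity-check. Here is a small Python sketch, assuming the two 1 GB sticks are mapped linearly one after the other (no interleaving, which I have not verified on this chipset); the names `STICK_MIB` and `describe` are just for illustration.

```python
MIB = 1024 * 1024
STICK_MIB = 1024          # each SO-DIMM is 1 GB, per the laptop's configuration

fault_addr = 0x35648CE4   # the address reported at POST

def describe(addr):
    """Return (offset in MiB, zero-based stick index).

    Assumes a simple linear mapping: stick 0 covers the first 1024 MiB,
    stick 1 the next 1024 MiB.
    """
    mib = addr / MIB
    return mib, int(mib // STICK_MIB)

mib, stick = describe(fault_addr)
print(f"fault at {mib:.1f} MiB, stick {stick}")

# If the bad cell travels with its stick, swapping the sticks should move
# the fault up by exactly one stick's worth of address space.
moved = fault_addr + STICK_MIB * MIB
print(f"after swap: {hex(moved)} = {moved / MIB:.1f} MiB")
```

Under those assumptions the fault sits at about 854 MiB in the first stick, and a swap should push it to about 1878 MiB.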

Before tearing anything down, I bust out an old copy of Knoppix (it has memtest86+ on it as an alternate boot), boot it up, and verify that yea, verily, I seem to have a hard memory fault at that exact address reported by BIOS - 0x00035648CE4.

All things considered, it could be worse (e.g. massive random errors all over the place). At least it's only in a single fixed location, right?

So I tear down craptop lappy and swap the two SO-DIMMs; reassemble and boot memtest86+, and get an error at a single location.

0x00035648CE4

AWWWWW Damn.

I don't have a memory problem; I have an address-pattern-dependent fault in the memory controller (rare), or a pattern-dependent fault on the physical bus (almost impossible).

I guess craptop lappy is headed for the dustbin.

Still to check - will lappy even boot with a SO-DIMM missing from the first slot? And if it does, will the memory controller fault still show up?

Hope you enjoyed the read. (Misery loves company)


from xda-developers http://ift.tt/2bDzWsB
via IFTTT
