Troubleshooting PC Hardware & ASUS Support

I bought an ASUS ROG STRIX 2080 TI as a Christmas present for myself in December of 2019; it was defective the moment I took it out of the box. Almost a year later, I finally have a working version of the GPU. This article details the problems I see in ASUS’s support and repair centers and also dives into some of the testing methodologies I used to determine whether components were good or bad.

I’ve bought, requisitioned, and recommended a lot of ASUS equipment in the last 25 years, from motherboards to netbooks, and have always been happy with the brand. This is the first ASUS GPU I’ve purchased for myself, and it is also my first luxury GPU purchase. I regret both decisions in hindsight, but live and learn, as they say.

The Timeline

Here is the full timeline of this saga:

  • 12/27/2019 – Bought ASUS ROG Strix 2080 TI (GPU)
  • 12/30/2019 – Received and installed the GPU. Most games crashed after an hour.
  • 07/22/2020 – Opened an RMA for the GPU.
  • 08/10/2020 – Received a bad replacement GPU. Opened a new RMA.
  • 10/23/2020 – Received a working replacement GPU.

My first mistake was not filing for an RMA sooner. I waited seven months because I saw so many identical problem reports on the internet. I really believed that it was a driver issue. Case in point: the recent RTX 30XX release, where cards were crashing on day one and a driver update fixed it all by underclocking the cards a couple of MHz.

ASUS Repair Center

I have two gripes with ASUS. The primary one is that they seem incapable of identifying a bad GPU. In my experience, one of the GPUs I received came off the store shelf while the other two came from a repair center.

There seem to be two distinct entities at ASUS, the Support center and the Repair center, and no information seems to flow between them. When I sent my two RMAs in, I included detailed information on how to reproduce the problem. Either this information was not passed along to the repair center or the repair center ignored it. As a result, the repair center sent me a bad GPU.

I even sent an email to ASUS Support, asking them to make sure that the repair center ran the test that I had outlined, both on my bad card AND on the card they were about to send me. But they replied that they could not contact the repair center and that I would have to wait until I received my RMA unit to see what repair steps were performed.

The same message accompanied both “repaired” units to indicate the extent of the testing and “repairing” done, and it was a clear sign that the repair center did not reproduce the problem.

The bottom line here is that the repair center is getting bad GPUs but thinks they are fine because they pass some 3DMark test. They then turn around and send these bad GPUs directly to people who had bad GPUs initially. The cycle repeats itself, and whether you get a good or a bad GPU back from your RMA is a roll of the dice.

Here are the 2 serial numbers of the ROG-STRIX-RTX2080TI GPUs that I know are bad but that the repair center tested as good:

  • KBYVCMKK005NPL8
  • KBYVCM01A765KXX

I hope you don’t get one of these cards. This brings me to my second gripe. The first replacement that the ASUS Repair Center sent me was shipped in a box whose height was less than the physical height of the GPU. The box arrived bulging, and where I expected to see a pristinely straight metal back-plate, I instead found one that was grossly bent outwards, as if bashed on the floor by a child.

The Symptom

The symptoms of my crash were as follows:

  • My screen would go black (no video output)
  • GPU fans would go to 100%
  • My PC would continue to function normally
  • Event Viewer would show a log entry: Display driver nvlddmkm stopped responding and has successfully recovered.

When this happened, I would have to reboot the system to recover (despite nvlddmkm claiming it had successfully recovered).
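
As an aside, you don’t have to click through Event Viewer to find that entry after a reboot; the built-in wevtutil will dump the most recent System-log events from a command prompt (the count of 20 below is arbitrary):

:: Print the 20 most recent System-log events, newest first, as plain text.
wevtutil qe System /c:20 /rd:true /f:text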

At first I would just reset the PC, but this eventually corrupted my boot drive. So after having to run chkdsk /f c: followed by sfc /scannow a few times, I figured I had to get smarter about this.

I used a tool called NirCmd to make a batch file that would reboot Windows. I made a shortcut to the batch file and mapped the shortcut to the hotkey Ctrl+Alt+4 so that I could safely survive these crashes without further damage to my data.

The batch file was simple; it took a screenshot of the desktop and then rebooted:

:: Save a screenshot of the desktop to a file.
C:\nircmd\nircmdc savescreenshot "C:\Users\zach\Desktop\reboot.png"
:: Ask Windows for a normal (non-forced) reboot.
C:\nircmd\nircmdc exitwin reboot

Reliably Reproducing the Problem

The first step in solving any problem is to reliably reproduce that problem. I tried various benchmarks like Unigine’s Heaven and 3DMark, games like Overwatch, Apex Legends, and Tomb Raider, and anything with ray tracing. I could not reliably reproduce the problem until I started using OCCT.

The Guaranteed Repro

Here is the shortest path to reproduction from my experience.

  1. Launch OCCT.
  2. Under Test Schedule, select the Power option.
  3. Press the Play button to begin the test.
  4. Wait a few minutes until the GPU crashes.

Ruling Out Hardware

What follows are some steps I took to try to rule out specific pieces of hardware. Perhaps they will help other people who are looking to do the same.

Power Supply

Cables

One warning I kept seeing on the internet was not to use a single power cable to connect both 8-pin connectors on the 2080 TI; the recommendation was to use two separate cables so that no single cable carries too much current.

I had been using this two-cable configuration from day one. But I wondered: what if my cables are actually bad? So I got my multimeter out and measured the resistance of each pin in both cables. If there were a damaged pin or wire, I would see a measurable resistance on the meter. Everything read a perfect 0, so I abandoned those suspicions.

Load

Overall power load was high on my suspicion list, based mostly on ignorance and a lack of data. Was my 850W EVGA being asked to do too much? I set out to answer that question with actual data. There are a lot of devices that let you look at your power draw for pretty cheap, like the Kill A Watt device. I measured the power draw of my PC at the wall while running the OCCT benchmark and saw a maximum of 475 watts. Hmm, a lot lower than I expected.

My power supply is 80+ Gold Certified, which means it runs at better than 80% efficiency at rated load. The general rule of thumb I hear is to size your power supply so there is 20% of headroom. In my case, I had roughly 44% headroom to spare (1 – 475/850), and since the 475 watts was measured at the wall, the actual DC load on the supply was lower still once efficiency is accounted for. So power load was not a problem here.
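
A wall meter sees the whole system; if you also want the GPU’s own view of its board power, the NVIDIA driver exposes that through nvidia-smi. This is just a convenient cross-check, not something the conclusion depends on:

:: Report the GPU's current board power draw alongside its configured power limit.
:: nvidia-smi.exe ships with the NVIDIA driver.
nvidia-smi --query-gpu=name,power.draw,power.limit --format=csv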

12V Rail

Another thing I read on the internet was that for PSUs with dual rails, you might want to run in single-rail mode to improve stability. My PSU is a single-rail unit, but I wondered: what if the voltage on that rail is unstable and falls outside the acceptable thresholds? Perhaps that could explain a crash.

Based on some research into PCI-Express power connectors, I found that the 8-pin connector pulls power off the 12V rail and has a maximum power draw of 150W.

Additionally, according to EVGA, the acceptable voltage threshold for the 12V rail should be in the range of 11.8V – 12.4V.

I ran OCCT again, but this time paid very close attention to the Voltages tab as the benchmark ran. I confirmed that leading up to and at the moment of the crash, the voltage on my 12V rail was perfectly within specification.

This data is available in other apps as well, but OCCT makes it easy to see. Here you can see both the +12V rail on the motherboard and the 12V input on the GPU:

12V rail on the motherboard with 3 columns showing current value, min and max
12V input on the GPU with 3 columns showing current value, min and max

In conclusion, the power supply was fine according to every measurable metric.

Cooling

Was the GPU getting too hot? Oddly enough, I noticed a trend across all of my tests: the crash always occurred when the GPU temperature was around 62 C. That is an extremely low temperature, but interestingly it is also the temperature where the default fan curve would spin the fans up to a higher RPM.

Default fan curve for 2080 TI

So the question is, is the GPU crashing at a specific temperature? To answer this I changed the default fan curve to look like this.

Adjusting the default fan curve so that the GPU gets hotter faster.

Running the same test, the card reached 65 C before crashing. Hmm, curious.

Ok, what if I run with fans set to maximum so that it never reaches that temp?

Adjusting fan curve so fans are always 100%

Now the card crashed at 55 C… Ok, this does not seem to have anything to do with the temperature of the GPU.
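
If you want a timestamped record of exactly where your own card falls over, nvidia-smi can log sensor readings to a CSV file that survives the reboot. A minimal sketch; the output path is just an example:

:: Log a timestamped snapshot of GPU temperature, power draw, and core clock
:: once per second until interrupted (or until the machine reboots).
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,clocks.current.graphics --format=csv -l 1 -f C:\gpu_log.csv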

Was anything else in my system hot? Not according to the graphs; temps were all extremely low. The interior of the case stayed around 23 C, with the CPU maxing out at 54 C.

The ambient room temperature in my office is 20 C and the cooling setup on my rig is:

  • Front Case Push – 1 x Silent Wings 120mm
  • Side Case Push – 2 x Silent Wings 140mm (directly onto GPU)
  • Top Case Pull – 2 x Silent Wings 140mm
  • Rear Case Pull – 1 x Silent Wings 120mm

GPU

I had an EVGA GTX 1070 as well as a GTX 1050 on hand to test with. I could put either of these cards in and run the same tests just fine for hours at a time. Really, this should have been all of the testing I needed to do.

To rule out clocking issues, ASUS support recommended running the GPU in silent mode, which did not change the outcome of the test. I went a step further and underclocked the card as far as humanly possible.

Underclocking the GPU

No change.
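
For reference, if you would rather script an underclock than drag sliders around in a GUI tuning tool, recent NVIDIA drivers let you pin the core clock range from an elevated command prompt. The clock values below are only an example, and support for this option varies by GPU generation and driver version:

:: Pin the GPU core clock to a low, fixed range (values are in MHz).
nvidia-smi --lock-gpu-clocks=300,1000

:: Restore the default clock behaviour when finished.
nvidia-smi --reset-gpu-clocks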

RAM

If my RAM were bad, I’d expect to be getting the occasional blue screen of death. But I humored myself by running the built-in Windows memory test overnight. It couldn’t find any errors.
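
That built-in test is the Windows Memory Diagnostic; if you want to queue it without hunting through menus, a single command does it, and the test runs before Windows loads at the next restart:

:: Open the Windows Memory Diagnostic scheduler; it offers to run the test at the next restart.
mdsched.exe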

Maybe the 2080 TI is really sensitive to memory timings, I thought. So I disabled the XMP profile in my BIOS and ran the memory at stock timings with no overclock.

No change.

To be thorough, I thought I should go down to a single RAM stick, which meant testing all four sticks to be sure: no change, no change, no change, no change. So either none of my RAM is bad, or all of it is bad in a way so subtle that it is impossible to reproduce outside of having this specific GPU.

CPU

Is my CPU overclocked? No. Ok, what else could be wrong with it? If the CPU were bad, the PC either wouldn’t power on or Windows would blue screen frequently. Have I compiled Unreal Engine 4 from source every single day for the last four years without any problems on this CPU? Yes. It’s not the CPU.

Motherboard

What if it’s a problem with my PCI-E slot? Let’s try the other slot. We can’t; the card only fits in the slot closest to the CPU because of the case design. Ok, do we take a Dremel tool to the case? Do we try to set up the motherboard on a table so we can plug into the other slots?

Well, it just so happened I was a week out from building a computer for someone else. So I eventually had a new motherboard, CPU, RAM and 1000W power supply to test with.

No change whatsoever, and this time it was an ASUS motherboard with RAM and a CPU model on its qualified vendor list. So there weren’t any components left to test that could put the blame on anything but the GPU.

Software Environment

Every time I am asked to “try reinstalling Windows” it really irks me. Reinstalling Windows should never be a step in solving any problem; it’s a basic admission that someone has no clue what the problem could be.

If a program or driver is interfering with a piece of hardware, you get a memory dump, identify the problem, and fix it. It’s software engineering 101. You don’t stick your head in the sand and hope that a problem you haven’t tried to identify won’t happen again.
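
For anyone who has never done that kind of triage, the first pass really is a one-liner with cdb, the console debugger that comes with the Debugging Tools for Windows. A sketch only; the dump path is a placeholder, and minidumps normally land in C:\Windows\Minidump:

:: Pull symbols from Microsoft's symbol server, print an automated crash analysis, then quit.
cdb -z C:\Windows\Minidump\example.dmp -c ".symfix; .reload; !analyze -v; q"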

However, because ASUS support asked me specifically to try it, I wanted to be able to honestly say that I did. So I made a stupid USB Win 10 installer, grabbed a crappy old non-solid-state SATA drive lying in my recycle bin, installed a fresh copy of Win 10, and reproduced the problem within minutes, cursing under my breath the whole time at how pointless this exercise was.

I feel the same way about driver cleaners. Either a driver is being used or it is not; just having the drivers on your disk does absolutely nothing. To be clear, ASUS did not ask me to use a driver cleaner; I’ve just gotten myself worked up here on a tangent.

Conclusion

I wasted several weeks of effort running tests when all of my instincts told me the problem was a bad GPU from day one. Being sent another bad GPU by ASUS threw me off my game and wasted more time as I reproduced all of these tests a second time, eventually culminating in testing on entirely different hardware. Finally, after the second RMA, my testing was vindicated and the GPU I received was stable.

No one should have to go through this kind of effort to get a working piece of hardware. I am shocked at the lack of quality control that ASUS exhibited in both the retail hardware and their repair center operations.

ASUS Support was responsive and friendly enough, and they picked up the shipping cost the second time around. But I am not willing to go through this exercise again, so I’ll be avoiding ASUS GPUs from now on.
