Problems with the PowerEdge 2970

dell, dell-poweredge, pci

The company I work for just bought 3 PowerEdge 2970 servers and they all have the same problem.

  1. Is this server worth buying, or do the problems that come with it make it not worth it?
  2. Are there a lot of issues with using AMD processors (it's an Opteron)?
  3. Are you guys able to pinpoint the problem if I give details on which errors I get in the event logs?

Here is the problem:

1. Power on the server. It boots up to the Red Hat splash screen.
2. In the middle of boot-up the server crashes with the following errors:

-CPU Machine Chk: processor sensor, transition to non-recoverable was asserted
-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 1 FUNC 0)

Then I tried updating the BIOS and the BMC, but the problem was still there.
After that I tried to update the OS (it had Red Hat Enterprise 5.1) to Red Hat 5.3.
There was something odd there too. I booted the server with the Build and Update Utility, then selected Install OS. I selected Red Hat Enterprise 5.3 x86_64. It asked me for the x86_64 media, so I put in the disc labeled "supplementary disc 1 of 1 for 64-bit AMD64 and Intel 64". It said wrong disc. So then I used the disc labeled "installation disc 1 of 1 for 64-bit Intel Itanium". My guess is that's the disc I needed to use all along.

After this the system was able to boot up to the command-line login screen. I logged in and typed startx to get into the GUI environment. At that point less than a page of text scrolled by quickly and the server crashed without showing anything GUI-related.

At that point I had 2 different errors (notice the device is 4 now, gonna check which device it is):

-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 4 FUNC 0)
-PCI System Error: critical event sensor, PCI SERR (BUS 0 DEVICE 4 FUNC 0)
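
For checking which device that is, here is a minimal sketch that turns the "BUS x DEVICE y FUNC z" part of a log line into the bus:device.function form that lspci uses. The function name `pci_address` is just for illustration, and it assumes the numbers in the event log are decimal (lspci shows them in hex), which I have not confirmed for this BMC.

```python
import re

# Matches the location part of the event log lines quoted above,
# e.g. "PCI PERR (BUS 0 DEVICE 4 FUNC 0)".
LOCATION = re.compile(
    r"\(\s*BUS\s+(\d+)\s+DEVICE\s+(\d+)\s+FUNC\s+(\d+)\s*\)",
    re.IGNORECASE,
)


def pci_address(log_line):
    """Return an lspci-style "bb:dd.f" string, or None if the line has no location.

    Assumption: the log prints bus/device/function in decimal.
    """
    match = LOCATION.search(log_line)
    if match is None:
        return None
    bus, device, func = (int(group) for group in match.groups())
    # lspci addresses are hexadecimal: two digits for bus and device, one for function.
    return f"{bus:02x}:{device:02x}.{func:x}"


if __name__ == "__main__":
    lines = [
        "-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 1 FUNC 0)",
        "-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 4 FUNC 0)",
    ]
    for line in lines:
        print(pci_address(line))
```

If the OS stays up long enough, the resulting address can be passed to `lspci -s 00:04.0 -v` to see what sits at that slot.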

So today the tech guy came with a bunch of parts and basically rebuilt the server on site (PCI riser, motherboard, DIMMs, a SAS card and something else I can't remember off the top of my head), but after that the problems were even worse. Some of the errors were (mind you, at that point he was putting back some of the original parts, so things got messy):

ECC uncorr Err: memory sensor, uncorrectable ECC (DIMM1 DIMM2) was asserted.
E1231 1.2V HT core power GD
E1911 <3 ERRORS check log
E1000 failsafe

Tomorrow he is coming back with a power supply…

UPDATE: Seems like I can't waste any more time on this. We are calling the sales people and asking for new servers.

Best Answer

I have run into similar problems with Dell of late. The tech support doesn't seem to be able to directly associate the errors with the failed part. A lot of the time they just send out what I like to call "The I Have No Idea What's Wrong Parts Pack". It usually consists of a system board, PCI riser, replacement memory, and sometimes a replacement CPU and RAID controller.

One thing they often forget to replace is the riser for the integrated PERC card. And I have seen that be the issue a few times.

Anyway, as I commented before, unless you are in a real rush to deploy these servers, I would contact Dell customer care and demand that all three servers be replaced or refunded.
