linux-users archive

Re: APIC error on CPU0



From: Dan Pritts (email_suppressed_at_lugwash.org)
Date: Sun 11-Feb-2007 12:53:16 AM EST


Thanks for the advice Joe.

long reply here. I tried to split it up.

i686
----
first, while it didn't completely solve the problem, installing an
actual i686 kernel (I was still running an i586 build required by my
older c3) made the errors much less frequent.  But i did get three
of the messages during about 2.5 hours of heavy disk i/o (rebuilding a
raid5 that "lost" a disk when i rebuilt the system; said disk ended up
on hdk instead of hdc).  I halted before the raid5 rebuild completed
since i don't trust it.

msi
---
While it is new out of the box, this is not a brand-new motherboard,
it's roughly 2004 vintage, so i didn't expect to have trouble with
the centos kernel.  I wanted to reuse an older athlon & ram I had,
and had limited choices buying new.  Probably should have spent
more time looking on ebay.

i hadn't been paying attention to this space so hadn't heard of MSI
(message signaled interrupts).  However, from what i can tell i'm not
using it:

  [root_at_nt650 init.d]# cat /proc/interrupts
             CPU0       
    0:    4938976    IO-APIC-edge  timer
    1:        279    IO-APIC-edge  i8042
    8:          1    IO-APIC-edge  rtc
    9:          0   IO-APIC-level  acpi
   14:     453849    IO-APIC-edge  ide0
   15:      44133    IO-APIC-edge  ide1
  169:     563079   IO-APIC-level  ide2, ide3
  177:     528371   IO-APIC-level  ide4, ide5
  185:          8   IO-APIC-level  libata
  193:      18728   IO-APIC-level  eth0
  NMI:          0 
  LOC:    4938664 
  ERR:          9
  MIS:          0
  
an lwn.net article suggests that i'd see MSI here if i were using it.
correct?

I'll try a new kernel and see if that fixes the problem.
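One way to double-check the MSI question is to scan the interrupt table for a controller type containing "MSI". The sketch below is a minimal one, assuming the 2.6-era layout shown above (IRQ label, one count per CPU, controller type, device names) and assuming that an MSI interrupt would be labelled with something like PCI-MSI in the type column. The function name summarize_interrupts and the trimmed sample are illustrative only.

```python
# Trimmed copy of the /proc/interrupts output quoted above.
SAMPLE = """\
           CPU0
  0:    4938976    IO-APIC-edge  timer
 14:     453849    IO-APIC-edge  ide0
169:     563079   IO-APIC-level  ide2, ide3
193:      18728   IO-APIC-level  eth0
NMI:          0
LOC:    4938664
ERR:          9
MIS:          0
"""


def summarize_interrupts(text):
    """Return (uses_msi, err_count) for /proc/interrupts-style text."""
    uses_msi = False
    err_count = None
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the CPU header row has no colon
        label = label.strip()
        fields = rest.split()
        if label == "ERR" and fields:
            err_count = int(fields[0])
        elif label.isdigit():
            # Skip the per-CPU counts; the next field is the controller type.
            # Assumption: MSI interrupts carry "MSI" in that type field.
            kinds = [f for f in fields if not f.isdigit()]
            if kinds and "MSI" in kinds[0]:
                uses_msi = True
    return uses_msi, err_count


if __name__ == "__main__":
    # On a live system, read the real file instead of SAMPLE:
    #   text = open("/proc/interrupts").read()
    print(summarize_interrupts(SAMPLE))
```

The ERR count is returned alongside the MSI flag so the same check can be rerun after a kernel change to see whether the error counter is still climbing.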

workarounds?
------------
Assume that my problem is APIC related.  Do you think there's anything
I can do (manually assign interrupts, change slots, etc.) that might help?

rebooting without initializing the raid5, i don't get any of the
errors, which suggests to me that the problem is related to these
Promise controllers.

misc
----
FWIW (I realize it's not directly related) the board did pass 12
hours with memtest86.  and in case it's not obvious i am not
overclocking.  I considered underclocking, actually.

So i just found this on the LKML faq, which first says "it's only
a problem when there are many of them" and then says "even a few
failures is a cause for concern".  WTF.

  # Why does my kernel report lots of "APIC error" messages?

    * (REG, contributed by Mark Hahn) You may get messages like:

          APIC error on CPU1: 00(08).

      APIC is the hardware that ia32 systems use to communicate
      between CPUs to handle low-level events like interrupts and TLB
      flushes. APIC messages are checksummed, and automatically retried
      when they fail. This message indicates that a transaction failed;
      it's only a problem when there are many of them. The APIC checksum
      is quite weak, so even a few failures is a cause for concern, since
      it implies that some corruption has likely gone undetected.

      Assuming you're not forcing your motherboard to use an invalid
      system clock (i.e. AGP other than 66 MHz), this is strictly a
      physical design flaw in your motherboard. The Abit BP6 is notorious
      for this flaw, but it's not unheard of on other boards (such as the
      Gigabyte BXD), and it's possible on any board that uses APICs.

      You can force the kernel not to use APIC like this with the
      "noapic" kernel option. This also forces CPU0 to handle all interrupts.
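On what the numbers in these messages mean: as far as I can tell from the kernel's APIC error handler, the two hex values are the local APIC's Error Status Register as read before and after the handler writes to it, and each bit names one class of failure. The sketch below decodes a message on that assumption. The bit names follow the comment in the kernel's APIC error handler, and decode_apic_error is an illustrative name, not an existing tool.

```python
import re

# Error Status Register bits, per the comment in the Linux APIC error handler.
ESR_BITS = {
    0: "send checksum error",
    1: "receive checksum error",
    2: "send accept error",
    3: "receive accept error",
    4: "reserved",
    5: "send illegal vector",
    6: "received illegal vector",
    7: "illegal register address",
}

# Matches lines such as "APIC error on CPU0: 40(40)".
MESSAGE_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]+)\(([0-9a-fA-F]+)\)"
)


def decode_esr(value):
    """List the names of the ESR bits that are set in value."""
    return [name for bit, name in sorted(ESR_BITS.items()) if value & (1 << bit)]


def decode_apic_error(message):
    """Return (cpu, errors_before, errors_after) for one dmesg line."""
    match = MESSAGE_RE.search(message)
    if match is None:
        raise ValueError("not an APIC error message: %r" % message)
    cpu = int(match.group(1))
    # Assumption: first value is the ESR before the handler's write,
    # the value in parentheses is the ESR read back afterwards.
    before = int(match.group(2), 16)
    after = int(match.group(3), 16)
    return cpu, decode_esr(before), decode_esr(after)


if __name__ == "__main__":
    print(decode_apic_error("APIC error on CPU0: 40(40)"))
```

On that reading, the 40(40) in this thread would be a "received illegal vector" error, and the FAQ's 00(08) example a "receive accept error".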

On Sat, Feb 10, 2007 at 05:20:53PM -0500, Joe Landman wrote:
> Hi Dan
> 
> Dan Pritts wrote:
> >i appear to have bought a crap motherboard.
> 
> This often happens when people buy cheap.  Motherboards with broadcom 
> NICs can bring a grown admin to tears when they try to actually use the 
> network near its rated speed.  I just love those interrupt storms.  And 
> there is nothing you can do about it, apart from buying a better MB, or 
> in the case of networks, buy a good NIC.
> 
> >
> >I'm now getting this message pretty frequently in dmesg output:
> >
> > APIC error on CPU0: 40(40)
> >
> >Booting with noapic breaks the ethernet card.
> >
> >I'm running the latest bios for the thing. (epox 8krai-pro).  
> >
> >kernel 2.6.9-42.0.3.plus.c4 (centos plus rebuild of redhat kernel,
> >but with most options that redhat turns off re-enabled).
> >
> >I've disabled all the non-essential devices (parallel port, USB, 
> >sound, and removed my video capture card) and it hasn't happened
> >as much.  But it still happens.  I also get one of these at boot
> >time:
> >
> > APIC error on CPU0: 00(40)
> 
> APIC is an interrupt controller.  The 2.6.9 kernels are ancient, and may 
> not support (often do not) later chipsets with MSI support, and do IRQ 
> routing wrong.  Try booting with
> 
> 	nomsi
> 
> or
> 
> 	pci_nomsi
> 
> or
> 
> 	pci=biosirq
> 
> Centos is nice.  It just doesn't support late model hardware very well 
> (as it is not designed to).
> 
> >Googling for this error message gets roughly one billion hits,
> >but none that i've looked at explains what this error actually *is*.
> 
> Odds are that it is interrupt routing.  Some of the distros never got it 
> right, in part because of the kernels they use.  We have had better luck 
> on modern motherboards with kernels after 2.6.17.  We still need to 
> disable MSI as it is hopelessly mucked up on pretty much all hardware we 
> have tried.
> 
> >
> >FWIW the system hasn't crashed but this makes me nervous.
> 
> Interrupts which aren't handled properly are a good reason to be nervous.
> 
> >
> >so, can anyone tell me what this really means?
> 
> During bios setup, your IRQs weren't routed, and the OS didn't route 
> them correctly either.  Likely you have an MSI capable system (all the 
> rage for those big PCIe video/RAID cards), and Linux did not set up MSI 
> properly, or you happen to have a non-MSI compatible card (or 
> unsupported device) in there somewhere.
> 
> Joe
> 
> >
> >danno
> >--
> >dan pritts
> >[e-mail suppressed]
> >734-929-9770
> >--
> >***  Sent from [e-mail suppressed]  ***  http://www.lugwash.org
> >to unsubscribe: `echo "unsubscribe" | mail [e-mail suppressed]`
> 
> -- 
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: [e-mail suppressed]
> web  : http://www.scalableinformatics.com
> phone: +1 734 786 8423
> fax  : +1 734 786 8452
> cell : +1 734 612 4615

danno
--
dan pritts
[e-mail suppressed]
734-929-9770


This archive was generated by hypermail 2.1.5 : Thu 01-Mar-2007 01:00:01 AM EST