214789 – ehci-hcd.c ISR

Bug 214789 - ehci-hcd.c ISR

Summary: ehci-hcd.c ISR

Status:	NEW

Alias:	None

Product:	Drivers
Classification:	Unclassified
Component:	USB
Hardware:	x86-64 Linux

Importance:	P1 high
Assignee:	Default virtual assignee for Drivers/USB

URL:
Keywords:

Depends on:
Blocks:

Reported:	2021-10-21 04:29 UTC by Scott Arnold
Modified:	2021-12-06 21:44 UTC
CC List:	1 user

See Also:
Kernel Version:	5.11+
Subsystem:
Regression:	No
Bisected commit-id:

Description Scott Arnold 2021-10-21 04:29:42 UTC

The change in ehci_irq from spin_lock_irqsave/spin_unlock_irqrestore to spin_lock/spin_unlock broke shared IRQs.

Comment 1 Greg Kroah-Hartman 2021-10-21 05:56:13 UTC

On Thu, Oct 21, 2021 at 04:29:42AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> Change in ehci_irq from spin_lock_irqsave/irqrestore to spin_lock/unlock
> broke
> shared IRQ's

What shared irq broke exactly?  For what hardware platform?  And what
kernel version worked for you and now does not work?

thanks,

greg k-h

Comment 2 Scott Arnold 2021-10-21 15:42:33 UTC

Hello,
The driver is for a Symmetricom bc635pcie Time and Frequency Processor. It was developed in house for ISS simulations.
It works fine on 5.3 and earlier kernels.
It works fine with 5.11 when the IRQ is not shared. With a shared IRQ it receives no interrupts.
The machine is an HP DL580 G9 with 72 Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz.
We only have a few of these machines; the rest are G10s and work fine (they do not share IRQs with the timing cards).
I am having the card moved to another slot to try to get an unshared IRQ.
Unfortunately MSI does not seem possible with this card.
Thanks
Scott

Comment 3 Alan Stern 2021-10-21 16:02:35 UTC

Can you provide more information about exactly what goes wrong?  And why shared IRQs should make any difference?

Comment 4 Scott Arnold 2021-10-21 16:13:23 UTC

Hello,
The timing card is not receiving interrupts when it shares an IRQ (16) with the ehci-hcd driver on the 5.11.17 kernel (I also tried 5.14.13).
I changed spin_lock/spin_unlock back to spin_lock_irqsave/spin_unlock_irqrestore in ehci_irq in the 5.14.13 kernel, and the timing card now gets interrupts as expected.
The 5.3 kernel (and probably earlier ones, although I have not checked) uses the irqsave/irqrestore versions of the spinlock calls in ehci_irq.
From 5.3:
        /*
         * For threadirqs option we use spin_lock_irqsave() variant to prevent
         * deadlock with ehci hrtimer callback, because hrtimer callbacks run
         * in interrupt context even when threadirqs is specified. We can go
         * back to spin_lock() variant when hrtimer callbacks become threaded.
         */
        spin_lock_irqsave(&ehci->lock, flags);

Thanks
Scott
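
For reference, a minimal userspace sketch of the difference between the two locking variants, under the assumption that only the local interrupt-enable flag matters here. The names model_irqsave, model_irqrestore, section_plain and section_irqsave are hypothetical stand-ins, not kernel APIs. It illustrates that when the handler is entered with interrupts already disabled, the irqsave variant leaves the flag exactly as it found it, so the two variants would be expected to behave the same:

```c
#include <stdbool.h>

/* Userspace stand-in for the CPU's local interrupt-enable flag. */
static bool irqs_enabled = true;

/* Model of the "irqsave" half: remember the flag, then disable interrupts. */
static unsigned long model_irqsave(void)
{
    unsigned long flags = irqs_enabled ? 1UL : 0UL;

    irqs_enabled = false;
    return flags;
}

/* Model of the "irqrestore" half: put the flag back as it was found. */
static void model_irqrestore(unsigned long flags)
{
    irqs_enabled = (flags != 0);
}

/* Plain spin_lock()-style section: the flag is not touched at all.
 * Returns the interrupt-enable state seen inside the section. */
static bool section_plain(void)
{
    return irqs_enabled;
}

/* spin_lock_irqsave()-style section: interrupts are off inside,
 * and the previous state is restored on the way out. */
static bool section_irqsave(void)
{
    unsigned long flags = model_irqsave();
    bool inside = irqs_enabled;

    model_irqrestore(flags);
    return inside;
}
```

Starting from irqs_enabled == false (the normal state inside a hard interrupt handler), both section_plain() and section_irqsave() observe interrupts disabled and leave the flag unchanged. The variants only differ when the section is entered with interrupts enabled.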

Comment 5 Alan Stern 2021-10-21 19:27:49 UTC

Okay, but _why_ don't the timing card's interrupts get handled when ehci_irq uses spin_lock?  And _why_ does changing to spin_lock_irqsave make a difference?

Do all of the card's interrupt requests get lost or only some of them?

Are you somehow getting recursive (nested) interrupts for the same IRQ line?

Is ehci_irq somehow getting called with interrupts enabled?

I don't want to make any changes to the driver until we know the answers to these questions.

Comment 6 Scott Arnold 2021-10-21 19:36:26 UTC

Hello,
I don't know why it makes a difference, but according to /proc/interrupts IRQ 16 gets about 90 interrupts and then stops. When working properly, the card generates 240 interrupts/second.
It works fine with irqsave/irqrestore in the ehci-hcd ISR.
ehci-hcd is built in, not a module, in our configuration.
I am having the card moved to another slot now.

Thanks
Scott
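
The stalled count can be checked by sampling /proc/interrupts twice and comparing the totals. A minimal sketch, assuming the usual layout of an IRQ label followed by one decimal counter per CPU; irq_line_total and irq_rate are hypothetical helper names:

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line.
 * `label` is the IRQ name without the colon, e.g. "16".
 * Returns 0 on success, -1 if the line belongs to another IRQ
 * or carries no counters.
 */
static int irq_line_total(const char *line, const char *label,
                          unsigned long long *total)
{
    size_t n = strlen(label);
    unsigned long long sum = 0;
    int seen = 0;
    const char *p;

    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, label, n) != 0 || line[n] != ':')
        return -1;

    p = line + n + 1;
    for (;;) {
        char *end;
        unsigned long long v;

        while (isspace((unsigned char)*p))
            p++;
        if (!isdigit((unsigned char)*p))
            break;                      /* reached the chip name column */
        v = strtoull(p, &end, 10);
        if (*end != '\0' && !isspace((unsigned char)*end))
            break;                      /* token such as "16-fasteoi" */
        sum += v;
        seen = 1;
        p = end;
    }
    if (!seen)
        return -1;
    *total = sum;
    return 0;
}

/* Interrupts per second between two samples taken `seconds` apart. */
static double irq_rate(unsigned long long before, unsigned long long after,
                       double seconds)
{
    if (seconds <= 0.0 || after < before)
        return 0.0;
    return (double)(after - before) / seconds;
}
```

A healthy card at 240 interrupts/second would show a total that grows by roughly 240 between samples taken one second apart; a total that stays at about 90 gives a rate of 0.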

Comment 7 Scott Arnold 2021-10-21 19:37:56 UTC

Anything you want me to try while I have the hardware available?

Comment 8 Alan Stern 2021-10-21 19:46:20 UTC

How many of those 90 interrupts were issued by the EHCI host controller as opposed to the card?

Are there any USB devices attached to the host controller?  What does /sys/kernel/debug/usb/devices have to say?

Does the card use edge-triggered interrupts rather than level-triggered?

Have you tried adding any debugging messages to ehci_irq to find out what's going on when it runs?

Comment 9 Scott Arnold 2021-10-21 20:04:06 UTC

Hello,
The timing card driver receives no interrupts according to debug.
IRQ16:	IO-APIC   16-fasteoi   ehci_hcd:usb1, hpilo, rt_pcclk

Ehci_hcd entries in /sys/kernel/debug/usb/devices:

T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480  MxCh= 2
B:  Alloc=  0/800 us ( 0%), #Int=  4, #Iso=  0
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1d6b ProdID=0002 Rev= 5.11
S:  Manufacturer=Linux 5.11.17_OBCS_1 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=0000:00:1a.0
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

T:  Bus=02 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480  MxCh= 2
B:  Alloc=  0/800 us ( 0%), #Int=  0, #Iso=  0
D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=1d6b ProdID=0002 Rev= 5.11
S:  Manufacturer=Linux 5.11.17_OBCS_1 ehci_hcd
S:  Product=EHCI Host Controller
S:  SerialNumber=0000:00:1d.0
C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

I have not tried enabling debug in ehci_hcd yet.
Moving the card to another slot did not help; it still shares IRQ 16.

Thanks
Scott

Comment 10 Johan Hovold 2021-10-22 08:53:32 UTC

On Thu, Oct 21, 2021 at 08:04:06PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214789
> 
> --- Comment #9 from Scott Arnold (scott.c.arnold@nasa.gov) ---
> Hello,
> The timing card driver receives no interrupts according to debug.
> IRQ16:  IO-APIC   16-fasteoi   ehci_hcd:usb1, hpilo, rt_pcclk

So the IRQ is also shared with some (mainline) HP management-processor
driver (hpilo).

> Ehci_hcd entries in /sys/kernel/debug/usb/devices:
> 
> T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480  MxCh= 2
> B:  Alloc=  0/800 us ( 0%), #Int=  4, #Iso=  0
> D:  Ver= 2.00 Cls=09(hub  ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
> P:  Vendor=1d6b ProdID=0002 Rev= 5.11
> S:  Manufacturer=Linux 5.11.17_OBCS_1 ehci_hcd
> S:  Product=EHCI Host Controller
> S:  SerialNumber=0000:00:1a.0
> C:* #Ifs= 1 Cfg#= 1 Atr=e0 MxPwr=  0mA
> I:* If#= 0 Alt= 0 #EPs= 1 Cls=09(hub  ) Sub=00 Prot=00 Driver=hub
> E:  Ad=81(I) Atr=03(Int.) MxPS=   4 Ivl=256ms

Just to be clear: Are there any USB devices physically connected to this
bus?

What does "lsusb -s1:" say?

> I have not tried enabling debug in ehci_hcd yet.

Try adding

	WARN_ON_ONCE(!irqs_disabled());

at the start of ehci_irq() before grabbing the lock.

That should give us a stack dump in case there's someone calling the
interrupt handler with interrupts enabled (which seems to be the case).

Johan

Comment 11 Johan Hovold 2021-10-25 13:19:47 UTC

[ Adding back bugzilla and linux-usb on CC. ]

On Fri, Oct 22, 2021 at 07:43:04PM +0000, Arnold, Scott C. (JSC-CD13)[SGT, INC] wrote:
> I added the WARN_ON_ONCE(!irqs_disabled()); at the beginning of ehci-irq
> before the lock and did not notice anything.

Ok, so interrupts are already disabled as they should be.

> However after looking at the logs I discovered:
> 
> [    5.189043] irq 16: nobody cared (try booting with the "irqpoll" option)
> [    5.189112] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.14.13_OBCS_1 #4
> [    5.189180] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 01/22/2018
> [    5.189261] Call Trace:
> [    5.189324]  <IRQ>
> [    5.189381]  ? dump_stack_lvl+0x33/0x42
> [    5.189445]  ? __report_bad_irq+0x32/0xac
> [    5.189505]  ? note_interrupt.cold.11+0xa/0x63
> [    5.189562]  ? handle_irq_event_percpu+0x65/0x70
> [    5.189623]  ? handle_irq_event+0x32/0x50
> [    5.189681]  ? handle_fasteoi_irq+0xa1/0x160
> [    5.189740]  ? __common_interrupt+0x3c/0xa0
> [    5.189798]  ? common_interrupt+0x7a/0xa0
> [    5.189859]  </IRQ>
> [    5.189913]  ? asm_common_interrupt+0x1b/0x40
> [    5.189973]  ? mwait_idle+0x50/0x70
> [    5.190031]  ? default_idle+0x10/0x10
> [    5.190088]  ? default_idle_call+0x1f/0x30
> [    5.190147]  ? do_idle+0x1df/0x1f0
> [    5.190207]  ? cpu_startup_entry+0x14/0x20
> [    5.190267]  ? start_kernel+0x616/0x63d
> [    5.190328]  ? secondary_startup_64_no_verify+0xb0/0xbb
> [    5.190388] handlers:
> [    5.190442] [<00000000da7aaaea>] usb_hcd_irq
> [    5.190504] Disabling IRQ #16
> 
> [    5.201827] irq 23: nobody cared (try booting with the "irqpoll" option)
> [    5.201885] CPU: 1 PID: 8 Comm: kworker/u145:0 Not tainted 5.14.13_OBCS_1 #4
> [    5.201942] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 01/22/2018
> [    5.202010] Workqueue: events_unbound async_run_entry_fn
> [    5.202069] Call Trace:
> [    5.202119]  <IRQ>
> [    5.202168]  ? dump_stack_lvl+0x33/0x42
> [    5.202223]  ? __report_bad_irq+0x32/0xac
> [    5.202277]  ? note_interrupt.cold.11+0xa/0x63
> [    5.202332]  ? handle_irq_event_percpu+0x65/0x70
> [    5.202386]  ? handle_irq_event+0x32/0x50
> [    5.202441]  ? handle_fasteoi_irq+0xa1/0x160
> [    5.202495]  ? __common_interrupt+0x3c/0xa0
> [    5.202548]  ? common_interrupt+0x7a/0xa0
> [    5.202603]  </IRQ>
> [    5.202652]  ? asm_common_interrupt+0x1b/0x40
> [    5.202707]  ? inflate_fast+0x118/0x5e0
> [    5.202764]  ? zlib_inflate+0x3d1/0x1770
> [    5.202817]  ? do_copy+0xed/0x109
> [    5.202869]  ? write_buffer+0x22/0x32
> [    5.202921]  ? initrd_load+0x268/0x268
> [    5.202975]  ? write_buffer+0x32/0x32
> [    5.203026]  ? __gunzip+0x244/0x310
> [    5.203083]  ? decompress_method+0x3c/0x3c
> [    5.203137]  ? initrd_load+0x268/0x268
> [    5.203190]  ? gunzip+0xe/0x11
> [    5.203243]  ? initrd_load+0x268/0x268
> [    5.203296]  ? unpack_to_rootfs+0x14f/0x285
> [    5.203349]  ? initrd_load+0x268/0x268
> [    5.203402]  ? do_populate_rootfs+0x6c/0x160
> [    5.203455]  ? async_run_entry_fn+0x1b/0xa0
> [    5.203508]  ? process_one_work+0x1d1/0x330
> [    5.203563]  ? worker_thread+0x28/0x3d0
> [    5.203615]  ? mod_delayed_work_on+0x90/0x90
> [    5.203668]  ? kthread+0x120/0x150
> [    5.203720]  ? set_kthread_struct+0x30/0x30
> [    5.203773]  ? ret_from_fork+0x22/0x30
> [    5.203826] handlers:
> [    5.203875] [<00000000da7aaaea>] usb_hcd_irq
> [    5.203930] Disabling IRQ #23

So this happens also for another EHCI bus IRQ. Is this IRQ also shared
with something?

> [   62.407444] irq 16: nobody cared (try booting with the "irqpoll" option)
> [   62.407474] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.14.13_OBCS_1 #4
> [   62.407499] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 01/22/2018
> [   62.407527] Call Trace:
> [   62.407538]  <IRQ>
> [   62.407547]  ? dump_stack_lvl+0x33/0x42
> [   62.407569]  ? __report_bad_irq+0x32/0xac
> [   62.407588]  ? note_interrupt.cold.11+0xa/0x63
> [   62.407606]  ? handle_irq_event_percpu+0x65/0x70
> [   62.407626]  ? handle_irq_event+0x32/0x50
> [   62.407642]  ? handle_fasteoi_irq+0xa1/0x160
> [   62.408250]  ? __common_interrupt+0x3c/0xa0
> [   62.408820]  ? common_interrupt+0x7a/0xa0
> [   62.409386]  </IRQ>
> [   62.409934]  ? asm_common_interrupt+0x1b/0x40
> [   62.410483]  ? mwait_idle+0x50/0x70
> [   62.411026]  ? default_idle+0x10/0x10
> [   62.411565]  ? default_idle_call+0x1f/0x30
> [   62.412102]  ? do_idle+0x1df/0x1f0
> [   62.412634]  ? cpu_startup_entry+0x14/0x20
> [   62.413164]  ? start_kernel+0x616/0x63d
> [   62.413694]  ? secondary_startup_64_no_verify+0xb0/0xbb
> [   62.414218] handlers:
> [   62.414734] [<00000000da7aaaea>] usb_hcd_irq
> [   62.415257] [<000000008857253d>] ilo_isr [hpilo]
> [   62.415775] Disabling IRQ #16
> 
> There is one USB device, "Bus 001 Device 003: ID 14dd:1007 Raritan
> Computer, Inc. D2CIM-VUSB KVM connector", and it has disappeared
> (i.e. not working).

Thanks for confirming.

> This does not occur without the irqsave/restore in the ehci-hcd.

Now why would simply saving the interrupt state in ehci_irq() prevent
these spurious IRQs? There's something fishy going on alright.

> My timercard driver is not loaded. This is with a 5.14.13 kernel.

Are you able to reproduce this on a machine without the timer card
present at all?

> Lsusb -s1:
> Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> 
> Lsusb -s2 and -s3 are blank.

Looks like you forgot the colon in "lsusb -s1:" so this lists the
devices with number 1 instead of the devices connected to bus 1.
 
> On another almost identical machine (it has 92 cores instead of 72) running
> the 5.3.6 kernel:
> Lsusb -s1:
> 
> Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
> Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
> 
> Lsusb -s2:
> Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
> Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
> 
> Lsusb -s3:
> 
> Bus 002 Device 003: ID 0424:2660 Microchip Technology, Inc. (formerly SMSC) Hub
> Bus 001 Device 003: ID 14dd:1007 Raritan Computer, Inc. D2CIM-VUSB KVM connector

Ok, but there is a device connected to bus 1 as you also mentioned
above.

On Fri, Oct 22, 2021 at 09:38:28PM +0000, Arnold, Scott C. (JSC-CD13)[SGT, INC] wrote:
> Just as another datapoint I put the ehci-hcd.c file from the 5.3.6
> kernel into the 5.14.13 kernel.
> No more "nobody cared" messages, but neither the timer card nor USB is
> working now.

Yeah, that's probably not going to work.

> [    6.798509] usb 2-1: new high-speed USB device number 4 using ehci-pci
> [    6.798586] usb 1-1: new high-speed USB device number 4 using ehci-pci
> [    7.238498] usb 1-1: device not accepting address 4, error -32
> [    7.238562] usb 2-1: device not accepting address 4, error -32
> [    7.388499] usb 1-1: new high-speed USB device number 5 using ehci-pci
> [    7.388561] usb 2-1: new high-speed USB device number 5 using ehci-pci
> [    7.828496] usb 1-1: device not accepting address 5, error -32
> [    7.828557] usb 2-1: device not accepting address 5, error -32
> [    7.828618] usb usb1-port1: unable to enumerate USB device
> [    7.828678] usb usb2-port1: unable to enumerate USB device
> 
> /proc/interrupts for IRQ 16 is stuck at 50.
> 
> Some combination of these two may solve problem.

It would be good if we could rule out the timer card being involved in
this (e.g. since the driver is out of tree).

Johan
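
For context on the "nobody cared" reports: the kernel's spurious-interrupt detection disables a line once almost every interrupt in a window goes unclaimed by all handlers registered on it. A simplified userspace model follows. It assumes a window of 100000 interrupts and a limit of 99900 unhandled; these values are believed to match kernel/irq/spurious.c but should be checked against the tree in use. struct irq_model and model_note_interrupt are hypothetical names, and the real code has additional time-based logic that is left out here:

```c
#include <stdbool.h>

#define MODEL_WINDOW          100000u  /* assumed window size */
#define MODEL_UNHANDLED_LIMIT 99900u   /* assumed unhandled threshold */

struct irq_model {
    unsigned int count;      /* interrupts seen in the current window */
    unsigned int unhandled;  /* of those, how many nobody claimed */
    bool disabled;           /* line has been shut off */
};

/*
 * Record one interrupt on the line. `handled` is true if at least one
 * handler sharing the line claimed it. Returns true on the call that
 * disables the line.
 */
static bool model_note_interrupt(struct irq_model *m, bool handled)
{
    bool bad;

    if (m->disabled)
        return false;
    if (!handled)
        m->unhandled++;
    if (++m->count < MODEL_WINDOW)
        return false;

    /* End of window: decide, then start counting afresh. */
    bad = m->unhandled > MODEL_UNHANDLED_LIMIT;
    m->count = 0;
    m->unhandled = 0;
    if (bad)
        m->disabled = true;
    return bad;
}
```

Once a line is disabled this way, every device sharing it stops receiving interrupts, which would be consistent with the timer card going quiet on IRQ 16.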

Comment 12 Scott Arnold 2021-10-26 21:49:06 UTC

Hello,
Sorry for the late reply; Outlook sometimes puts emails in "Other" for some reason.
I just reverted that machine back to the 5.3.6 kernel.
Now IRQ 16 has:
16: IO-APIC   16-fasteoi   ehci_hcd:usb1, uhci_hcd:usb3, hpilo, rt_pcclk
"uhci_hcd:usb3" does not appear with the 5.11+ kernels (with or without rt_pcclk), so maybe the issue is with uhci_hcd.
The timer card works fine in this config.
Thanks
Scott

Comment 13 Alan Stern 2021-10-27 02:05:39 UTC

What does the kernel log from a recent kernel say about UHCI?  If it is present in the older kernel then it should still show up in the recent kernel.

And if the problem is related to uhci-hcd, why would patching ehci-hcd make it go away?

Comment 14 Scott Arnold 2021-10-27 17:15:00 UTC

Hello,
Sorry for the confusion.
While patching ehci-hcd.c with meld to change the spin_lock calls, the line "current_status = ehci_readl(ehci, &ehci->regs->status);" got changed to "status = ehci_readl(ehci, &ehci->regs->status);" (~line 723).
As a result current_status was used uninitialized (I would have thought gcc 8.3.1 would have warned about that).
This caused the "nobody cared" messages on IRQs 16 and 23, so ehci-hcd did not load/allocate the IRQs.
The spin_locks are not the problem.
If ehci-hcd does not load, the timer card works fine.
If ehci-hcd loads, the timer card will not work even after ehci-hcd is unloaded. The IRQ count on 16 is stuck at ~90 in /proc/interrupts.
It seems like IRQ 16 is somehow left disabled. Could that be why uhci-hcd does not load?
Everything works fine with the 5.3.6 kernel; it is broken in 5.11+.

lsusb with 5.3.6:

Bus 002 Device 003: ID 0424:2660 Microchip Technology, Inc. (formerly SMSC) Hub
Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 003: ID 14dd:1007 Raritan Computer, Inc. D2CIM-VUSB KVM connector
Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Thanks
Scott
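
The effect of the bad merge can be reproduced in a small userspace model. This is a sketch, not the driver code: MODEL_INTR_MASK, handler_as_intended and handler_bad_merge are hypothetical, and the stale value is passed in explicitly because reading a truly uninitialized variable is undefined behavior in C:

```c
#include <stdint.h>

enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical mask of the status bits the handler cares about. */
#define MODEL_INTR_MASK 0x3fu

/* Intended logic: test the value just read from the status register. */
static enum model_irqreturn handler_as_intended(uint32_t status_reg)
{
    uint32_t current_status = status_reg;

    if ((current_status & MODEL_INTR_MASK) == 0)
        return MODEL_IRQ_NONE;      /* shared line, not our interrupt */
    return MODEL_IRQ_HANDLED;
}

/*
 * Bad merge: the register read lands in `status`, but the test still
 * uses current_status, which never received the register value.
 * `stale` stands in for whatever happened to be in that variable.
 */
static enum model_irqreturn handler_bad_merge(uint32_t status_reg,
                                              uint32_t stale)
{
    uint32_t status = status_reg;
    uint32_t current_status = stale;

    (void)status;                   /* read but never consulted */
    if ((current_status & MODEL_INTR_MASK) == 0)
        return MODEL_IRQ_NONE;
    return MODEL_IRQ_HANDLED;
}
```

With a stale value of 0, the handler reports that the interrupt was not its own even though the controller raised it. On a shared level-triggered line that is the situation that leads to "nobody cared".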
 
Comment 15 Alan Stern 2021-10-27 20:07:05 UTC

Can you try testing some kernel versions between 5.3 and 5.11 to see exactly at what point the problem was introduced?

Comment 16 Scott Arnold 2021-10-27 20:39:12 UTC

Hello,
Not easily, the machine is back in general development use. 
I'll see what I can do.
Thanks
Scott

Comment 17 Scott Arnold 2021-10-27 21:14:28 UTC

Hello,
5.4 and 5.9 look the most suspicious based on release notes. I will try those when I get the OK.
Thanks
Scott

Comment 18 Scott Arnold 2021-10-29 15:41:53 UTC

Hello,
So far 5.9.1 is broken, 5.4.1 is ok.
I will try 5.8 and 5.5 next.
Thanks
Scott

Comment 19 Scott Arnold 2021-11-01 19:07:13 UTC

Hello,
5.6.1 kernel OK.
5.7.1+ kernel not OK
Thanks
Scott

Comment 20 Alan Stern 2021-11-01 19:41:07 UTC

There were no significant changes at all to the ehci-hcd driver between 5.6.1 and 5.7.1.  Which indicates that the cause of the problem lies somewhere else in the kernel.

At this point, your best approach would be to carry out a git bisect between those two kernel versions.  Or maybe just between 5.6 and 5.7 (I assume that 5.6 is okay, just like 5.6.1, and 5.7 is bad, just like 5.7.1).  That would let you identify the exact commit where the problem started.
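
The narrowing already done by hand in comments 18 and 19, and the git bisect suggested here, both amount to a binary search for the first bad revision between a known-good and a known-bad point. A minimal sketch of that search follows. The function names and the predicate are hypothetical; in practice the predicate is "build, boot, and check whether the timing card still gets interrupts", and the hard-coded answer below merely encodes the results reported in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Releases between the known-good and known-bad kernels from this report. */
const char *const releases[] = {
    "5.3", "5.4", "5.5", "5.6", "5.7", "5.8", "5.9", "5.10", "5.11"
};

/* Hypothetical test callback: true if the release at `index` is broken. */
typedef bool (*is_bad_fn)(size_t index);

/*
 * Precondition: the release at `good` tested good, the release at `bad`
 * tested bad, and good < bad. Returns the index of the first bad release.
 */
size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad)
{
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid))
            bad = mid;      /* the breakage is at mid or earlier */
        else
            good = mid;     /* the breakage is after mid */
    }
    return bad;
}

/* Stand-in for a real boot test: encodes "5.6.1 OK, 5.7.1+ not OK". */
bool reported_bad(size_t index)
{
    return index >= 4;      /* index 4 is "5.7" in releases[] */
}
```

Calling first_bad(0, 8, reported_bad) tests three midpoints and lands on the index of "5.7". git bisect applies the same idea to individual commits instead of releases.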

Comment 21 Scott Arnold 2021-11-01 19:49:00 UTC

Hello,
Comparing 5.6.1 and 5.7.1 with meld, I noticed there was not much difference in drivers/usb. Git is not easy here due to network restrictions.
I am going to see if the 4.8.5 compiler vs 8.3.1 makes any difference.
I am also going to check whether any BIOS settings could re-arrange the IRQs to work around the problem.
Thanks
Scott


Comment 22 Scott Arnold 2021-11-05 17:10:05 UTC

Hello,
This caused problem:
https://patchwork.kernel.org/project/linux-pci/patch/20200214213313.66622-2-sean.v.kelley@linux.intel.com/
Scott

Comment 23 Scott Arnold 2021-11-05 18:56:22 UTC

pci=noioapicquirk fixes it.

Comment 24 Alan Stern 2021-11-05 19:39:46 UTC

On Fri, Nov 05, 2021 at 05:10:05PM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214789
> 
> --- Comment #22 from Scott Arnold (scott.c.arnold@nasa.gov) ---
> Hello,
> This caused problem:
>
> https://patchwork.kernel.org/project/linux-pci/patch/20200214213313.66622-2-sean.v.kelley@linux.intel.com/
> Scott

This is commit b88bf6c3b6ff ("PCI: Add boot interrupt quirk mechanism 
for Xeon chipsets").

Sean and linux-pci readers, please take a look at this bug report 
(Bugzilla #214789).

Alan Stern

Comment 25 Krzysztof Wilczyński 2021-11-16 19:40:55 UTC

[+CC Sean using his other e-mail address]

Hi Alan,

> > https://bugzilla.kernel.org/show_bug.cgi?id=214789
> > 
> > --- Comment #22 from Scott Arnold (scott.c.arnold@nasa.gov) ---
> > Hello,
> > This caused problem:
> >
> > https://patchwork.kernel.org/project/linux-pci/patch/20200214213313.66622-2-sean.v.kelley@linux.intel.com/
> > Scott
> 
> This is commit b88bf6c3b6ff ("PCI: Add boot interrupt quirk mechanism 
> for Xeon chipsets").
> 
> Sean and linux-pci readers, please take a look at this bug report 
> (Bugzilla #214789).

Sean might not have been able to see this message, as he is no longer
working for Intel and his old e-mail would bounce.

I have added Sean using a different address I've found; perhaps it will work.

	Krzysztof

Comment 26 Sean V Kelley 2021-12-04 00:02:50 UTC

Hi,

Thanks for reaching out Krzysztof.

So this platform is a Xeon E7-8867, which makes it a Broadwell-based Xeon.

I don't see it in the logs; what is the reported device ID?  Those Xeons have the capability to disable the route to the ICH, which is what these quirks are meant to work with.

It also looks like you are using an out of tree driver?  Is that correct?

Are you unable to use MSI with this device?

Sean

Comment 27 Sean V Kelley 2021-12-04 00:05:48 UTC

It just seems like, with this particular vendor card and out-of-tree driver, there is some sort of expectation that the IRQ will always be routed to the ICH/PCH, which the quirk is attempting to prevent in order to avoid disabling valid interrupts with the rerouting.  Hence the question on MSI.

Sean

Comment 28 Sean V Kelley 2021-12-04 00:27:48 UTC

(In reply to Scott Arnold from comment #23)
> pci=noioapicquirk fixes it.
> 

As mentioned in the patch originally:

https://patchwork.kernel.org/project/linux-pci/patch/20200214213313.66622-3-sean.v.kelley@linux.intel.com/

+The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable
+(or disable) the redirection of the interrupt handler to the PCH interrupt
+line. The option can be overridden by either pci=ioapicreroute or
+pci=noioapicreroute.[3]


+[3] https://lore.kernel.org/lkml/487C8EA7.6020205@suse.de/

In the absence of the ability to convert your out-of-tree driver for this hardware to MSI, then can you give pci=noioapicreroute a try?


Sean

Comment 29 Sean V Kelley 2021-12-04 00:29:37 UTC

(In reply to Sean V Kelley from comment #28)
> (In reply to Scott Arnold from comment #23)
> > pci=noioapicquirk fixes it.
> > 
> 
> As mentioned in the patch originally:
> 
> https://patchwork.kernel.org/project/linux-pci/patch/20200214213313.66622-3-sean.v.kelley@linux.intel.com/
> 
> +The config option X86_REROUTE_FOR_BROKEN_BOOT_IRQS exists to enable
> +(or disable) the redirection of the interrupt handler to the PCH interrupt
> +line. The option can be overridden by either pci=ioapicreroute or
> +pci=noioapicreroute.[3]
> 
> 
> +[3] https://lore.kernel.org/lkml/487C8EA7.6020205@suse.de/
> 
> In the absence of the ability to convert your out-of-tree driver for this
> hardware to MSI, then can you give pci=noioapicreroute a try?

Sorry, full of typos today.  I'm wondering if you could give pci=ioapicreroute a try.

I no longer work at Intel but am happy to help try to resolve your issue as best I can.

Best regards,

Sean
> 
> 
> Sean

Comment 30 Scott Arnold 2021-12-06 21:44:00 UTC

Hello,
The device does not appear to support MSI.
pci_enable_msi and pci_alloc_irq_vectors fail.

From lspci (the device uses a bridge chip):

87:00.0 PCI bridge: PLX Technology, Inc. PEX8112 x1 Lane PCI Express-to-PCI Bridge (rev aa) (prog-if 00 [Normal decode])
        Physical Slot: 5
        Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 2
        Bus: primary=87, secondary=88, subordinate=88, sec-latency=64
        I/O behind bridge: 0000a000-0000afff
        Memory behind bridge: ca100000-ca1fffff
        Capabilities: [40] Power Management version 2
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [60] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
        Capabilities: [100] Power Budgeting <?>

88:00.0 Signal processing controller: Datum Inc. Bancomm-Timing Division Device 4013 (rev 20)
        Subsystem: Datum Inc. Bancomm-Timing Division Device 4013
        Flags: medium devsel, IRQ 16, NUMA node 2
        I/O ports at a000 [size=128]
        Memory at ca101000 (32-bit, non-prefetchable) [size=2K]
        Memory at ca100000 (32-bit, non-prefetchable) [size=2K]
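
The listing above shows an MSI capability at [50] on the PEX8112 bridge but no Capabilities lines at all for 88:00.0, which is consistent with the MSI allocation calls failing for the card. The check can be reproduced by walking the capability list in a device's configuration space. The following is a minimal user-space sketch, not kernel code: the register offsets and the MSI capability ID are the standard PCI values, while the function name find_cap and the 256-byte buffer convention are assumptions made for this example.

```c
#include <stdint.h>

#define PCI_STATUS           0x06  /* status register, low byte */
#define PCI_STATUS_CAP_LIST  0x10  /* status bit 4: capability list present */
#define PCI_CAPABILITY_LIST  0x34  /* offset of the first capability pointer */
#define PCI_CAP_ID_MSI       0x05  /* capability ID for MSI */

/*
 * Walk the capability list in a 256-byte copy of PCI configuration space.
 * Returns the offset of the capability with ID `cap_id`, or 0 if the
 * device has no capability list or the ID is not found.
 */
uint8_t find_cap(const uint8_t cfg[256], uint8_t cap_id)
{
    uint8_t pos;
    int ttl = 48;   /* bound the walk in case the list loops */

    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        if (cfg[pos] == cap_id)
            return pos;
        pos = cfg[pos + 1] & 0xfc;  /* byte 1 of each entry is the next pointer */
    }
    return 0;
}
```

For a buffer laid out like the bridge above (capabilities chained at 0x40, 0x50 and 0x60), find_cap(cfg, PCI_CAP_ID_MSI) returns 0x50; for a device without a capability list it returns 0. On Linux the bytes would typically come from the device's sysfs config file, and reading beyond the first 64 bytes generally requires root.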

This is a custom driver for our simulation environment.

I will have to schedule time on the machine to try pci=noioapicreroute.

Scott
