Lately we have been receiving bursts of hundreds of emails about lost messages and out-of-sequence packets. They come anywhere from every several days to every couple of weeks: a big flurry of emails all at once, and then it subsides. I have been trying to track down the cause over the past few months whenever it happens, but I'm still not sure what is going on. I feel like I'm running in circles chasing red herrings. Here are a few snapshots from some of the logs I have.
From coaxtoring.d
20210513_UTC_18:30:10 coaxtoring: High mark:157391 buffers
20210513_UTC_18:35:10 coaxtoring: High mark:157403 buffers
20210513_UTC_18:40:10 coaxtoring: High mark:157527 buffers
20210513_UTC_18:45:10 coaxtoring: High mark:157587 buffers
20210513_UTC_18:50:11 coaxtoring: High mark:157688 buffers
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:236 4 lost messages
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:241 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:178 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:183 3 lost messages
20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:54 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:185 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:187 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:190 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:6 13 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:9 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:195 4 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:200 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:13 3 lost messages
20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:56 1 lost message
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:243 1 lost message
20210513_UTC_18:55:11 coaxtoring: High mark:157790 buffers
20210513_UTC_19:00:11 coaxtoring: High mark:157903 buffers
20210513_UTC_19:05:11 coaxtoring: High mark:157966 buffers
20210513_UTC_19:10:11 coaxtoring: High mark:158070 buffers
20210513_UTC_19:15:11 coaxtoring: High mark:158202 buffers
20210513_UTC_19:20:11 coaxtoring: High mark:158239 buffers
20210513_UTC_19:25:11 coaxtoring: High mark:158366 buffers
20210513_UTC_19:30:11 coaxtoring: High mark:158384 buffers
20210513_UTC_19:35:11 coaxtoring: High mark:158392 buffers
I'm not entirely sure what these high marks really mean. High relative to what?
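To see which modules account for the losses in the coaxtoring.d excerpt above, the "lost messages" lines can be tallied per logo. This is only a rough sketch in Python, assuming every such line follows the "inst:N mid:N typ:N seq:N K lost message(s)" layout shown above; the function name tally_lost and the LOG variable are made up for illustration, and the sample text is copied from the excerpt.

```python
import re
from collections import Counter

# Lines copied from the coaxtoring.d excerpt above (one high-mark line kept
# to show that non-matching lines are skipped).
LOG = """\
20210513_UTC_18:50:11 coaxtoring: High mark:157688 buffers
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:236 4 lost messages
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:241 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:178 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:183 3 lost messages
20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:54 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:185 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:187 1 lost message
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:190 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:6 13 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:9 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:195 4 lost messages
20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:200 2 lost messages
20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:13 3 lost messages
20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:56 1 lost message
20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:243 1 lost message
"""

# Assumed line layout, taken from the excerpt rather than from any format spec.
LOST_RE = re.compile(
    r"inst:(\d+) mid:(\d+) typ:(\d+) seq:(\d+) (\d+) lost messages?"
)


def tally_lost(text):
    """Sum the reported lost-message counts per (inst, mid, typ) logo."""
    totals = Counter()
    for line in text.splitlines():
        match = LOST_RE.search(line)
        if match:
            inst, mid, typ, _seq, lost = map(int, match.groups())
            totals[(inst, mid, typ)] += lost
    return totals


totals = tally_lost(LOG)
# Print the logos with the most losses first.
for (inst, mid, typ), lost in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"inst:{inst} mid:{mid} typ:{typ} -> {lost} lost")
```

For this excerpt the tally comes to 18 lost for mid 34, 14 for mid 33, 7 for mid 31, and 2 for inst 38 / mid 255. Run over a full day's log, it should show whether the same few modules dominate every burst.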
Short snippet from statmgr.d
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i38 m255 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m34 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m34 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m34 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i38 m255 t19 in WAVE_RING
UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m31 t19 in WAVE_RING
I do see a common issue with mid 33 and 34, which are two of our larger import machines and hold the bulk of our stations. Both are running the wftimefilter.d module. On mid 33 I see this in the log:
20210513_UTC_18:54:39 wftimefilter: Saw sequence# gap for logo (i2 m151 t19 s0)
20210513_UTC_18:54:39 UNKN.SDG..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SBT..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SRD..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SWR..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SCA..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SCS..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SPK..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SCE..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SPO..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SNI..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SDT..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.STH..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SMD..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SCT..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SSQ..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SCK..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SIO..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SDL..-- 1620931830.000 250.000 s gap detected
20210513_UTC_18:54:39 UNKN.SSL..-- 1620931830.000 250.000 s gap detected
20210513_UTC_19:08:49 wftimefilter: Saw sequence# gap for logo (i2 m151 t19 s0)
Is UNKN a placeholder? We have no station named UNKN, yet I see it a lot throughout the log for that day. When I run sniffwave on this particular station, it comes up with [D:-0.40s F: 0.0s], which would indicate out-of-sequence data, but it also shows its mid as 33, which is the import machine.
In the statmgr.d log I see a lot of q330 flapping on this machine: modules dying and being restarted, accompanied by missed messages from them. Could any of these stations be getting the UNKN placeholder stnid? The SOH channel names indicate a q330. When I checked the tanks for the few suspects, none had any data from at least 3-4 hours prior to these events.
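One thing that may help line up the wftimefilter gap lines with the other logs is converting the numeric fields to wall-clock time. The sketch below assumes (I have not confirmed this) that the third field is the gap start in Unix epoch seconds and the fourth is the gap length in seconds; gap_window is just a name I made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def gap_window(line):
    """Return (start, end) as UTC datetimes for one 'gap detected' line.

    Assumed layout (not confirmed against the wftimefilter docs):
    <log time> <channel id> <epoch start> <gap seconds> s gap detected
    """
    fields = line.split()
    start = datetime.fromtimestamp(float(fields[2]), tz=timezone.utc)
    end = start + timedelta(seconds=float(fields[3]))
    return start, end


# One line copied from the wftimefilter log on mid 33.
line = "20210513_UTC_18:54:39 UNKN.SDG..-- 1620931830.000 250.000 s gap detected"
start, end = gap_window(line)
print(start.strftime("%Y-%m-%d %H:%M:%S"), "->", end.strftime("%H:%M:%S"))
# prints: 2021-05-13 18:50:30 -> 18:54:40
```

If that reading of the fields is right, every UNKN channel reports the same gap, from 18:50:30 to 18:54:40 UTC, which ends within a second of the 18:54:39 log entry and about 25 seconds after the 18:54:15 burst of lost messages.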