Swarms of lost messages from export machines

bzi...@uw.edu

unread,

May 13, 2021, 3:56:08 PM5/13/21

to Earthworm Community Forum

(Warning: kinda lengthy message)

Lately, we have been receiving swarms of emails in the hundreds every so often for lost messages and out of sequence packets. It happens every several days to every couple of weeks. It's a big flurry of emails all at once and then it subsides. I have been trying to sleuth them out over the past few months when they occur but I'm not entirely sure what is happening. I feel like I'm running in circles getting thrown red herrings. Here is a few snapshots from some of the logs I have.

From coaxtoring.d

20210513_UTC_18:30:10 coaxtoring: High mark:157391 buffers

20210513_UTC_18:35:10 coaxtoring: High mark:157403 buffers

20210513_UTC_18:40:10 coaxtoring: High mark:157527 buffers

20210513_UTC_18:45:10 coaxtoring: High mark:157587 buffers

20210513_UTC_18:50:11 coaxtoring: High mark:157688 buffers

20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:236 4 lost messages

20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:241 2 lost messages

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:178 1 lost message

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:183 3 lost messages

20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:54 1 lost message

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:185 1 lost message

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:187 1 lost message

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:190 2 lost messages

20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:6 13 lost messages

20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:9 2 lost messages

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:195 4 lost messages

20210513_UTC_18:54:15 inst:2 mid:33 typ:19 seq:200 2 lost messages

20210513_UTC_18:54:15 inst:2 mid:34 typ:19 seq:13 3 lost messages

20210513_UTC_18:54:15 inst:38 mid:255 typ:19 seq:56 1 lost message

20210513_UTC_18:54:15 inst:2 mid:31 typ:19 seq:243 1 lost message

20210513_UTC_18:55:11 coaxtoring: High mark:157790 buffers

20210513_UTC_19:00:11 coaxtoring: High mark:157903 buffers

20210513_UTC_19:05:11 coaxtoring: High mark:157966 buffers

20210513_UTC_19:10:11 coaxtoring: High mark:158070 buffers

20210513_UTC_19:15:11 coaxtoring: High mark:158202 buffers

20210513_UTC_19:20:11 coaxtoring: High mark:158239 buffers

20210513_UTC_19:25:11 coaxtoring: High mark:158366 buffers

20210513_UTC_19:30:11 coaxtoring: High mark:158384 buffers

20210513_UTC_19:35:11 coaxtoring: High mark:158392 buffers

I'm not entirely sure what these high marks really mean. High to what?

Short snippet from statmgr.d

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i38 m255 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m34 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m33 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m34 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i38 m255 t19 in WAVE_RING

UTC_Thu May 13 18:54:15 2021 ewserver1/export_scn missed msg(s) i2 m31 t19 in WAVE_RING

I do see a common issue with mid 33 and 34, which are two of our larger import machines holding the bulk of our stations. Both are running wftimefilter.d module. On mid 33 I see this in the log

20210513_UTC_18:54:39 wftimefilter: Saw sequence# gap for logo (i2 m151 t19 s0)

20210513_UTC_18:54:39 UNKN.SDG..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SBT..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SRD..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SWR..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SCA..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SCS..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SPK..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SCE..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SPO..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SNI..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SDT..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.STH..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SMD..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SCT..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SSQ..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SCK..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SIO..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SDL..-- 1620931830.000 250.000 s gap detected

20210513_UTC_18:54:39 UNKN.SSL..-- 1620931830.000 250.000 s gap detected

20210513_UTC_19:08:49 wftimefilter: Saw sequence# gap for logo (i2 m151 t19 s0)

Is UNKN a placeholder? We have no station named UNKN. I do see that a lot throughout the log for the day. When I do a sniffwave for this particular station, it does come up with a [D:-0.40s F: 0.0s] which would indicate out of sequence, but also shows its mid as 33 which is the import machine.

In the statmgr.d log I see a lot of q330 flapping on this machine. Modules dying and being restarted, accompanied with missed messages from them. Could any of these stations be getting the UNKN placeholder stnid? The SOH channel names indicate a q330. When checking the tanks for the few suspects, none have any data from at least 3-4 hrs prior to these events.

The overall puzzling thing is that originally, the messages from our exports indicated missed messages from multiple different waveserverV's on the export (we have 8 running). I almost feel like going into mid 33 and 34 is somewhat a red herring in the sense that some of the waveserverV's are not storing data from those mid's, yet they are also pushing out these messages. The only other thought I can think of is that mid 33 and 34 are flooding the ring that feeds our export machine and thus creating a problem downstream for other data that really isn't having an issue, or there is something else interfering with the process.

Any thoughts or insight would be appreciated.

Thank you

Brett Zimmermann

Dietz, Lynn D

unread,

May 13, 2021, 4:24:06 PM5/13/21

to Earthworm Community Forum

Basically, it looks like your coaxtoring is bogging down and not keeping up with pulling the broadcast data off the network. Are you running it at "realtime" priority? On our systems coaxtoring is the highest priority process.

"High mark" means the most RcvBuf slots filled in the most recent BufferReportInt seconds. From Menlo'scoaxtoring config:

BufferReportInt 600 # Print high buffer count every n seconds

RcvBufSize 260000 # Size of socket receive buffer, in bytes.

# Set RcvBufSize to a large number to avoid

# dropping packets.

And if coaxtoring misses a message, lots of other modules will complain about a "lost msg" (if they are paying attention to the instid,modid,type of the mssed msg).

Regarding UNKN...seems like you might be importing some data you are not expecting, perhaps from an unconfigured- or mis-configured logger? Did you do a sniffwave on your import machine to see what the instid,moduleid of those UNKN channels are? Maybe you can track its source down from that. But sure seems like bogus data to get rid of.

Regarding modules restarting...every time a module starts up, it resets its sequence number to zero. If other modules are watching the output of that restarting module, you may get msgs like:

Saw sequence# gap for logo (i2 m151 t19 s0) <- where s0 is sequence#0 .... usually a module restart.

-Lynn Dietz

From: earthwo...@googlegroups.com <earthwo...@googlegroups.com> on behalf of bzi...@uw.edu <bzi...@uw.edu>
Sent: Thursday, May 13, 2021 12:56 PM
To: Earthworm Community Forum <earthwo...@googlegroups.com>
Subject: [EXTERNAL] [Earthworm Forum] Swarms of lost messages from export machines

This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.

--
--
You received this message because you are subscribed to the Google
Groups "Earthworm Community Forum" group.

To post to this group, send an email to earthwo...@googlegroups.com

To unsubscribe from this group, send an email to
earthworm_for...@googlegroups.com

For more options, visit this group at
http://groups.google.com/group/earthworm_forum?hl=en

---
You received this message because you are subscribed to the Google Groups "Earthworm Community Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to earthworm_for...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/earthworm_forum/9637a98b-9daa-4f02-a765-ac845006573en%40googlegroups.com.

Baker, Lawrence M

unread,

May 13, 2021, 4:35:32 PM5/13/21

to Brett Zimmermann, Baker, Lawrence M, 'Meremonte, Mark E' via Earthworm Community Forum

Brett,

The questions you raise here are what I might suspect.

In the statmgr.d log I see a lot of q330 flapping on this machine. Modules dying and being restarted, accompanied with missed messages from them. Could any of these stations be getting the UNKN placeholder stnid? The SOH channel names indicate a q330. When checking the tanks for the few suspects, none have any data from at least 3-4 hrs prior to these events.

The overall puzzling thing is that originally, the messages from our exports indicated missed messages from multiple different waveserverV's on the export (we have 8 running). I almost feel like going into mid 33 and 34 is somewhat a red herring in the sense that some of the waveserverV's are not storing data from those mid's, yet they are also pushing out these messages. The only other thought I can think of is that mid 33 and 34 are flooding the ring that feeds our export machine and thus creating a problem downstream for other data that really isn't having an issue, or there is something else interfering with the process.

My suspicion was that you have some wildcards in your config files and a new or rogue station (the one with the UNKN stnid?) is stuffing data into your system. I don't know how, though.

Earthworm has no metering built in to throttle bursts of data that can overwhelm the capacity of the ring buffers. Perhaps you have network connectivity problems to those stations that have no data for 3-4 hours, then they blast all the data they've been queueing. Try to record a tcpdump capture over the time this occurs. Try to find all the IP addresses that are sending data to your Earthworm system. If they are all IP addresses you expect, then look at the timestamps of the incoming packets for any pattern of interruptions. I don't know of any tcpdump plotting programs, but a visual picture of arrival times or inter-arrival times would help.

Anyway, my guess is, as long as you are not getting hammered by packets that don't belong there, you likely have flaky IP connections with very bursty behavior. As far as I know, Earthworm is susceptible to failures when that happens, because it has no flow control features that I know of. I don't know of any fix, other than to increase the amount of buffering everywhere along the way.

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On May 13 2021, at 12:56:08 PM, bzi...@uw.edu <bzi...@uw.edu> wrote:

This email has been received from outside of DOI - Use caution before clicking on links, opening attachments, or responding.

bzi...@uw.edu

unread,

May 14, 2021, 12:09:23 PM5/14/21

to Earthworm Community Forum

Thanks for the responses. After some more sleuthing I figured out the UNKN issue, really boils down to a station recently upgraded to a q330s+ where the telemetry was assumed solid, apparently not so much. When it does connect, it sends these blasts of data before it disconnects again. The logs didn't really point me in the right direction aside from the soh channel names being indicative of a q330, but some other odd behavior from a few stations keyed me in and ended up stopping the modules, waiting, then restarting one at a time and watch the logs.

As for the buffer size thing, we had just recently increased the buffer size a few months ago. With some further troubleshooting from Victor Kress, it seems to be more of a read/write issue. The data is just not writing as fast as it's coming in. Could be a hardware limitation, could be overpopulated waveserverV's. Will be doing some further digging.

Thanks

Reply all

Reply to author

Forward