Tender: Discussion

Why was tender down for several days?

2022-12-12T11:07:02Z

Hi Neil, the ethernet connection on one of our load balancers became
unreliable for unknown reasons and was up and down for about a day. The
automatic IP failover should have kicked in, but it didn’t trigger because
the machine was still up and running: it had a flaky connection to the
internet, but not to the local subnet or the monitoring system.
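
As an illustration only (this is not the actual failover script), a
check that looks at the recent history of external probes, rather than
whether the machine is up right now, can catch a link that flaps. The
class name, window size, and threshold below are all assumptions made
for the sketch.

```python
from collections import deque


class FlapDetector:
    """Tracks recent probe results and flags a link that is up but unreliable."""

    def __init__(self, window=20, max_failure_ratio=0.2):
        # window and max_failure_ratio are illustrative values only.
        self.results = deque(maxlen=window)
        self.max_failure_ratio = max_failure_ratio

    def record(self, ok):
        """Store the outcome of one external probe (True = reachable)."""
        self.results.append(bool(ok))

    def failure_ratio(self):
        """Fraction of failed probes in the current window."""
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_fail_over(self):
        # Decide only once the window is full, so one lost probe is ignored.
        full = len(self.results) == self.results.maxlen
        return full and self.failure_ratio() > self.max_failure_ratio
```

A link that answers every other probe looks "up" to a single manual
check, but here it would cross the threshold once the window fills.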

We had some alerts, but it seemed to be working when we checked it manually,
which is not out of the ordinary. There are periodic DDoS attempts, which
often trigger the same sorts of alerts as the systems kick in.

Once the weekend was over here, we dug into it as part of Monday morning
roundups, saw that traffic was way down, and found that one port on the
switch was misbehaving. We forced a failover to the secondary load
balancer, reconfigured the networking on that machine, had the data center
switch the cables in case that was the issue, and re-sent all the inbound
emails that had queued up due to the connectivity issue.

The failover script, which hadn’t run for over six years, has been tweaked
to be more aggressive in pruning a misbehaving primary IP. There’s a new
cable, and we’re looking at ways of improving the monitoring, including
pings from around the world so it captures certain routes outside the
primary path.
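
A rough sketch of how pings from several locations might be combined
(the function and the vantage-point names are hypothetical, not part of
the real monitoring): if the local checks pass while the remote ones
fail, the result is reported as a partial outage instead of healthy.

```python
def classify_reachability(results):
    """Summarize ping results keyed by vantage-point name (True = ping ok)."""
    if not results:
        return "unknown"
    # Collect the vantage points that could not reach the load balancer.
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "down"
    return "partial: " + ", ".join(failed)


# Hypothetical vantage points: local checks pass, remote ones fail.
status = classify_reachability({
    "local-subnet": True,
    "monitoring": True,
    "eu-probe": False,
    "us-probe": False,
})
print(status)  # partial: eu-probe, us-probe
```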

We still don’t actually have a root cause as to why the original issue
happened, so not much point in talking about it yet! We’re trying to nail it
down between the switch and our colo host’s DDoS protection.

2022-12-12T11:32:00Z