Start time: around 2022-10-03 17:20:00 UTC

Day 1

Timeline

Oct 03 17:21:55

bacalhau-vm-production-0 stopped bidding or processing jobs. The only events are constructing new jobs which are triggered localDB eventhandler

Logs:

Oct 03 17:21:55 bacalhau-vm-production-0 bash[19523]: 17:21:55.17 | DBG job/factory.go:14 > Constructing job from event: 16f10924-e7c4-4572-9048-0533639789ba
Oct 03 17:22:25 bacalhau-vm-production-0 bash[19523]: 17:22:25.574 | DBG job/factory.go:14 > Constructing job from event: eecbbc07-fecb-4c59-9ec6-4bfc4791846c
Oct 03 17:22:26 bacalhau-vm-production-0 bash[19523]: 17:22:26.085 | DBG job/factory.go:14 > Constructing job from event: a6033b77-70d3-4106-b89a-ff225e76366d
...
...
Oct 03 18:09:09 bacalhau-vm-production-0 bash[19523]: 18:09:09.039 | DBG job/factory.go:14 > Constructing job from event: 1f5b3bda-f5f6-4cbd-a065-c1f3bbd1e65f
Oct 03 18:09:09 bacalhau-vm-production-0 bash[19523]: 18:09:09.628 | DBG job/factory.go:14 > Constructing job from event: deccb239-fce4-4097-a77d-89ffd389ce9f

**Analysis:**Successful handling of JobEventCreated by localDB eventhandler should result is the event reaching computenode, and logging something similar to the line below, but that is not happening:

17:15:10.225 | DBG computenode/computenode.go:261 > [QmXaXu9N5GNetatsvwnTfQqNtSeKAD6uCmarbh3LMRYAcF] job created: 0fd27b03-787a-4d7e-bb13-780d22822e7e

My initial guess is that we have a deadlock when accessing localdb as not a single event gets passed localdb eventhandler

Oct 03 17:21:17

Going back few seconds, these are the logs from the latest job that was partially processed by this node:

Logs

Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.16 | DBG job/factory.go:14 > Constructing job from event: 4d85b011-b7d1-4ae9-b860-d497f063d7ce
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.166 | DBG job/factory.go:14 > Constructing job from event: 4d85b011-b7d1-4ae9-b860-d497f063d7ce
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.177 | DBG computenode/computenode.go:261 > [QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL] job created: 4d85b011-b7d1-4ae9-b860-d497f063d7ce
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.192 | DBG computenode/shard_fsm.go:293 > [QmdZQ7Zb] shard: 4d85b011-b7d1-4ae9-b860-d497f063d7ce:0 at state: InitialState transitioning from InitialState -> Enqueued
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.247 | DBG computenode/computenode.go:182 > Found 1 BidShards => Starting loop
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.247 | DBG computenode/computenode.go:203 > Processing BidShard - ctx => context.Background.WithCancel.WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithCancel.WithValue(type trace.tra
ceContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>)
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.247 | DBG computenode/computenode.go:204 > Processing BidShard - baggage => environment=production,nodeid=QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL,jobid=4d85b011-b7d1-4ae9-b860-d497f063d7ce
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.255 | DBG computenode/computenode.go:562 > Compute node QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL bidding on: 4d85b011-b7d1-4ae9-b860-d497f063d7ce:0
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.256 | DBG inmemory/inmemory.go:113 > querying for single job 4d85b011-b7d1-4ae9-b860-d497f063d7ce
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 2022/10/03 17:21:17 ComputeNode.bidMu violation CRITICAL section took 11.750746ms 11750746 (threshold 10ms)
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.347 | DBG computenode/computenode.go:182 > Found 1 BidShards => Starting loop
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.347 | DBG computenode/computenode.go:203 > Processing BidShard - ctx => context.Background.WithCancel.WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithCancel.WithValue(type trace.tra
ceContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type trace.traceContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>).WithValue(type baggage.baggageContextKeyType, val <not Stringer>)
Oct 03 17:21:17 bacalhau-vm-production-0 bash[19523]: 17:21:17.347 | DBG computenode/computenode.go:204 > Processing BidShard - baggage => nodeid=QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL,jobid=4d85b011-b7d1-4ae9-b860-d497f063d7ce,environment=production

AnalysisWhat to notice is that we don’t an fsm log stating that the job went from Enqueued to Bidding state, which we should expect after this line:

computenode/computenode.go:562 > Compute node QmdZQ7ZbhnvWY1J12XYKGHApJ6aufKyLNSvf8jZBrBaAVL bidding on: 4d85b011-b7d1-4ae9-b860-d497f063d7ce:0

Also, we shoulde expect another bidding on log right after the latest Processing BidShard... log statement.

Lately, we see a log line indicating a list API request for the job while it was being processed:

inmemory/inmemory.go:113 > querying for single job 4d85b011-b7d1-4ae9-b860-d497f063d7ce

Outcome (so far)