Troubleshoot deadline-exceeded error
All requests made to the Temporal Cluster by the Client or Worker are gRPC requests.
Sometimes, when these frontend requests can't be completed, you'll see this particular error message: Context: deadline exceeded
.
Network interruptions, timeouts, server overload, and Query errors are some of the causes of this error.
The following sections discuss the nature of this error and how to troubleshoot it.
Check system clocks
Timing skew can cause the system clock on a Worker to drift behind the system clock of the Temporal Cluster.
If the difference between the two clocks exceeds an Activity's Start-To-Close Timeout, an Activity complete after timeout
error occurs.
If you receive an Activity complete after timeout
error alongside Context: deadline exceeded
, check the clocks on the Temporal Cluster's system and the system of the Worker sending that error.
If the Worker's clock doesn't match the Temporal Cluster, synchronize all clocks to an NTP server.
Check Frontend Service logs
Cloud users cannot access some of the logs needed to diagnose the source of the error.
If you're using Temporal Cloud, create a support ticket with as much information as possible, including the Namespace Name and the Workflow Ids of some Workflow Executions in which the issue occurs.
Frontend Service logs can show which parts of the Cluster aren't working. For the error to appear, a service pod or container must be up and running.
OSS users can verify that the Frontend Service is connected and running by using the Temporal CLI.
temporal operator cluster health --address 127.0.0.1:7233
Use grpc-health-probe
to check the Frontend Service, Matching Service, and History Service.
./grpc-health-probe -addr=frontendAddress:frontendPort -service=temporal.api.workflowservice.v1.WorkflowService
./grpc-health-probe -addr=matchingAddress:matchingPort -service=temporal.api.workflowservice.v1.MatchingService
./grpc-health-probe -addr=historyAddress:historyPort -service=temporal.api.workflowservice.v1.HistoryService
Logs can also be used to find failed Client Query requests.
Check your Cluster metrics
Cluster metrics can be used to detect issues (such as resource exhausted
) that impact Cluster health.
A resource exhausted
error can cause your client request to fail, which prompts the deadline exceeded
error.
Use the following query to check for errors in RpsLimit
, ConcurrentLimit
and SystemOverloaded
on your metrics dashboard.
sum(rate(service_errors_resource_exhausted{}[1m])) by (resource_exhausted_cause)
Look for high latencies, short timeouts, and other abnormal Cluster metrics. If the metrics come from a specific service (such as History Service), check the service's health and performance.
Check Workflow logic
Check your Client and Worker configuration files for missing or invalid target values, such as the following:
- Server names
- Network or host addresses
- Certificates
Invalid targets also cause connection refused
errors alongside deadline exceeded
.
Check that the Client connects after updating your files.
Advanced troubleshooting
In addition to the steps listed in the previous sections, check the areas mentioned in each of the following scenarios.
After enabling mTLS
Check the health of the Cluster with temporal operator cluster health
.
temporal operator cluster health --address [SERVER_ADDRESS]
Add any missing environment variables to the configuration files, and correct any incorrect values. Server names and certificates must match between Frontend and internode.
After restarting the Temporal Cluster
You might not be giving the Cluster enough time to respond and reconnect. Restart the Server, wait, and then check all services for connectivity and further errors.
If the error persists, review your Workflow Execution History and server logs for more specific causes before continuing to troubleshoot.
When executing or scheduling Workflows
One or more services might be unable to connect to the Frontend Service. The Workflow might be unable to complete requests within the given connection time.
Increase the value of frontend.keepAliveMaxConnectionAge
so that requests can be finished before the connection terminates.
If you increase frontend.keepAliveMaxConnectionAge
values, consider monitoring your server performance for load.
Still unable to resolve your issue?
- If you use Temporal Cloud, create a support ticket.
- If you use our open source software or Temporal Cloud, check for similar questions and possible solutions in our community forum or community Slack.