Debugging Cloud TPU VMs

This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. This package dumps the Python traces when a fault occurs, for example segmentation faults, floating-point exceptions, or illegal operation exceptions. Additionally, it also periodically collects stack traces to help you debug situations when the program is unresponsive.

To use the cloud-tpu-diagnostics PyPI package, you must install it by running pip install cloud-tpu-diagnostics on all TPU VMs. You can do this with one gcloud compute tpus tpu-vm ssh command. For example:

 gcloud compute tpus tpu-vm ssh you-tpu-name \  --zone=your-zone \  --project=your-project-name \  --worker=all \  --command="pip install cloud-tpu-diagnostics"

You must also add the following code to your scripts running on all TPU VMs.

from cloud_tpu_diagnostics import diagnostic from cloud_tpu_diagnostics.configuration import debug_configuration from cloud_tpu_diagnostics.configuration import diagnostic_configuration from cloud_tpu_diagnostics.configuration import stack_trace_configuration stack_trace_config = stack_trace_configuration.StackTraceConfig( collect_stack_trace = True, stack_trace_to_cloud = True) debug_config = debug_configuration.DebugConfig( stack_trace_config = stack_trace_config) diagnostic_config = diagnostic_configuration.DiagnosticConfig( debug_config = debug_config) 

By default, stack traces are collected every 10 minutes. You can change the duration between two stack trace collection events to 5 minutes, for example:

stack_trace_config = stack_trace_configuration.StackTraceConfig( collect_stack_trace = True, stack_trace_to_cloud = True, stack_trace_interval_seconds = 300) 

Wrap your main method with diagnose() to periodically collect the stack traces:

with diagnostic.diagnose(diagnostic_config): run_main() 

This configuration starts collecting stack traces inside the /tmp/debugging directory on each TPU VM. There is an agent running on all TPU VMs that uploads the traces from a temporary directory to Cloud Logging.