Debugging Cloud TPU VMs
This document describes how to use the cloud-tpu-diagnostics PyPI package to generate stack traces for processes running in TPU VMs. This package dumps the Python traces when a fault occurs, for example segmentation faults, floating-point exceptions, or illegal operation exceptions. Additionally, it also periodically collects stack traces to help you debug situations when the program is unresponsive.
To use the cloud-tpu-diagnostics PyPI package, you must install it by running pip install cloud-tpu-diagnostics on all TPU VMs. You can do this with one gcloud compute tpus tpu-vm ssh command. For example:
gcloud compute tpus tpu-vm ssh you-tpu-name \ --zone=your-zone \ --project=your-project-name \ --worker=all \ --command="pip install cloud-tpu-diagnostics"
You must also add the following code to your scripts running on all TPU VMs.
from cloud_tpu_diagnostics import diagnostic from cloud_tpu_diagnostics.configuration import debug_configuration from cloud_tpu_diagnostics.configuration import diagnostic_configuration from cloud_tpu_diagnostics.configuration import stack_trace_configuration stack_trace_config = stack_trace_configuration.StackTraceConfig( collect_stack_trace = True, stack_trace_to_cloud = True) debug_config = debug_configuration.DebugConfig( stack_trace_config = stack_trace_config) diagnostic_config = diagnostic_configuration.DiagnosticConfig( debug_config = debug_config) By default, stack traces are collected every 10 minutes. You can change the duration between two stack trace collection events to 5 minutes, for example:
stack_trace_config = stack_trace_configuration.StackTraceConfig( collect_stack_trace = True, stack_trace_to_cloud = True, stack_trace_interval_seconds = 300) Wrap your main method with diagnose() to periodically collect the stack traces:
with diagnostic.diagnose(diagnostic_config): run_main() This configuration starts collecting stack traces inside the /tmp/debugging directory on each TPU VM. There is an agent running on all TPU VMs that uploads the traces from a temporary directory to Cloud Logging.