Python Subprocess with OpenTelemetry


I have a Python service that has been running in an AWS ECS task without issue. When I run it locally, however, the service deadlocks (hangs, freezes, stops working… adding more terms for searchability, because I had a heck of a time finding the right words to search for). My local machine runs macOS; my container image runs Wolfi (Linux). (I also have a personal machine running Ubuntu, and I could not reproduce the issue there either.)

I’ll skip over the hours and hours I spent step-through debugging in the subprocess module and cut to the cause: forks.

I do not know enough about forking yet to be able to talk in depth about it. Here’s what I do know:

  • subprocess.Popen calls the operating system’s fork.
  • Something else in my program is also forking the process.
  • These were causing deadlocks.

In my case, the “something else” was the OpenTelemetry opentelemetry.sdk.trace.export.BatchSpanProcessor. And, what do you know, this is a known issue, and it’s right there in the docs:

The BatchSpanProcessor is not fork-safe and doesn’t work well with application servers (Gunicorn, uWSGI) which are based on the pre-fork web server model. The BatchSpanProcessor spawns a thread to run in the background to export spans to the telemetry backend. During the fork, the child process inherits the lock which is held by the parent process and deadlock occurs. We can use fork hooks to get around this limitation of the span processor.

Please see http://bugs.python.org/issue6721 for the problems about Python locks in (multi)threaded context with fork.
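The “fork hooks” the docs mention map to Python’s os.register_at_fork. Here’s a minimal, OTel-free sketch of the mechanics (the names and callbacks are mine, purely for illustration; the after_in_child callback is where a span processor would re-create its background thread and locks):

```python
import os

events = []

# Register callbacks that run around every fork() in this process.
os.register_at_fork(
    before=lambda: events.append("before"),
    after_in_parent=lambda: events.append("after_in_parent"),
    after_in_child=lambda: events.append("after_in_child"),
)

pid = os.fork()
if pid == 0:
    # Child: this is where you'd re-initialize anything that owns
    # locks or background threads (e.g. a span processor).
    os._exit(0)

os.waitpid(pid, 0)
print(events)  # → ['before', 'after_in_parent']
```

The parent only ever sees the before/after_in_parent pair; the child gets its own after_in_child invocation, which is the hook point for re-initialization.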

There’s also this issue (#3307), which aims to clarify the model further.

I have spent enough time trying to find the cause of the issue that I’m hesitant to try to come up with an actual solution, other than not using the BatchSpanProcessor locally. I need to get back to work.

However, to leave myself a note: in some proof-of-concept work I was able to work around the issue by using multiprocessing pools with the spawn start method instead of fork. I’m not sure yet how much work it would take to reorganize the problem code in my application to fit that model, or whether it actually solves the issue rather than just hiding it.
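The spawn variant of that proof of concept looked roughly like this (square stands in for my real work; the point is that spawn starts each worker as a fresh interpreter, so nothing inherits the parent’s locks or background threads the way a fork does):

```python
import multiprocessing as mp

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    # "spawn" launches each worker as a brand-new interpreter, so no
    # locks or background threads (like BatchSpanProcessor's exporter
    # thread) carry over from the parent, unlike "fork".
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))  # → [1, 4, 9]
```

The trade-off is that spawn is slower to start workers and requires everything sent to them to be picklable, which is part of why I’m unsure how much reorganization my real code would need.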

My concern with a workaround, say using a ConsoleSpanExporter (wrapped in a SimpleSpanProcessor) on macOS and a BatchSpanProcessor on Linux, is that the problem may still exist on Linux, just better hidden. In reality, though, that’s probably the path I’ll take so I can move on with my life.
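That platform switch could be as small as the sketch below. use_batch_processor is a hypothetical helper of my own naming, and the assumption baked into it is exactly the one I’m worried about: that the deadlock only bites on macOS (“darwin”).

```python
import sys

def use_batch_processor(platform: str = sys.platform) -> bool:
    # Assumption: the fork deadlock only shows up on macOS ("darwin");
    # keep the batching (background-thread) processor everywhere else.
    return platform != "darwin"

# Sketch of the switch at startup, using the SDK classes from main.py:
#   if use_batch_processor():
#       provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
#   else:
#       provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
print(use_batch_processor("darwin"), use_batch_processor("linux"))  # → False True
```

SimpleSpanProcessor exports synchronously on the calling thread, so there’s no background exporter thread to inherit a held lock across a fork.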

Minimum Viable Proof

Here’s the minimal script that reproduces the issue. Note that the command called in subprocess.run does not matter (I spent a lot of time thinking the executable I was invoking was the problem; it’s not).

On macOS, expect this to print Hello, World a few times before it deadlocks; the deadlock occurs the first time otel writes a batch of spans to the OTLP endpoint. On Linux, expect no issues.

pyproject.toml

[tool.poetry]
name = "are-you-forking-kidding-me"
version = "0.1.0"
description = ""
authors = []
readme = ""

[tool.poetry.dependencies]
python = "^3.11"
opentelemetry-sdk = "^1.22.0"
opentelemetry-api = "^1.22.0"
opentelemetry-exporter-otlp-proto-grpc = "^1.22.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

docker-compose.yml

version: "3"
services:
  jaeger:
    image: jaegertracing/all-in-one:1.54
    ports:
      - "4317:4317"

main.py

import subprocess
import time

from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

tracer_provider = TracerProvider(
    resource=Resource(attributes={SERVICE_NAME: "foo"}),
)
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)

tracer = trace.get_tracer(__name__)

def main():
    while True:
        with tracer.start_as_current_span("bar"):
            subprocess.run(["echo", "Hello, World"])
            time.sleep(1)

if __name__ == "__main__":
    main()
Then run:

docker-compose up -d
poetry install
poetry run python main.py

Further Reading