I have a python service which has been running in an AWS ECS Task without issue. However, if I try to run it locally, the service would deadlock (hang, freeze, stop working…adding more terms for searchability, because I was having a heck of a time finding the right words to search for). My local machine runs MacOS, my container image runs Wolfi (Linux). (I also have a personal machine running Ubuntu, and I could not reproduce the issue on that either).
I’ll skip over the hours and hours of time I spent step-through debugging in
the subprocess
module and cut to the cause: forks.
I do not know enough about forking yet to be able to talk in depth about it. Here’s what I do know:
subprocess.Popen
calls the operating system’sfork
.- Something else in my program is also forking the process.
- These were causing deadlocks.
In my case, the “something else” was the Open Telemetry
opentelemetry.sdk.trace.export.BatchSpanProcessor
. And, what do you know,
this is a known issue, and it’s right there in the
docs.
The BatchSpanProcessor is not fork-safe and doesn’t work well with application servers (Gunicorn, uWSGI) which are based on the pre-fork web server model. The BatchSpanProcessor spawns a thread to run in the background to export spans to the telemetry backend. During the fork, the child process inherits the lock which is held by the parent process and deadlock occurs. We can use fork hooks to get around this limitation of the span processor.
Please see http://bugs.python.org/issue6721 for the problems about Python locks in (multi)threaded context with fork.
There’s also this issue (#3307) which hopes to clarify the model a bit more clearly.
I have spent enough time trying to find the cause of the issue that I’m
hesitant to try to come up with an actual solution, other than not using the
BatchSpanProcessor
locally. I need to get back to work.
However, to leave myself a note, I was able in some proof-of-concept work to
work around the issue using multiprocessing
pools with spawn
instead of
fork
. I’m not sure yet how much work it would take to reorganize the problem
code in my application to work with that, and if that actually solves the
issue, or just hides it.
The concern that I have is that working around the problem, say, by using a
ConsoleSpanProcessor
when on MacOS and a BatchSpanProcessor
on Linux, is
that the problem may still exist on Linux, it’s just better hidden. However, in
reality, that’s probably that path I’ll take so I can just move on with my
life.
Minimum Viable Proof 🔗
Here’s the minimal script that reproduces the issue. Note that the command
called in subprocess.Run
does not matter (I spent a lot of time thinking it
was the executable I was invoking that was the problem; it’s not).
Expect, on MacOS, for this to print Hello, World
a few times before it
deadlocks. The deadlock occurs the first time otel writes a batch of spans to
the OTLP endpoint. Expect, on Linux, for there to be no issues.
pyproject.toml
[tool.poetry]
name = "are you forking kidding me"
version = "0.1.0"
description = ""
authors = []
readme = ""
[tool.poetry.dependencies]
python = "^3.11"
opentelemetry-sdk = "^1.22.0"
opentelemetry-api = "^1.22.0"
opentelemetry-exporter-otlp-proto-grpc = "^1.22.0"
[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
docker-compose.yml
version: "3"
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "4317:4317"
main.py
import subprocess
import time
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
tracer_provider = TracerProvider(
resource=Resource(attributes={SERVICE_NAME: "foo"}),
)
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)
def main():
while True:
with tracer.start_as_current_span("bar"):
subprocess.run(["echo", "Hello, World"])
time.sleep(1)
if __name__ == "__main__":
main()
docker-compose up -d
poetry install
poetry run python main.py