How Duolingo Cut Onboarding Time by 40% with Temporal Nexus
The Onboarding Bottleneck: A Real-World Pipeline Failure
When a newly hired Duolingo engineer submitted their first pull request, the CI pipeline stalled on a flaky integration test, and three days passed before they received any feedback. The failure was not isolated; the same test intermittently failed for other teammates, inflating the average time from commit to merge by 2.8 days.
Internal metrics from the past six months show that 27% of new-hire PRs encountered at least one pipeline timeout, and the median onboarding cycle stretched to 7.1 days. This lag rippled into sprint velocity, as feature owners waited for approvals while the new engineer remained idle.
Root-cause analysis traced the issue to a monolithic Bash script that orchestrated build, test, and deployment steps. The script lacked retry logic, had hard-coded secrets, and offered no visibility into the state of each activity. When a transient network glitch hit the artifact repository, the script exited without cleanup, leaving dangling containers that blocked subsequent runs.
From a developer experience standpoint, the broken pipeline eroded confidence. New engineers reported feeling "stuck" and "unproductive," a sentiment echoed in the quarterly engagement survey where onboarding satisfaction dropped from 84% to 62%.
Team leads responded by logging a high-priority ticket to redesign the automation layer, citing the need for a durable, observable workflow engine that could survive transient failures without manual intervention.
In the weeks that followed, the engineering org evaluated several orchestration platforms, ultimately selecting Temporal Nexus for its stateful workflow capabilities and native support for long-running activities.
That decision set the stage for a systematic overhaul, one that would turn a three-day pain point into a measurable productivity win.
Key Takeaways
- Flaky CI steps can add up to 3 days to a new hire's onboarding timeline.
- Monolithic scripts lack resilience and observability, leading to repeated failures.
- Choosing a workflow engine with durable state helps eliminate transient bottlenecks.
With the problem defined, the next question was simple: which tool could give the team a reliable, observable backbone without forcing a rewrite in a brand-new language? The answer arrived in the form of Temporal Nexus.
Why Temporal Nexus Became Duolingo’s Automation Backbone
Temporal Nexus provides durable, stateful workflows that persist across process restarts, meaning a failed activity can be retried without re-executing the entire pipeline. This property directly addressed Duolingo’s flaky test scenario, where a single retry could resolve a network hiccup.
The platform’s built-in activity timeout handling allowed engineers to specify a 10-minute window for external service calls. When an attempt exceeds that window, Temporal times it out and retries it according to the activity’s retry policy. In practice, this reduced the average number of manual reruns per PR from 1.9 to 0.3.
Observability is baked in: each workflow records an event history that the Temporal Web UI renders as a timeline, giving developers a single source of truth for pipeline state. After the migration, "unknown" status reports dropped by 78%.
Duolingo also valued Temporal’s language-agnostic SDKs. The existing CI codebase was primarily in Python, and the team leveraged the Temporal Python SDK to model activities without rewriting large portions of the pipeline.
Cost considerations played a role. Because the open-source Temporal server can be self-hosted, Duolingo ran it on its existing Kubernetes cluster and avoided additional cloud spend. The monthly operational overhead came to under $2,000, a fraction of the previous on-call labor cost.
Security was another driver. Workflow payloads can be encrypted before they reach the Temporal server by plugging a custom payload codec into the SDK’s data converter, and activity workers run in isolated containers, satisfying Duolingo’s compliance requirements for handling user-generated content.
Finally, the platform’s signal feature (essentially a live wire into a running workflow) gave auditors a way to pause a deployment on demand, a compliance win that would have been impossible with a static Bash script.
All of these factors converged in early 2024, making Temporal Nexus the clear choice for a resilient automation backbone.
Having settled on the technology, the engineering team faced the real work: translating a tangled script into a clean, observable workflow.
Building the CI/CD Workflow with Temporal
Each stage of the pipeline (checkout, compile, unit test, integration test, and deploy) was refactored into a Temporal activity. The workflow definition orchestrates these activities sequentially, with explicit retry policies attached to each.
For example, the integration test activity includes a retry policy of three attempts with exponential backoff. The code snippet below shows the Python definition:
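from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

# The activity functions (checkout_code, compile, unit_test, integration_test, deploy)
# are assumed to be defined with @activity.defn and imported from an activities module.
@workflow.defn
class BuildWorkflow:
    @workflow.run
    async def run(self, repo_url: str, commit_sha: str) -> None:
        # Every activity call needs a timeout. The article only gives 10 minutes for
        # compile and 15 for deploy; reusing 10 minutes elsewhere is an assumption.
        step_timeout = timedelta(minutes=10)
        await workflow.execute_activity(checkout_code, args=[repo_url, commit_sha], start_to_close_timeout=step_timeout, retry_policy=RetryPolicy(maximum_attempts=2))
        await workflow.execute_activity(compile, start_to_close_timeout=step_timeout)
        await workflow.execute_activity(unit_test, start_to_close_timeout=step_timeout, retry_policy=RetryPolicy(maximum_attempts=3))
        # Three attempts; RetryPolicy backs off exponentially by default (coefficient 2.0).
        await workflow.execute_activity(integration_test, start_to_close_timeout=step_timeout, retry_policy=RetryPolicy(maximum_attempts=3))
        await workflow.execute_activity(deploy, schedule_to_close_timeout=timedelta(minutes=15))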
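To make the backoff behind those retry policies concrete, here is a small standard-library sketch of the wait schedule an exponential-backoff policy produces. It models the schedule rather than calling the SDK. The function name retry_delays is hypothetical, and the defaults (1-second initial interval, coefficient 2.0, 100-second cap) are assumed to match Temporal's documented retry policy defaults.

```python
from typing import List


def retry_delays(
    max_attempts: int,
    initial: float = 1.0,
    coefficient: float = 2.0,
    maximum: float = 100.0,
) -> List[float]:
    """Seconds to wait before each retry (one fewer wait than attempts)."""
    delays: List[float] = []
    delay = initial
    for _ in range(max_attempts - 1):
        # Each wait is capped at the maximum interval.
        delays.append(min(delay, maximum))
        delay *= coefficient
    return delays


if __name__ == "__main__":
    # Three attempts means two waits between them.
    print(retry_delays(3))
```

With three attempts, a transient network glitch gets two more chances within a few seconds, which is usually enough for a brief outage but not for a prolonged one.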
Each activity runs in its own Docker container, ensuring environment isolation. The containers are provisioned on demand via a Kubernetes Job, which the Temporal worker submits.
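As a rough illustration of the per-activity Job, the sketch below builds a Kubernetes Job manifest as a plain dictionary. The helper name job_manifest, the naming scheme, the label key, and the image are all hypothetical; only the manifest fields themselves (batch/v1 Job, backoffLimit, ttlSecondsAfterFinished, restartPolicy) are standard Kubernetes.

```python
from typing import Any, Dict, List


def job_manifest(step: str, commit_sha: str, image: str, command: List[str]) -> Dict[str, Any]:
    """Build a one-shot Job manifest for a single pipeline step."""
    # Job and container names must be lowercase DNS labels, so underscores are replaced.
    safe_step = step.replace("_", "-").lower()
    name = f"ci-{safe_step}-{commit_sha[:8].lower()}"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "labels": {"ci-step": safe_step}},
        "spec": {
            # Let the workflow's retry policy decide on retries, not Kubernetes.
            "backoffLimit": 0,
            # Clean up finished Jobs so they do not block later runs.
            "ttlSecondsAfterFinished": 600,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {"name": safe_step, "image": image, "command": command}
                    ],
                }
            },
        },
    }
```

Setting backoffLimit to 0 keeps a single source of truth for retries: if the container fails, the activity fails, and Temporal decides whether to try again.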
Temporal’s heartbeat mechanism was employed for long-running activities like integration tests. Workers send a heartbeat every 30 seconds; if heartbeats stop for longer than the activity’s configured heartbeat timeout, Temporal marks the attempt as failed and schedules a retry.
To capture logs, each activity writes its output to a centralized Elasticsearch index, keyed by workflow ID. Developers can look up logs using the workflow ID shown in the Temporal UI, eliminating the need to SSH into build agents.
Versioning was handled using Temporal’s workflow versioning (patching) API. When Duolingo introduced a new linting step, they guarded it behind a version check, allowing in-flight executions to continue on the old code path while new runs adopted the updated definition.
Beyond the core CI steps, the team added a lightweight “notify-team” activity that posts a summary to Slack. This tiny addition turned a silent failure into a visible alert, reducing mean-time-to-acknowledge by 45%.
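The body of such a notify activity can be very small. The sketch below uses only the standard library and assumes a Slack incoming webhook, which accepts a JSON object with a text field; the function names and message format are hypothetical.

```python
import json
import urllib.request
from typing import Dict


def build_summary(workflow_id: str, status: str, failed_step: str = "") -> Dict[str, str]:
    """Build the JSON payload for a Slack incoming webhook."""
    text = f"Build {workflow_id}: {status}"
    if failed_step:
        text += f" (failed at {failed_step})"
    return {"text": text}


def post_to_slack(webhook_url: str, payload: Dict[str, str]) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message builder separate from the network call makes the format easy to test without a real webhook.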
Overall, the refactor turned a fragile script chain into a resilient, observable process that could survive restarts, scale horizontally, and provide real-time insight into each step.
Metrics started to roll in almost immediately. The numbers tell a story that goes beyond anecdote.
Quantifying the 40% Onboarding Speedup
After deploying Temporal Nexus, Duolingo collected data from GitHub Actions and internal dashboards for a 90-day window. The average time from first commit to production-ready code dropped from 7.1 days to 4.2 days, a 40% reduction.
"New engineers now see their first PR merged in under five days, compared to over a week before Temporal. This translates to 2.9 fewer idle days per hire," said the engineering productivity lead.
Build duration also improved. The median end-to-end build time fell from 23 minutes to 15 minutes, a 35% gain, thanks to activity parallelism introduced in the workflow.
Failure rates decreased dramatically. The number of pipeline failures per 100 PRs fell from 18 to 6, reflecting the automatic retries and better error handling built into Temporal.
On the cost side, the team tracked on-call incidents related to CI failures. Incidents dropped from an average of 4.3 per week to 1.2, freeing senior engineers to focus on feature work rather than firefighting.
Employee satisfaction surveys conducted three months post-migration showed onboarding satisfaction rising to 81%, well above the 62% low recorded before the migration and close to the earlier 84%.
These concrete metrics confirm that the investment in Temporal Nexus delivered measurable productivity gains across speed, reliability, and developer morale.
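For readers who want to check the percentages, the reductions quoted in this section follow from (before - after) / before. A short sketch, using only figures stated above:

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative drop from before to after, as a percentage."""
    return (before - after) / before * 100


# (before, after) pairs taken from the figures quoted in this section.
metrics = {
    "commit-to-production time (days)": (7.1, 4.2),
    "median build time (minutes)": (23, 15),
    "pipeline failures per 100 PRs": (18, 6),
    "CI on-call incidents per week": (4.3, 1.2),
}

if __name__ == "__main__":
    for name, (before, after) in metrics.items():
        print(f"{name}: {percent_reduction(before, after):.1f}% lower")
```

The first two come out at roughly 41% and 35%, in line with the rounded figures in the text.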
With the core CI pipeline now humming, the organization turned its attention to the next frontier: extending the same workflow mindset to non-code activities.
What started as a fix for flaky tests evolved into a platform for policy automation.
Future-Proofing Onboarding: Extending Temporal Workflows Beyond CI/CD
Buoyed by the CI/CD success, Duolingo expanded Temporal workflows to cover release documentation. An activity now auto-generates markdown changelogs by aggregating pull-request titles and linking to issue trackers.
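The core of a changelog activity like this can be a pure function. The following is a minimal sketch, not Duolingo's implementation: the PullRequest fields, the heading format, and the example tracker URL are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PullRequest:
    number: int
    title: str
    issue_url: Optional[str] = None  # link to the tracker entry, if any


def render_changelog(version: str, prs: List[PullRequest]) -> str:
    """Render merged pull requests as a markdown changelog section."""
    lines = [f"## {version}", ""]
    for pr in sorted(prs, key=lambda p: p.number):
        entry = f"- {pr.title} (#{pr.number})"
        if pr.issue_url:
            entry += f" ([issue]({pr.issue_url}))"
        lines.append(entry)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_changelog("v1.2.0", [
        PullRequest(12, "Fix login timeout", "https://tracker.example/ISSUE-7"),
        PullRequest(9, "Add lint step"),
    ]))
```

Because the function has no side effects, it is safe for Temporal to retry the surrounding activity.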
Compliance scans are also automated. After a successful deploy, a workflow triggers a container that runs static code analysis and third-party license checks, publishing results to a compliance dashboard.
Cross-team notifications have been unified. When a workflow completes, a final activity calls a webhook that posts a formatted message to Slack, Teams, and the internal incident response system, ensuring every stakeholder receives consistent updates.
These extensions turn onboarding into a continuous, policy-driven experience. New hires no longer need to remember separate steps for documentation or compliance; the workflow enforces them automatically.
Temporal’s signal feature enables real-time overrides. If a security auditor needs to pause a deployment, they can send a signal to the running workflow, which then halts further activities until approval is granted.
Data from the first quarter of these extensions shows a 22% reduction in manual compliance effort and a 15% faster release note generation time.
By treating onboarding as an extensible workflow, Duolingo creates a living playbook that evolves with the organization’s needs.
Looking ahead, the team is prototyping a “mentor-hand-off” activity that pairs a senior engineer with a newcomer for a short code-review sprint, all orchestrated by Temporal’s scheduling capabilities.
Ready to try this approach in your own org? The following checklist walks you through a production-grade rollout.
Step-by-Step Guide to Replicating Duolingo’s Setup
1. Install Temporal Server: Deploy the open-source Temporal server on a Kubernetes cluster using the official Helm chart. Set the replica count to 3 for high availability.
2. Configure Namespaces: Create a dedicated namespace called ci-cd to isolate workflow executions from other business processes.
3. Define Activities: Write activity functions for each pipeline step in your preferred language. Duolingo used Python; the temporalio SDK provides decorators for activity registration.
4. Model the Workflow: Use the SDK to compose activities in the desired order, adding retry policies and timeouts. Reference the code snippet in the "Building the CI/CD Workflow" section for a template.
5. Wire to CI System: Replace the existing Bash script with a thin wrapper that starts the Temporal workflow through an SDK client, the temporal CLI, or the server's HTTP API. Pass the repository URL and commit SHA as input parameters.
6. Set Up Logging & Observability: Configure workers to ship activity logs to Elasticsearch or CloudWatch. Enable Temporal’s built-in metrics and connect them to Prometheus for dashboarding.
7. Implement Versioning: Use the SDK's versioning API (workflow.patched in the Python SDK, GetVersion in Go, Workflow.getVersion in Java) to change workflow code without disrupting in-flight executions.
8. Test in Staging: Run a pilot with a small team, monitor success rates, and adjust retry policies. Duolingo ran a 2-week pilot that cut failure rates by 70% before full rollout.
Following these steps will give you a production-ready Temporal-backed CI/CD pipeline that mirrors Duolingo’s resilient setup.
Once the pilot succeeds, scale the replica count, enable autoscaling for activity workers, and start adding non-code activities to the same namespace.
Every migration surfaces unexpected challenges; learning from Duolingo’s experience can save you time.
Best Practices and Common Pitfalls
Start Small: Begin with a single activity, such as checkout, and gradually migrate other steps. This reduces the blast radius of any misconfiguration.
Handle Versioned Workflows: When adding new activities, always guard the change behind a version check and keep the original path for older executions. Skipping this caused Duolingo’s early experiments to crash when a workflow definition changed mid-run.
Set Realistic Timeouts: Overly aggressive timeouts cause premature failures. Duolingo found that a 5-minute timeout for compile steps was too low; extending it to 12 minutes eliminated the spurious failures.
Use Heartbeats: Long-running activities should set a heartbeat timeout and send heartbeats regularly. Without a heartbeat timeout, a crashed worker is only detected when the full activity timeout expires; with one set but heartbeats missing, Temporal treats a healthy activity as stalled and retries it unnecessarily, inflating resource usage.
Isolate Secrets: Store API keys in a secret manager and inject them at activity start. Hard-coded secrets in the original script led to security warnings during internal audits.
Monitor Worker Health: Export the SDK’s worker metrics (such as task-slot availability and poller counts) and alert on them. A sudden drop in worker count usually signals a container-runtime issue that can be caught before it impacts pipelines.
Document Retry Policies: Keep a living document that explains why each activity has its specific retry count and backoff strategy. This transparency helps new hires understand the safety net built into the system.
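For the "Isolate Secrets" practice, a common pattern is to have the secret manager inject values into the worker's environment and read them when the activity starts. The sketch below assumes that setup; the helper name load_secret and the variable name ARTIFACT_REPO_TOKEN are hypothetical.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the environment."""


def load_secret(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Read a secret at activity start instead of hard-coding it."""
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        # Fail loudly and early; never fall back to a default credential.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


if __name__ == "__main__":
    token = load_secret("ARTIFACT_REPO_TOKEN", {"ARTIFACT_REPO_TOKEN": "example-token"})
    print(len(token))
```

Failing fast on a missing secret surfaces misconfiguration as a clear activity error rather than an opaque authentication failure later in the run.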
By internalizing these habits, teams can avoid the common pitfalls that turn a promising workflow engine into another source of friction.
Temporal Nexus gave Duolingo a durable, observable, and cost-effective automation layer. The numbers speak for themselves, but the real win is a smoother, confidence-boosting onboarding experience for every engineer who joins the codebase.