
From LangSmith to Stratix: A Migration Guide for Eval Pipelines
Author: The LayerLens Team
Published by the LayerLens team. LayerLens is continuous evaluation infrastructure for AI. Stratix is the evaluation engine: 200+ models, agentic benchmarks, judge optimization, and audit-ready comparisons across vendors.
TL;DR
Three patterns drive most LangSmith-to-Stratix migrations: framework lock-in becomes a tax, judge calibration plateaus, or step-level scoring becomes mandatory.
Most concepts have a direct counterpart. The structural difference: Stratix versions and optimizes the judge prompt as a first-class workflow.
Data export is straightforward: traces, datasets, and feedback all export from LangSmith and import into Stratix with one-time field-name mapping.
The fastest week-one quality lift is re-running the existing LangSmith evaluator prompt through GEPA against the team's existing labeled feedback.
Live tracing migrates in two stages: dual-write for two to four weeks, then cutover. Dual-write is the safer pattern.
Teams that started LLM evaluation on LangSmith have shipped real value with it. Tracing, eval datasets, and feedback loops all work. The pattern that drives migrations to Stratix is not "LangSmith is bad." It is that production stacks evolve and the original eval choice stops fitting.
This guide is for engineers who have a working LangSmith eval pipeline and a reason to evaluate moving it. It covers the concept mapping, the data export path, the judge re-wiring, and the gotchas that show up in week one.
When does a team consider migrating off LangSmith?
Three patterns drive most of the migrations.
Framework lock-in becomes a tax. LangSmith assumes the LangChain ecosystem. Teams running mixed stacks (some LangChain, some pure SDK calls, some custom agent frameworks) end up writing adapters. Stratix is framework-agnostic by design.
Judge calibration plateaus. Default LangSmith judges work for chat-style outputs. Calibrating against domain-specific labeled data is possible but requires custom code. Stratix ships GEPA for prompt optimization out of the box.
Step-level scoring is needed. Output-level scoring stops being enough once a system has tools. Wiring step-level evaluation against agent traces is a first-class workflow in Stratix.
If none of those apply, a migration is not urgent. If one or more apply, the rest of this guide is a working playbook.
How do LangSmith concepts map to Stratix?
Most concepts have a direct counterpart. A few have a different shape.
| LangSmith concept | Stratix equivalent | Notes |
|---|---|---|
| Trace | Trace | Same shape, captures full request/response and tool calls |
| Run | Trace step | Stratix scores at the step level natively |
| Dataset | Evaluation space (data tab) | Stratix attaches datasets to scoped evaluation spaces |
| Evaluator (LLM-as-Judge) | Judge | Stratix supports Gen 1 through Gen 4 judge architectures |
| Custom evaluator (code) | Scorer | Deterministic graders in Stratix |
| Feedback | Trace evaluation result | Stored per trace, anchored to judge version |
| Annotation queue | Labeling workflow | Stratix supports human-in-the-loop labeling on any trace |
| Experiment | Experiment | Both systems support paired runs against shared datasets |
The structural difference: Stratix treats the judge prompt itself as versioned, optimizable, and re-runnable, where LangSmith treats it as a static evaluator config. That difference shows up as a different daily workflow by week two.
What does the data export look like?
The export path is straightforward.
Step 1: Export traces. LangSmith's API supports paginated trace export as JSON. A few hundred lines of Python pull everything in a project. The Stratix Learning Hub has the script.
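A condensed sketch of the export loop, using the LangSmith Python SDK's `list_runs` (which paginates under the hood); the project name and output path are placeholders for your own values:

```python
# Sketch: paginated trace export from LangSmith to a local JSONL file.
import json

from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

with open("langsmith_traces.jsonl", "w") as out:
    # list_runs paginates for you and yields Run objects lazily
    for run in client.list_runs(project_name="my-project", is_root=True):
        record = {
            "trace_id": str(run.trace_id),  # preserve for external_id aliasing later
            "run_id": str(run.id),
            "name": run.name,
            "inputs": run.inputs,
            "outputs": run.outputs,
            "start_time": run.start_time.isoformat() if run.start_time else None,
        }
        out.write(json.dumps(record, default=str) + "\n")
```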
Step 2: Export datasets. LangSmith datasets export as JSONL. Stratix imports the same format with a one-time mapping pass to align field names.
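The one-time mapping pass in sketch form; the target field names in `FIELD_MAP` are illustrative, not the actual Stratix schema (the field-mapping reference in the Learning Hub has the real names):

```python
# Sketch: remap LangSmith dataset JSONL fields into the import format.
import json

# Illustrative mapping only; substitute the real Stratix target names.
FIELD_MAP = {"inputs": "input", "outputs": "expected_output", "metadata": "metadata"}

with open("langsmith_dataset.jsonl") as src, open("stratix_dataset.jsonl", "w") as dst:
    for line in src:
        row = json.loads(line)
        mapped = {new: row[old] for old, new in FIELD_MAP.items() if old in row}
        dst.write(json.dumps(mapped) + "\n")
```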
Step 3: Export feedback. Each LangSmith feedback record (score, comment, evaluator name, run id) becomes a Stratix trace evaluation result. The mapping is one-to-one. Anchor by the trace id when importing.
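The feedback mapping as a sketch, using `list_feedback` from the LangSmith SDK; the output record shape is illustrative:

```python
# Sketch: convert LangSmith feedback records into trace evaluation results,
# anchored by the original run/trace id.
import json

from langsmith import Client

client = Client()
run_ids = [json.loads(line)["run_id"]
           for line in open("langsmith_traces.jsonl")]

with open("stratix_feedback.jsonl", "w") as out:
    for fb in client.list_feedback(run_ids=run_ids):
        out.write(json.dumps({
            "external_trace_id": str(fb.run_id),  # the anchor id from the export
            "judge_name": fb.key,                 # e.g. "correctness"
            "score": fb.score,
            "comment": fb.comment,
        }) + "\n")
```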
Step 4: Validate. Pick five traces at random, re-run them through the new judge in Stratix, and compare scores side by side. Score deltas under 0.5 (on a 1-to-5 scale) are normal noise. Anything larger means the rubric was reinterpreted across systems and needs explicit re-tuning.
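The validation pass as a sketch; `score_in_stratix` is a hypothetical stand-in for re-running one trace through the new Stratix judge:

```python
# Sketch: re-score five random traces and flag deltas above the 0.5 noise band.
import json
import random

def score_in_stratix(trace):
    """Hypothetical stand-in: re-run one trace through the new Stratix judge."""
    raise NotImplementedError

traces = [json.loads(line) for line in open("langsmith_traces.jsonl")]

old_scores = {}
for line in open("stratix_feedback.jsonl"):
    rec = json.loads(line)
    old_scores[rec["external_trace_id"]] = rec["score"]

for trace in random.sample(traces, 5):
    new = score_in_stratix(trace)
    old = old_scores.get(trace["run_id"])
    if old is not None and abs(new - old) > 0.5:
        print(f"{trace['run_id']}: delta {abs(new - old):.2f} exceeds 0.5, "
              "rubric likely reinterpreted; re-tune before cutover")
```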
The full migration of a project with around 50,000 traces typically completes in an afternoon, not a week.
What changes about the judge?
This is where teams see the biggest lift.
In LangSmith, the judge is usually configured once and runs unchanged for months. In Stratix, judges live inside an evaluation space, are versioned on every change, and can be optimized with GEPA against a labeled set. The first migration step that delivers measurable quality lift is re-running the existing rubric through GEPA.
The pattern (a sketch of the step-4 comparison follows the list):
1. Take the existing LangSmith evaluator prompt as the seed.
2. Use the labeled feedback the team already collected as the GEPA fitness set.
3. Run a single GEPA pass.
4. Compare seed-prompt agreement and tuned-prompt agreement on the held-out set.
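A minimal sketch of that comparison, using Cohen's kappa from scikit-learn; `run_judge`, the two prompts, and the held-out set are hypothetical placeholders for the team's own artifacts:

```python
# Sketch: compare seed-prompt vs GEPA-tuned-prompt agreement with human labels.
# run_judge(prompt, example) is a hypothetical helper that returns the judge's
# label for one held-out example.
from sklearn.metrics import cohen_kappa_score

def agreement(judge_prompt, held_out_set):
    judge_labels = [run_judge(judge_prompt, ex) for ex in held_out_set]
    human_labels = [ex["human_label"] for ex in held_out_set]
    return cohen_kappa_score(human_labels, judge_labels)

seed_kappa = agreement(seed_prompt, held_out_set)    # imported LangSmith prompt
tuned_kappa = agreement(tuned_prompt, held_out_set)  # after one GEPA pass
print(f"seed agreement: {seed_kappa:.2f}, tuned agreement: {tuned_kappa:.2f}")
```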
In documented runs, this single step lifts judge-human agreement by 8 to 15 points. The team did not change models, did not relabel data, did not write new code. The prompt got tuned.
What about live tracing?
Stratix accepts traces through a standard SDK with a drop-in client. For most teams, the live-tracing migration happens in two stages.
Stage 1: Dual-write. The LangChain instrumentation (or custom SDK) writes traces to both LangSmith and Stratix in parallel for two to four weeks. Both systems see the same production data. The team validates Stratix is capturing the right signal before cutting over.
Stage 2: Cutover. Once Stratix is the source of truth and the team is confident in the dashboards and judges, the LangSmith write path is removed. LangSmith historical data remains queryable through the export above.
Dual-write is the safer pattern. It also surfaces any instrumentation gaps before they matter.
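One way to implement the fan-out at the instrumentation layer, as a sketch; both writer objects are hypothetical stand-ins for the real LangSmith and Stratix clients:

```python
# Sketch: write every trace to both backends during the dual-write window.
import logging

log = logging.getLogger(__name__)

class DualWriter:
    def __init__(self, primary, secondary):
        self.primary = primary      # LangSmith: still the source of truth
        self.secondary = secondary  # Stratix: being validated

    def write_trace(self, trace):
        self.primary.write_trace(trace)
        try:
            self.secondary.write_trace(trace)
        except Exception as exc:
            # A failure on the validation path must never break production writes.
            log.warning("Stratix dual-write failed: %s", exc)
```

Cutover then reduces to swapping `primary` and removing the secondary path.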
What are the week-one gotchas?
Every migration hits at least one of these.
Trace-id mismatches. LangSmith and Stratix mint trace IDs differently. If the migration script does not preserve the original trace ID as an alias, cross-references between historical LangSmith feedback and new Stratix scores will break. The Stratix import supports an external_id field for exactly this case.
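In payload form it looks like this; `external_id` is the import field named above, `langsmith_trace` is one record from the export file, and the other field names are illustrative:

```python
# Sketch: carry the original LangSmith trace id into the Stratix import
# as an alias so historical cross-references keep resolving.
payload = {
    "external_id": langsmith_trace["trace_id"],  # original LangSmith id
    "inputs": langsmith_trace["inputs"],
    "outputs": langsmith_trace["outputs"],
}
```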
Tool-call schema drift. LangSmith and Stratix store tool calls in slightly different shapes. The migration tool normalizes them. Check the first 20 imported traces by hand to confirm tool inputs and outputs round-tripped correctly.
Judge model mismatch. If the LangSmith judge was running on a particular base model and the Stratix judge defaults to a different one, the scores will not match even on identical prompts. Set the model explicitly in the Stratix judge config.
Dataset format edge cases. Multi-turn conversations exported from LangSmith sometimes flatten the turns. Spot-check a few multi-turn examples in the Stratix import and re-format if needed.
None of these are blocking. All of them are hours, not days.
Where do most teams land in week two?
The two changes that show up in almost every post-migration retrospective:
1. Judge agreement with human labels improves. GEPA-tuned judges land 8 to 20 points higher on Cohen's kappa than the seed prompts the team brought from LangSmith.
2. Step-level scoring exposes silent failures. Teams that only had output-level scoring on LangSmith discover lucky-path traces and reroll patterns they did not know existed. The first round of step-level findings usually drives a prompt or tool fix that lifts production quality directly.
Both are common. Both compound over time.
Further reading
What is continuous evaluation? A working definition for production AI teams
Judge optimization with GEPA: tuning evaluation prompts at scale
Step-level evaluation versus output-level evaluation for agent traces
Run the migration in the Stratix Learning Hub
The Stratix Learning Hub has the export scripts, the field-mapping reference, and a worked dual-write example so a team can see the migration end to end before touching production data.
Key Takeaways
Migration is not urgent unless framework lock-in, judge calibration, or step-level scoring is binding the team's progress.
The first GEPA pass on the imported LangSmith evaluator prompt typically lifts judge-human agreement by 8 to 15 points with no model change.
Trace-id mismatches, tool-call schema drift, judge model mismatch, and dataset format edge cases are the four week-one gotchas.
Step-level scoring on imported traces routinely surfaces silent failures the team did not know existed, driving the first round of post-migration prompt and tool fixes.
Frequently Asked Questions
How long does a LangSmith to Stratix migration take?
For a project with around 50,000 traces, the data migration completes in an afternoon, not a week. Live-trace cutover usually runs as a two-to-four-week dual-write window for safety, then the LangSmith write path is removed.
Does Stratix support LangChain?
Yes. Stratix is framework-agnostic and supports LangChain, custom SDK calls, and other agent frameworks side by side. That is the main reason teams running mixed stacks migrate.
Will scores match between LangSmith and Stratix on the same trace?
Score deltas under 0.5 (on a 1-to-5 scale) are normal noise during migration. Anything larger usually means the rubric was reinterpreted across systems and needs explicit re-tuning. Always validate with five randomly sampled traces before cutover.
What is the dual-write window?
A two-to-four-week period where the production instrumentation writes traces to both LangSmith and Stratix in parallel. The team validates Stratix is capturing the right signal before cutting over the source of truth. It is the safer migration pattern.
Methodology
Migration patterns and gotchas described here aggregate documented LangSmith-to-Stratix migrations across Stratix evaluation spaces. Lift figures (8 to 15 points of judge-human agreement after the first GEPA pass on imported prompts) reflect typical week-one results when teams bring an existing LangSmith judge prompt as the seed and use their existing labeled feedback as the GEPA fitness set.
Open the Stratix Learning Hub to run the templates and worked examples referenced in this post against your own data.