What "quality guarantee" usually means and what we mean

Most annotation vendors put a quality guarantee on their pricing page. In practice, "guarantee" often means: if you find errors, they fix them. The vendor decides what counts as an error. There's no measurable target and no contractual recourse.

Our quality guarantee is structured differently. There's a measured target. There's a defined measurement protocol. There's a defined consequence if we miss the target. The structure is the same whether you're running a 500-sample pilot or a 5-million-feature production project. Here's how it works.

The target: 98%+ F1 on infrastructure classes

For the asset classes where we have established expertise (road signs, lane markings, utility poles, traffic signals, pavement features, guardrails, crosswalks), our quality target is 98% F1 score against the agreed schema. F1 is the harmonic mean of precision and recall — it captures both false positives (labels that shouldn't exist) and false negatives (features the annotation missed).
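
For concreteness, here's the standard F1 computation from raw per-class counts. This is a minimal sketch, not our scoring pipeline, and the counts are illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall.

    tp: labels that match the reference,
    fp: labels that shouldn't exist (false positives),
    fn: features the annotation missed (false negatives).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative: 980 correct labels, 20 spurious, 20 missed
print(round(f1_score(980, 20, 20), 4))  # 0.98
```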

98% is what we routinely hit. It's not the absolute ceiling: on some projects we measure 99%+, on others 96-97% on the first pass. But 98% is the contractual floor.

For non-standard classes (custom schemas, novel asset types), we set the target during the pilot, when we have measured performance to anchor on. We don't quote a target before we've labeled enough samples to know what's achievable.

How we measure: held-out test set

Before the pilot starts, we set aside a held-out test set: typically 50-200 frames that the annotators don't see during labeling. The client labels the held-out set themselves (or has their domain expert label it). After we deliver, we score our labels against the held-out set's labels.

The scoring is per-class. If we hit 98% F1 on stop signs, 96% on lane markings, and 92% on a tricky class like "construction-zone temporary signage," we report all three numbers. We don't average them and claim 95.3%.
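
In code terms, per-class scoring looks roughly like the sketch below. The hard part, geometrically matching our labels to the held-out labels, is assumed to have already happened and is not shown; the class names and match format are placeholders.

```python
from collections import defaultdict

def per_class_f1(match_outcomes):
    """Compute one F1 per class from matching outcomes.

    `match_outcomes` is a list of (class_name, outcome) pairs, where
    outcome is "tp" (our label matches a held-out label), "fp" (our label
    has no match), or "fn" (a held-out label we never matched). The
    geometric matching step that produces these pairs is omitted.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for cls, outcome in match_outcomes:
        counts[cls][outcome] += 1

    report = {}
    for cls, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0
        report[cls] = 2 * p * r / (p + r) if (p + r) else 0.0
    return report

# Reported per class, as-is: {"stop_sign": 0.98, "lane_marking": 0.96, ...}
```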

This measurement happens for every batch, not just the pilot. Production deliveries include a QA report with per-batch, per-class F1 scores against a rolling test set.

How we hit the target: three-pass QA

Pass 1: annotator self-review

Every annotator reviews their own work at the end of a job. CVAT's interface flags any annotations that violate basic schema constraints (missing required attributes, geometry outside acceptable bounds). The annotator fixes those before submitting.
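
The checks in this pass are mechanical. Here's an illustration of the kind of constraint being enforced; this is a standalone sketch, not CVAT's internal validation code, and the annotation format is a placeholder.

```python
def schema_violations(annotation, required_attrs, image_w, image_h):
    """Return a list of mechanical schema errors for one annotation.

    Illustration only. Assumes `annotation` looks like
    {"attributes": {...}, "bbox": (x_min, y_min, x_max, y_max)}.
    """
    problems = []
    for attr in required_attrs:
        if attr not in annotation.get("attributes", {}):
            problems.append(f"missing required attribute: {attr}")

    x_min, y_min, x_max, y_max = annotation["bbox"]
    if x_min < 0 or y_min < 0 or x_max > image_w or y_max > image_h:
        problems.append("geometry outside image bounds")
    if x_min >= x_max or y_min >= y_max:
        problems.append("degenerate geometry")
    return problems
```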

This pass catches mechanical errors and is usually 95-98% accurate on its own. It's the floor, not the goal.

Pass 2: peer review

A second annotator reviews the first annotator's work on a sample of every job. The sample percentage scales with measured quality — high-performing annotators get smaller samples; new or under-performing annotators get larger samples (up to 100%).
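
The scaling rule amounts to: the lower your recent measured quality, the more of your work gets a second pair of eyes. A hypothetical version of that rule, with illustrative thresholds rather than our actual numbers:

```python
def review_sample_rate(recent_f1: float, jobs_completed: int) -> float:
    """Fraction of an annotator's job that goes to peer review.

    Thresholds are illustrative, not the values we actually use.
    """
    if jobs_completed < 5:      # new annotator: review everything
        return 1.0
    if recent_f1 >= 0.98:       # consistently at target: light sampling
        return 0.10
    if recent_f1 >= 0.96:
        return 0.25
    if recent_f1 >= 0.94:
        return 0.50
    return 1.0                  # under-performing: back to full review
```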

Peer review catches inter-annotator disagreement. When the reviewer disagrees with the original annotator, the case escalates to a senior reviewer for a final call, and both annotators see the resolution. This is how the team calibrates.

Pass 3: senior spatial QA

A senior reviewer with GIS credentials runs a final spatial QA pass on every batch. The reviewer uses our QA dashboard, which overlays the labels on authoritative spatial layers (where available) and flags labels that disagree.

Spatial QA catches the class of errors that visual QA misses entirely — geographically impossible labels, misaligned coordinate systems, classification errors that only become visible when you can see the surrounding network. This is the pass that takes annotation quality from "looks right" to "is right."
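
One of the simplest checks in this class: flag labels that sit implausibly far from any feature in an authoritative layer. Below is a minimal sketch using shapely; the library choice, distance threshold, and data layout are assumptions for illustration, and the dashboard's actual implementation is more involved.

```python
from shapely.geometry import Point, LineString

def flag_far_from_network(labels, authoritative_features, max_dist_m=5.0):
    """Flag labeled points that are too far from any authoritative feature.

    Sketch only. Assumes both inputs are shapely geometries in a projected
    CRS measured in meters; reprojection and attribute checks are omitted.
    `labels`: list of (label_id, Point); `authoritative_features`: list of
    geometries such as road centerlines.
    """
    flagged = []
    for label_id, point in labels:
        dist = min(point.distance(feat) for feat in authoritative_features)
        if dist > max_dist_m:
            flagged.append((label_id, round(dist, 1)))
    return flagged

# Hypothetical data: the second sign sits 200 m from the nearest centerline
signs = [("sign_17", Point(10.0, 5.0)), ("sign_18", Point(300.0, 5.0))]
centerlines = [LineString([(0, 0), (100, 0)])]
print(flag_far_from_network(signs, centerlines, max_dist_m=25.0))
# -> [('sign_18', 200.1)]
```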

What happens when we miss the target

Contractually, when our delivered batch measures below the target F1, we re-work the batch at our cost until it hits target. Not a discount on the next batch. Not a credit. Actual re-work.

In practice this rarely triggers. When it does, it's usually because the schema needed a clarification we hadn't anticipated. We update the schema, re-label the affected features, and re-deliver.

The clients who care most about this provision are usually the ones who use the data for safety-critical or regulatory work. They want to know that the financial incentives are aligned with quality, not with shipping volume.

Why outsourcing this works

Three reasons clients pick us over building an in-house annotation team.

Specialization compounds. We've labeled millions of features across road infrastructure, utility assets, environmental features, and remote-sensing imagery. The team knows the failure modes, the edge cases, the schema discussions that need to happen up front. An in-house team building this knowledge from scratch takes 12-18 months and burns through three project cycles to get there.

Capacity is elastic. During training-data sprints, you need 6-10 annotators for 8 weeks. Between sprints, you need 1. Hiring and laying off annotators on that schedule isn't a viable strategy. Outsourcing converts the capacity problem to a vendor problem.

Tooling and infrastructure are amortized. Running CVAT at scale, maintaining the QA dashboard, integrating with authoritative GIS layers, building the export pipelines: these are real engineering investments. We've made them once and amortize them across every client. An in-house team makes those investments per-company.

What we're not good at

Honest list.

We're not the cheapest. If you're optimizing for $0.10/frame on simple classes, there are vendors who'll do it. Our price floor is higher because the three-pass QA process costs more to run.

We're not the fastest to ramp. New clients usually need 1-2 weeks before we're at full production speed. The first week is schema work and the pilot. Vendors with templated workflows and pre-trained generic classes can start labeling on day 2; we tend to deliver a more usable result if you can wait a week.

We don't operate at the scale of mega-vendors. Our peak monthly capacity is around 1.5M features. Above that, we'd recommend Scale AI or similar.


Want to run a pilot under these guarantees? Fixed-fee 500-sample pilot in 2-4 business days. The QA report you get back will show per-class F1 against your held-out set. You decide whether to scale based on measured performance, not pitched promises.