How Basewise manages hallucination risk in Systems Engineering

Jun 13, 2026 9:48:48 AM

AI in Systems Engineering does not have to be perfect to be useful. It does, however, have to be reliable enough to trust in practice. This is how we engineer for that.

That is the real concern for most systems engineers. AI can produce fluent output relatively easily; the hard part is getting that output to be consistent, correct, and direct. It matters even more because requirements work, verification, and engineering analysis are subject to compliance checks, so hallucination poses a real risk to the process.

Basewise does not claim to be hallucination-free, and we do not guarantee that AI output will never be wrong. That would not be a credible claim in a systems engineering context. What we do claim is more practical: Basewise is engineered to manage hallucination risk to a very high reliability level, so that engineers can substantially reduce manual effort while keeping final control where it belongs.

For systems engineering teams, that is the useful operating model. We are not trying to remove judgment. We are trying to reduce the manual burden around the work, while keeping the output grounded, explainable, and reviewable. In effect, our mission is to scale and deploy human expertise and to limit waste.


Why hallucination is a serious issue in Systems Engineering

Systems Engineering is particularly vulnerable to hallucination because meaning is highly contextual. Requirements often depend on contract language, stakeholder intent, standards, phase-specific expectations, and interfaces with other parts of the system. Small interpretation errors can propagate into requirements, interface agreements, verification claims, or even safety arguments — and those errors can become costly to unwind later.

That is why Basewise treats AI output as a proposal, not as evidence.

For output to be trusted in an engineering workflow, it must be possible to show:

  • Which sources were used.
  • What assumptions were made.
  • How the result fits the baseline.
  • What review and acceptance steps were completed.

Put simply: trust should only extend to the level at which output can be verified, traced, and audited. Anything beyond that still needs explicit human oversight.


Our reliability philosophy: hybrid, layered, and reviewable

Basewise uses a hybrid reliability approach. We do not rely on one prompt, one model, or one control to keep hallucination risk down. The problem is too complex for any single safeguard or check to handle on its own. Reliability comes from the way they work together.


The safeguards work together, not in rank order

It is tempting to ask which safeguard matters most. In reality, none of them is sufficient on its own.

Prompts tailored to the task and the model

All prompts in Basewise are app-specific and model-specific. They are tuned through large-scale variance testing and follow structured paths to collect, analyse, interpret, score, and decide.

That matters because many hallucination issues start with task ambiguity. If the prompt leaves too much room for interpretation, the model may mix analysis, rewriting, assumption-making, and decision logic in ways that are hard to review later.

Basewise avoids that by using multi-step prompts in every application. Some workflows are also multi-agent, but even where they are not, the task is still broken down into controlled stages rather than handled as one unconstrained request.
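
The staged approach described above can be sketched in a few lines. This is a minimal illustration, not Basewise's implementation: the stage names, the `run_stages` helper, and the lambda stand-ins for model calls are all assumptions made for the example. The point it shows is that each stage receives only the previous stage's output and every intermediate result is kept for review.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StageResult:
    stage: str
    output: str


@dataclass
class Trace:
    steps: list = field(default_factory=list)


def run_stages(source_text: str, stages: list) -> Trace:
    """Run each stage on the previous stage's output and record every step."""
    trace = Trace()
    current = source_text
    for name, stage_fn in stages:
        current = stage_fn(current)
        trace.steps.append(StageResult(stage=name, output=current))
    return trace


# Hypothetical stand-ins for model calls; a real stage would call a model.
stages = [
    ("collect", lambda text: text.strip()),
    ("analyse", lambda text: f"analysis({text})"),
    ("decide", lambda text: f"decision({text})"),
]

trace = run_stages("  The pump shall start within 5 s.  ", stages)
for step in trace.steps:
    print(step.stage, "->", step.output)
```

Because the trace holds one entry per stage, a reviewer can see where an interpretation entered the result rather than only seeing the final answer.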

Model choice based on fitness, not size alone

Larger models are not always better. In some engineering tasks, a smaller model can be more deterministic, more focused, and less likely to get distracted by unnecessary reasoning paths.

Basewise uses multiple models, each tested for purpose and each instructed differently depending on the model type. We route tasks to different models:

  • a larger model for interpretation, reasoning, or more complex operations
  • a smaller model for determinism, speed, and tighter task discipline

The rule is straightforward: output quality comes first, cost comes second. Different parts of the same app may need different behaviour. A good engineering AI system should not treat every task as if it requires the same kind of intelligence.
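
As a rough illustration of routing by fitness, the sketch below maps task kinds to a model tier. The task kinds, the tier names `large-model` and `small-model`, and the assignment of extraction and scoring to the smaller tier are assumptions made for this example; the article only states that interpretation and reasoning go to a larger model and that determinism and speed favour a smaller one.

```python
from enum import Enum


class TaskKind(Enum):
    INTERPRETATION = "interpretation"
    REASONING = "reasoning"
    EXTRACTION = "extraction"
    SCORING = "scoring"


# Hypothetical tier names, not real model identifiers.
LARGE_MODEL = "large-model"
SMALL_MODEL = "small-model"

# Assumed mapping: open-ended work goes large, tightly scoped work goes small.
ROUTES = {
    TaskKind.INTERPRETATION: LARGE_MODEL,
    TaskKind.REASONING: LARGE_MODEL,
    TaskKind.EXTRACTION: SMALL_MODEL,
    TaskKind.SCORING: SMALL_MODEL,
}


def route(task: TaskKind) -> str:
    """Return the model tier assigned to a task kind."""
    return ROUTES[task]


for kind in TaskKind:
    print(kind.value, "->", route(kind))
```

Keeping the mapping in one explicit table means a routing decision can be retested and changed per task without touching the rest of the workflow.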

Evidence and retrieval are critical — especially in REF

For REF in particular, retrieval is not a thin add-on. It is a core part of the reliability model.

REF is heavily grounded in user-provided data, and the retrieval pipeline is designed for cases where users may provide very large document sets — sometimes over a gigabyte — where detail matters. As much engineering has gone into the retrieval mechanisms as into prompting and model usage.

Basewise uses a customized engine with dynamic search strategies, review paths, escalation logic, confidence scoring, and analysis trace. The process is designed to stay close to the original evidence rather than allowing the model to drift into unsupported synthesis.

This is also where multi-pass review matters. Evidence retrieval and assessment are reviewed through more than one step to reduce bias and variance, especially when interpretation is not straightforward.

That principle matters for requirements, interface agreements, and verification claims. In many cases, Basewise requires source citation both inside the model workflow and in the output presented to the user.

Forced citation helps in two ways: it constrains the system toward grounded output, and it gives the human reviewer a direct route back to the original material.
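
One simple way to picture forced citation is a gate that refuses to pass on any finding without a source. The `Finding` structure and the `split_by_citation` helper below are hypothetical names used for illustration only; they are a sketch of the principle, not the Basewise engine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    statement: str
    source_document: Optional[str] = None
    location: Optional[str] = None


def split_by_citation(findings):
    """Separate findings that cite a document and location from those that do not."""
    cited, uncited = [], []
    for finding in findings:
        if finding.source_document and finding.location:
            cited.append(finding)
        else:
            uncited.append(finding)
    return cited, uncited


findings = [
    Finding("Pump start time is specified as 5 s.", "spec_v3.pdf", "section 4.2"),
    Finding("Pump is certified for outdoor use."),  # no source given
]
cited, uncited = split_by_citation(findings)
print("kept:", [f.statement for f in cited])
print("sent back for grounding:", [f.statement for f in uncited])
```

In this sketch, uncited findings are not discarded silently. They are returned separately so a later step can either find supporting evidence or flag them for the reviewer.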

Confidence scoring

Confidence scoring works hand in hand with evidence and traceability. It helps engineers quickly see where the output is well-supported and where the source material is more ambiguous.

A good review workflow should not force engineers to spend the same effort on every item. High-certainty items can usually be reviewed quickly. Lower-confidence items deserve closer attention. Basewise makes that distinction visible through confidence scoring and user-readable analysis output.
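
The review split described above can be sketched as a small triage function. The 0.8 threshold, the 0 to 1 score scale, and the requirement identifiers are assumptions chosen for the example, not values taken from Basewise.

```python
def triage(items, threshold=0.8):
    """Split (item_id, confidence) pairs into quick-review and close-review lists.

    Close-review items are sorted so the least certain come first.
    """
    quick = [item for item in items if item[1] >= threshold]
    close = sorted(
        (item for item in items if item[1] < threshold),
        key=lambda item: item[1],
    )
    return quick, close


# Hypothetical identifiers and scores.
items = [("REQ-001", 0.95), ("REQ-002", 0.42), ("REQ-003", 0.88), ("REQ-004", 0.67)]
quick, close = triage(items)
print("quick review:", [item_id for item_id, _ in quick])
print("close review:", [item_id for item_id, _ in close])
```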

Agent separation where needed

Some tasks are better handled by separating them into distinct jobs. Where necessary, Basewise uses agent separation to reduce task ambiguity and minimise the risk that analysis, interpretation, scoring, and decision logic become muddled. This is especially useful when one stage of the process should challenge or verify the result of another.

The point is not complexity for its own sake. The point is to keep each part of the reasoning pipeline focused and reviewable.

Human review remains part of the design

Basewise supports both more autonomous and human-in-the-loop workflows. In both cases, we recommend human verification.

That is not because the output is weak. It is because engineering accountability still matters. In systems engineering, the goal is not to eliminate expert involvement. The goal is to reduce manual effort on the straightforward cases and reserve expert time for the edge cases and judgement calls.


How this works in each application

The reliability controls are shared across the platform, but they are applied differently in each app — because the risks are different.

Extracting requirements without losing the source

DRE scans documents structurally and identifies requirements, constraints, assumptions, obligations, and implicit expectations. It is designed to read concepts, not just sentences, and to turn source documents into structured output that can feed downstream tools such as DOORS, Relatics, and related requirements workflows.

From a hallucination-risk perspective, DRE is protected by sophisticated task-specific prompting, model choice suited to extraction and rewrite behaviour, task separation and sequencing to limit distractions, and traceability through explanation.

Breaking complex analysis into smaller, reviewable steps

RQA has a different character than DRE because it is judging quality, identifying issues, and proposing improvements. It combines deterministic hard-rule checks with semantic analysis, evaluating requirement quality against INCOSE-aligned criteria: vagueness, absolutes, compound requirements, and other quality problems.

What matters most for hallucination control is the structure of the workflow. RQA breaks the analysis into smaller steps with internal and reported traceability at each stage. This compartmentalisation is critical because RQA can involve more than 15 distinct analyses before moving from analysis → observation → improvement → improved output.

If all of that happened in one loose prompt, the output would be much harder to trust. By breaking it into smaller stages, Basewise reduces the chance of unsupported jumps in logic and makes the reasoning more inspectable for the user.
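
To give a feel for the deterministic hard-rule side of such a check, here is a minimal sketch that flags vague terms, absolutes, and compound requirements with plain pattern matching. The word lists and the "more than one shall" rule are illustrative assumptions; they are not the RQA rule set and not an official INCOSE list.

```python
import re

# Illustrative word lists only.
VAGUE_TERMS = ("adequate", "appropriate", "sufficient", "user-friendly")
ABSOLUTES = ("always", "never", "all", "every")


def _contains_word(term, text):
    """True if term appears in text as a whole word."""
    return re.search(rf"\b{re.escape(term)}\b", text) is not None


def check_requirement(text):
    """Return a list of rule violations found in one requirement statement."""
    lowered = text.lower()
    issues = [f"vague term: {t}" for t in VAGUE_TERMS if _contains_word(t, lowered)]
    issues += [f"absolute: {t}" for t in ABSOLUTES if _contains_word(t, lowered)]
    shall_count = len(re.findall(r"\bshall\b", lowered))
    if shall_count > 1:
        issues.append(f"compound: {shall_count} 'shall' clauses")
    return issues


print(check_requirement(
    "The system shall always provide adequate cooling and shall log faults."
))
print(check_requirement(
    "The pump shall start within 5 seconds of receiving the start command."
))
```

Each rule produces its own labelled finding, which is what makes a result traceable: a reviewer can see which check fired and why, independently of any semantic analysis layered on top.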

Heavily grounded, retrieval-first, and phase-aware

REF is the most obviously evidence-based app in the platform. It performs live retrieval across user-provided documents and produces a verification register that includes the evidence found, source document, location in the document, score, verdict, and a substantiated explanation. Verdicts include Pass, Partial, and Fail.

REF is also phase-aware. The evidence threshold changes depending on whether the project is in the design phase or the delivery phase. That is important because one common hallucination pattern is overstating plans or intent as if they were completed proof. REF is designed to avoid that by judging evidence in context.
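
A phase-dependent threshold can be sketched as follows. The register fields mirror the ones listed above (evidence, source document, location, score, verdict, explanation); the `requirement_id` field, the numeric thresholds, the 0 to 1 score scale, and the assumption that the delivery phase is the stricter one are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical thresholds: delivery demands stronger evidence than design.
THRESHOLDS = {
    "design": {"pass": 0.7, "partial": 0.4},
    "delivery": {"pass": 0.9, "partial": 0.6},
}


@dataclass
class RegisterEntry:
    requirement_id: str  # assumed field, not named in the article
    evidence: str
    source_document: str
    location: str
    score: float
    verdict: str
    explanation: str


def verdict_for(score: float, phase: str) -> str:
    """Map an evidence score to Pass, Partial, or Fail for the given phase."""
    limits = THRESHOLDS[phase]
    if score >= limits["pass"]:
        return "Pass"
    if score >= limits["partial"]:
        return "Partial"
    return "Fail"


score = 0.75
for phase in ("design", "delivery"):
    entry = RegisterEntry(
        requirement_id="REQ-010",
        evidence="Test plan describes the cooling verification procedure.",
        source_document="test_plan.pdf",
        location="section 3.1",
        score=score,
        verdict=verdict_for(score, phase),
        explanation=f"Scored against {phase}-phase thresholds.",
    )
    print(phase, "->", entry.verdict)
```

The same evidence scores differently by phase in this sketch: a documented plan may be enough during design but only partial support once delivery evidence is expected.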

In this app more than any other, Basewise’s reliability comes from the combination of retrieval engineering, structured reasoning, source citation, confidence scoring, and traceable output.


What happens when Basewise is uncertain

This is one of the most important parts of the design.

If Basewise is not confident, it does not simply present a polished answer and hope the user accepts it. Instead, it can escalate the case into more complex reasoning loops and additional evaluations. If uncertainty remains, the item is flagged with a lower confidence score and the reason for the uncertainty is made visible.
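
The escalation behaviour described above can be pictured as a loop over progressively deeper evaluations. This is a sketch under stated assumptions: the `assess_with_escalation` helper, the 0.8 acceptance level, and the two stub evaluators are invented for illustration and do not describe Basewise internals.

```python
def assess_with_escalation(item, evaluators, accept_at=0.8):
    """Try evaluators in order of depth and stop at the first confident result.

    Each evaluator returns (score, reason). If none reaches accept_at, the
    last result is returned with flagged=True so the uncertainty stays visible.
    """
    score, reason, passes = 0.0, "no evaluation was run", 0
    for evaluate in evaluators:
        score, reason = evaluate(item)
        passes += 1
        if score >= accept_at:
            return {"score": score, "flagged": False, "reason": reason, "passes": passes}
    return {"score": score, "flagged": True, "reason": reason, "passes": passes}


# Hypothetical stub evaluators standing in for simple and deeper reasoning passes.
def quick_pass(item):
    return 0.55, "wording allows two readings"


def deep_pass(item):
    return 0.65, "sources disagree on the applicable standard"


result = assess_with_escalation("REQ-020", [quick_pass, deep_pass])
print(result)
```

Here both passes stay below the acceptance level, so the item comes back flagged with the reason from the deepest pass attached instead of being presented as a settled answer.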

Some requirements or evidence sets are genuinely ambiguous. In those cases, humans often find the material open to interpretation as well. A good AI system should not pretend that ambiguity has disappeared. It should surface it early and clearly.


A note on repeatability and testing

Basewise also tests for output variance at scale. Benchmark testing has gone through hundreds of runs to identify even small fluctuations in output. The internal accuracy standard used here is strict: variations in score, analysis, or advice all count as variation, even if the final recommendation remains broadly similar.
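
A strict variance measure of this kind can be sketched as the share of runs that differ from the most common outcome, where any difference in score, analysis, or advice counts. The `variation_rate` helper and the sample runs are hypothetical; the actual benchmark method is not described in this article.

```python
from collections import Counter


def variation_rate(runs):
    """Fraction of runs that differ from the most common (score, analysis, advice)."""
    if not runs:
        raise ValueError("at least one run is required")
    counts = Counter(runs)
    _, most_common_count = counts.most_common(1)[0]
    return 1 - most_common_count / len(runs)


# Hypothetical repeated runs of the same input.
runs = [
    (0.82, "vague term found", "quantify the limit"),
    (0.82, "vague term found", "quantify the limit"),
    (0.82, "vague term found", "quantify the limit"),
    (0.80, "vague term found", "quantify the limit"),  # score differs slightly
]
print(variation_rate(runs))
```

Under this definition the fourth run counts as a variation even though its advice is identical, which matches the strict standard described above.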

That kind of testing does not mean the system is perfect. It does show that reliability is being treated as an engineering problem, not as a marketing claim.

Conclusion

Basewise does not eliminate hallucination risk, and we do not guarantee hallucination-free output. What we do provide is a hybrid, layered reliability approach designed to make AI useful in real systems engineering workflows. The aim is practical:

  • produce highly reliable output
  • keep it grounded in evidence and traceability
  • make uncertainty visible
  • reduce the amount of manual engineering effort required
  • leave the final call with the expert

In practice, that means systems engineers will usually move quickly through the high-certainty output and spend more time on the smaller number of difficult or ambiguous cases. That is exactly where expert effort is most valuable.