# Wazuh Detection Harness: automated alert validation per ATT&CK technique
A Python tool that queries the Wazuh Indexer REST API after Atomic Red Team tests and tells you — per technique — whether your detection rules actually fired. Not "does the rule exist in the config." Does it produce an alert when the attack runs. That's a different question. This harness answers it.
## Context
A rule can be syntactically valid, deployed to Wazuh, and still never fire in practice. The logsource might not match. The Sysmon config might not capture the required event. The agent might not have the right policy. None of that is visible from the rule file alone. The only way to know a detection works is to trigger the attack, look for the alert, and document whether it appeared. This harness automates that verification step — write a YAML spec, run the harness, get a pass/fail report with evidence artifacts for every test.
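The core of that verification step is a time-bounded query against the alerts index. A minimal sketch of how such a query could be assembled from one test spec, assuming field paths from the standard `wazuh-alerts` mapping (`rule.id`, `rule.groups`, `rule.mitre.id`, `agent.name`); the function and spec key names are illustrative, not the harness's actual API:

```python
from datetime import datetime, timedelta, timezone

def build_alert_query(test_spec: dict) -> dict:
    """Assemble an OpenSearch bool query for one test spec.

    Sketch only: spec keys mirror expected_detections.yaml, and the
    must_contain strings are assumed to be checked client-side after
    the hits come back, so they do not appear in the query itself.
    """
    now = datetime.now(timezone.utc)
    since = now - timedelta(minutes=test_spec.get("lookback_minutes", 30))
    must = [
        {"range": {"timestamp": {"gte": since.isoformat()}}},
        {"term": {"agent.name": test_spec["agent_name"]}},
    ]
    expected = test_spec["expected"]
    if "rule_id" in expected:
        must.append({"term": {"rule.id": expected["rule_id"]}})
    for group in expected.get("rule_groups", []):
        must.append({"term": {"rule.groups": group}})
    for tid in expected.get("mitre_ids", []):
        must.append({"term": {"rule.mitre.id": tid}})
    return {"query": {"bool": {"must": must}}, "size": 50}
```

Keeping the query a pure function of the spec is also what makes `query_debug.json` cheap to produce: the exact request body can be serialized before it is ever sent.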
## Problem

- Language: Python 3
- Input: `expected_detections.yaml` — per-test specs with rule ID, rule groups, MITRE technique IDs, and must-contain strings
- Query target: Wazuh Indexer (OpenSearch) via REST API — `wazuh-alerts-4.x-*` index
- Auth: environment variables (`WAZUH_INDEXER_HOST`, `WAZUH_INDEXER_USER`, `WAZUH_INDEXER_PASS`)
- Output: timestamped run folder with `report.md`, `matches.json`, `query_debug.json`
- Exit code: non-zero if any expected detection fails — CI-compatible
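The environment-variable contract above can be captured in a small loader. A sketch under the assumption that missing credentials should abort the run early, and that `WAZUH_TLS_INSECURE` is the documented opt-in for lab self-signed certs (the function name and returned dict shape are illustrative):

```python
import os

def load_indexer_config(env=os.environ) -> dict:
    """Read Wazuh Indexer connection settings from the environment.

    Fails fast with a clear message if a required variable is unset;
    TLS verification stays on unless explicitly disabled.
    """
    try:
        cfg = {
            "host": env["WAZUH_INDEXER_HOST"],
            "user": env["WAZUH_INDEXER_USER"],
            "password": env["WAZUH_INDEXER_PASS"],
        }
    except KeyError as missing:
        raise SystemExit(f"missing required environment variable: {missing}")
    # Opt-in only: anything other than 'true' keeps verification enabled.
    cfg["verify_tls"] = env.get("WAZUH_TLS_INSECURE", "false").lower() != "true"
    return cfg
```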
## Approach

A declarative YAML spec drives the harness. Each test names the target agent, a lookback window, and the evidence an alert must carry to count as a match:

```yaml
tests:
  - test_name: ART_T1110_001_password_guessing
    platform: windows
    agent_name: WINDOWS-PRIMARY
    lookback_minutes: 30
    expected:
      rule_id: "60204"
      rule_groups:
        - authentication_failures
      must_contain:
        - "logon"

  - test_name: ART_T1003_001_lsass_dump
    platform: windows
    agent_name: WINDOWS-PRIMARY
    lookback_minutes: 30
    expected:
      rule_groups:
        - credential_access
      mitre_ids:
        - T1003
      must_contain:
        - "lsass"
```
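Each returned alert then has to be checked against the `expected` block. A sketch of that client-side match, assuming the alert document follows the standard `wazuh-alerts` schema and that `must_contain` strings are matched case-insensitively against the alert's `full_log` field (both assumptions, not confirmed by the source):

```python
def alert_matches(expected: dict, alert: dict) -> bool:
    """Check one indexed alert (_source document) against an
    'expected' block from the YAML spec. Every constraint that is
    present must hold; absent keys constrain nothing.
    """
    rule = alert.get("rule", {})
    if "rule_id" in expected and str(rule.get("id")) != str(expected["rule_id"]):
        return False
    # Expected groups / MITRE IDs must be a subset of what the rule carries.
    if not set(expected.get("rule_groups", [])) <= set(rule.get("groups", [])):
        return False
    if not set(expected.get("mitre_ids", [])) <= set(rule.get("mitre", {}).get("id", [])):
        return False
    full_log = alert.get("full_log", "").lower()
    return all(s.lower() in full_log for s in expected.get("must_contain", []))
```

Treating absent keys as "no constraint" is what lets the second spec above omit `rule_id` and match on groups plus MITRE ID alone.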
## Evidence

```bash
export WAZUH_INDEXER_HOST='localhost'
export WAZUH_INDEXER_USER='admin'
export WAZUH_INDEXER_PASS='[REDACTED_INTERNAL]'
export WAZUH_TLS_INSECURE='true'

# From project root — runs Atomic tests on the endpoint first,
# then calls the harness to check whether detections fired:
./scripts/new_run.sh

# Output: run_02-19-2026_HHMMSS/
#   report.md        — pass/fail per technique
#   matches.json     — raw alert matches
#   query_debug.json — exact OpenSearch queries sent
```
## Outcome
Run from 2026-02-19, 3 tests against the primary Windows endpoint:
Run: run_02-19-2026_034113
Result: 0/3 PASS

| Test | Status | Expected | Matches |
|---|---|---|---|
| ART_T1110_001_password_guessing | FAIL | rule_id=60204; groups=auth_fail | 0 |
| ART_T1059_001_powershell | FAIL | groups=powershell; mitre=T1059 | 0 |
| ART_T1003_001_lsass_dump | FAIL | groups=credential_access; T1003 | 0 |
0/3 PASS is useful information. It tells you the lookback window was too narrow, the Atomic test artifacts didn't match the detection logic, or the Sysmon/agent config isn't capturing these events. That's a detection gap analysis, not a failure. A SOC that can run this harness and get 0/3 is in a better position than a SOC with no harness — because it now knows specifically where coverage is missing.
The SOC report from the same sprint (a separate detection validation tool) achieved 1/5 PASS — T1110 password guessing fired rule 60204 on the primary Windows endpoint at exactly 2026-02-19T02:20:25.678+0000, alert ID 1771467625.2810738. That match was captured and preserved.
## Why 0/3 is worth showing
Detection engineering portfolios often show only the successes. This shows the tooling, the methodology, and the honest result. 0/3 in a test run is a detection gap report. The value is in the process:
- The harness exists and is runnable
- The spec is version-controlled and reproducible
- The run artifacts are timestamped and preserved
- The gaps are now addressable — tune detection rules and re-run
## Lessons + next hardening step
- No active response — harness only reads from Wazuh Indexer, never writes or modifies the environment
- No Atomic execution — harness validates detections, does not run attacks; separation of concerns
- CI-compatible exit codes — non-zero on any FAIL; can gate CI pipelines on detection coverage
- Timestamped run folders — each run is isolated; history is preserved for trend analysis
- YAML spec — version-controlled, reviewable, extensible without modifying Python code
- TLS insecure flag — explicit opt-in for lab self-signed certs; not default
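The CI-gating and run-folder conventions above can be sketched together: write the artifacts into a timestamped directory, then surface coverage as the process exit code. The `results` shape (`{test_name: {"passed": bool, "matches": [...]}}`) and the function name are hypothetical, not the harness's actual schema:

```python
import json
from datetime import datetime
from pathlib import Path

def write_run_report(results: dict, out_root: Path) -> int:
    """Persist matches.json and a minimal report.md into a
    run_MM-DD-YYYY_HHMMSS folder; return 0 only if every test passed,
    so callers can sys.exit() the result and gate a CI pipeline on it.
    """
    run_dir = out_root / datetime.now().strftime("run_%m-%d-%Y_%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "matches.json").write_text(json.dumps(results, indent=2))
    passed = sum(1 for r in results.values() if r["passed"])
    lines = [f"Result: {passed}/{len(results)} PASS", ""]
    for name, r in results.items():
        status = "PASS" if r["passed"] else "FAIL"
        lines.append(f"| {name} | {status} | {len(r['matches'])} |")
    (run_dir / "report.md").write_text("\n".join(lines) + "\n")
    return 0 if passed == len(results) else 1
```

Because each run gets its own folder, the 0/3 report above stays on disk unchanged after rules are tuned and the harness is re-run, which is what makes trend analysis possible.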