Kosli recently released kosli evaluate trail, a command that evaluates selected attestations in a Kosli trail against a Rego policy file. We used it to build a complete and useful solution for tracking Snyk container vulnerabilities for cyber-dojo (an open-sourced browser based online tool for practising TDD which Kosli uses for demos). You’ll read about what we built, why we built it, how we tested it, and specifically:
- how it’s used in build workflows, in promotion workflows, and also in workflows that run “live” scans on already deployed artifacts
- how it runs with zero-trust against a policy defined in Rego and params files
The Problems
When you run snyk container test, any vulnerability with an ignore entry in the .snyk policy file is filtered out of the SARIF output before Kosli ever sees it. That filtering is silent. From Kosli’s point of view, the vulnerability does not exist.
- You lose visibility. You cannot see which vulnerabilities (CVEs) exist but are being ignored, or whether their ignore entries have expired.
- kosli attest snyk creates a non-compliant attestation for any new vulnerability not in the .snyk file, regardless of any other consideration, such as its severity. Suppose we want to treat new low-severity vulnerabilities and new critical-severity vulnerabilities differently?
- kosli attest snyk produces a single attestation covering all vulnerabilities for an artifact. You cannot evaluate individual vulnerabilities in isolation, track when each first appeared, or mark one as compliant while another is not.
What we want is:
- visibility into all vulnerabilities
- compliance controlled by explicit rules based on severity and age (for example), not by whether a vulnerability happened to appear in a .snyk file before Kosli saw the SARIF output.
- workflows that are easy to manage, and do not block us during frequent bursts of new low-severity CVEs.
Design overview
The core artifact_snyk_test.yml reusable workflow runs a snyk container test and writes one “low-level” data attestation, called snyk, for each CVE found.
We use kosli evaluate trail to evaluate these multiple trails independently, and aggregate their results into a single artifact-level attestation: compliant only if every individual kosli evaluate trail passes.
kosli evaluate trail is what makes this architecture practical. Without per-trail evaluation, you cannot reason about individual vulnerabilities in isolation and this entire design collapses back into a single pass/fail judgment for the artifact as a whole.
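The aggregation step can be pictured with a small Python sketch. It is illustrative only: `evaluate` is a hypothetical stand-in for running kosli evaluate trail on one trail and returning its exit code, and the trail names are made up.

```python
def aggregate_compliance(trail_names, evaluate):
    """Evaluate every trail independently and AND the results together.

    `evaluate` is a hypothetical callable standing in for one
    `kosli evaluate trail` run; it returns an exit code (0 = policy passed).
    """
    per_trail = {name: evaluate(name) == 0 for name in trail_names}
    # all() over an empty dict is True: zero vulnerabilities means compliant.
    return all(per_trail.values()), per_trail


def fake_evaluate(trail_name):
    """Illustration only: pretend every critical trail fails its policy."""
    return 1 if "-critical-" in trail_name else 0


trails = ["runner-high-SNYK-A-1", "runner-low-SNYK-B-2", "runner-critical-SNYK-C-3"]
compliant, details = aggregate_compliance(trails, fake_evaluate)
print(compliant)
print(details)
```

One failing trail is enough to make the artifact-level result non-compliant, while the per-trail dictionary keeps the individual verdicts available for reporting.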
The snyk scan and the .snyk policy file
artifact_snyk_test.yml runs the Snyk scan without the .snyk policy file, so all vulnerabilities appear in the SARIF output.
But the SARIF output alone is not enough. The Rego policy needs to know whether the artifact’s .snyk file has an active ignore entry for each vulnerability, and if so, whether that entry has expired. Without that information baked into the attestation data, the policy cannot distinguish between a vulnerability that is genuinely unaddressed and one that has been deliberately accepted with an expiry date.
A simple Python script merges the two sources:
- the SARIF to extract vulnerability IDs and severities
- the .snyk YAML to add any ignore expiry dates. The .snyk file is fetched at scan time from the artifact’s specific commit SHA, not from HEAD.
- any ignore entries in the .snyk file that have no matching vulnerability in the SARIF output are reported in the GitHub step summary, flagging entries that can be cleaned up.
The result is a JSON array with one object per vulnerability.
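A minimal sketch of such a merge, in Python, might look like the following. It is not the actual script: it assumes the findings have already been pulled out of the SARIF as id/severity pairs, and that the .snyk YAML has already been parsed into a dictionary shaped like `ignore: {vuln_id: [{path: {expires: ...}}]}`. The output field names are borrowed from the Rego policy shown later; the sample IDs and dates are invented.

```python
def first_expiry(entries):
    """Return the first 'expires' value found in a .snyk ignore entry list."""
    for entry in entries or []:
        for rule in entry.values():
            if isinstance(rule, dict) and rule.get("expires"):
                return str(rule["expires"])
    return None


def merge_findings(findings, snyk_policy):
    """Combine scan findings with .snyk ignore expiry dates.

    findings:    list of {"id", "severity"} dicts (assumed pre-extracted from SARIF)
    snyk_policy: parsed .snyk YAML as a dict (assumed shape, see lead-in)
    Returns (one object per vulnerability, ignore ids with no matching finding).
    """
    ignores = snyk_policy.get("ignore") or {}
    merged = []
    for finding in findings:
        expires = first_expiry(ignores.get(finding["id"]))
        merged.append({
            "full_id": finding["id"],
            "severity": finding["severity"],
            "ignore_expires_exists": expires is not None,
            "ignore_expires": expires,
        })
    found_ids = {finding["id"] for finding in findings}
    unmatched = sorted(set(ignores) - found_ids)
    return merged, unmatched


findings = [
    {"id": "SNYK-A-1", "severity": "high"},
    {"id": "SNYK-B-2", "severity": "low"},
]
policy = {
    "ignore": {
        "SNYK-B-2": [{"*": {"reason": "no fix yet", "expires": "2030-01-01T00:00:00.000Z"}}],
        "SNYK-OLD-9": [{"*": {"expires": "2030-01-01T00:00:00.000Z"}}],
    }
}
merged, unmatched = merge_findings(findings, policy)
print(merged)
print(unmatched)
```

The `unmatched` list corresponds to the stale ignore entries that get flagged for clean-up.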
Write one data attestation per vulnerability
For each object in the JSON array we make a data attestation:
- the flow is called snyk-{env}-per-vuln where env is either aws-beta or aws-prod
- the trail name follows the pattern: {repo_name}-{severity}-{CVE_ID}.
For example runner-high-SNYK-GOLANG-GOLANGORGXCRYPTOSSHAGENT-14059804 means the snyk vulnerability GOLANG-GOLANGORGXCRYPTOSSHAGENT-14059804, whose severity is high, was found in the artifact built from the runner repository.
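As a small sketch, the two naming patterns can be written as Python helpers. The function names are hypothetical; only the patterns themselves come from the text above.

```python
def flow_name(env):
    """Build the per-vulnerability flow name: snyk-{env}-per-vuln."""
    if env not in ("aws-beta", "aws-prod"):
        raise ValueError(f"unknown environment: {env}")
    return f"snyk-{env}-per-vuln"


def trail_name(repo_name, severity, cve_id):
    """Build the trail name: {repo_name}-{severity}-{CVE_ID}."""
    return f"{repo_name}-{severity}-{cve_id}"


print(flow_name("aws-prod"))
print(trail_name("runner", "high", "SNYK-GOLANG-GOLANGORGXCRYPTOSSHAGENT-14059804"))
```

Because the repo name and severity are part of the trail name, the same vulnerability found in two repos ends up on two separate trails.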
Evaluate each data attestation against a Rego policy
We run kosli evaluate trail against each vulnerability trail, evaluating its data attestation and applying a Rego policy that decides compliance. The Rego policy has four cases:
- Active ignore entry: the .snyk file contains an ignore entry with an expiry date that has not yet been reached and is not more than max_ignore_expiry_days in the future. Compliant.
- Expired ignore entry: the expiry date in the .snyk file is in the past. Non-compliant.
- Ignore entry too far ahead: the expiry date is more than max_ignore_expiry_days in the future. Non-compliant.
- No ignore entry: the .snyk file has no ignore entry for this vulnerability, so there is no expiry date to check. Compliance instead depends on the age of the vulnerability, against a per-severity threshold. The thresholds differ between aws-beta and aws-prod runtime environments.
The trail is created the first time the vulnerability is seen (per repo). Its created_at timestamp becomes the “first seen” date. On every subsequent scan, the same trail is reused and its creation date is preserved. No separate database is needed.
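The four cases can be modelled in a few lines of Python. This is only a sketch of the decision logic, not the policy itself (which is written in Rego); the field names mirror those used in the Rego rules, and the timestamps are invented for the example.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def is_compliant(vuln, params):
    """Model of the four cases for one vulnerability (timestamps in seconds)."""
    now = vuln["now_ts"]
    if vuln["ignore_expires_exists"]:
        # Cases 1-3: the ignore entry must be unexpired and not too far ahead.
        expires = vuln["ignore_expires_ts"]
        latest = now + params["max_ignore_expiry_days"] * SECONDS_PER_DAY
        return now <= expires <= latest
    # Case 4: no ignore entry, so age against the per-severity threshold decides.
    age_days = (now - vuln["first_seen_ts"]) / SECONDS_PER_DAY
    return age_days < params["max_days_by_severity"][vuln["severity"]]


params = {
    "max_days_by_severity": {"critical": 0, "high": 2, "medium": 4, "low": 10},
    "max_ignore_expiry_days": 30,
}
now = 1000 * SECONDS_PER_DAY


def ignored(days_ahead):
    return {"now_ts": now, "ignore_expires_exists": True,
            "ignore_expires_ts": now + days_ahead * SECONDS_PER_DAY}


def unignored(severity, age_in_days):
    return {"now_ts": now, "ignore_expires_exists": False, "severity": severity,
            "first_seen_ts": now - age_in_days * SECONDS_PER_DAY}


print(is_compliant(ignored(10), params))        # active ignore entry
print(is_compliant(ignored(-1), params))        # expired ignore entry
print(is_compliant(ignored(31), params))        # ignore entry too far ahead
print(is_compliant(unignored("high", 1), params))  # no ignore entry, young enough
```

A missing params field raises a KeyError here rather than silently passing, which is the same fail-closed spirit the Rego policy aims for.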
The Rego policy
Rego (https://www.openpolicyagent.org/docs/latest/policy-language/) is the policy language of the Open Policy Agent (https://www.openpolicyagent.org/) (OPA) project. You write rules in a .rego file that express what it means for some input data to be compliant. OPA evaluates the rules against the input and returns a result – in our case, allow (a boolean) and violations (a set of diagnostic strings). Rules can reference external data via data.params, which lets you separate the policy logic from the thresholds it enforces. You supply a params file (plain JSON) at evaluation time, and OPA makes its fields available to the rules as data.params.*. This lets us use one .rego file and two params files to enforce different thresholds for aws-beta and aws-prod.
Environment-specific params files
The params file containing the thresholds for aws-prod is:
{
  "max_days_by_severity": {
    "critical": 0,
    "high": 2,
    "medium": 4,
    "low": 10
  },
  "max_ignore_expiry_days": 30
}
For example, the medium threshold is 4 days, meaning a new (for that repo) medium severity vulnerability (such as, deep breath, SNYK-GOLANG-GITHUBCOMSIGSTORETIMESTAMPAUTHORITYV2PKGVERIFICATION-16134930) in aws-prod will cause non-compliance in 4 days, unless, for that repo, you:
- deploy a new artifact without the vulnerability, to aws-prod, or
- add an ignore entry for the vulnerability to the .snyk file
Note:
- attestations are immutable and are recorded against the fingerprint of the artifact. It is impossible to fix the vulnerability on the actual running artifact.
- critical: 0 means age_days(vuln) < 0, which is never true. Any critical vulnerability in (or being deployed to) aws-prod is non-compliant.
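To make the arithmetic concrete, here is a small Python sketch that turns a first-seen date and a severity into the moment non-compliance starts. The helper name and the example date are assumptions for illustration; the thresholds are the aws-prod values above.

```python
from datetime import datetime, timedelta, timezone

AWS_PROD_PARAMS = {
    "max_days_by_severity": {"critical": 0, "high": 2, "medium": 4, "low": 10},
    "max_ignore_expiry_days": 30,
}


def non_compliant_from(first_seen, severity, params):
    """Return the instant at which an un-ignored vulnerability stops being compliant.

    The policy requires age < limit, so the deadline is first_seen + limit days.
    """
    limit_days = params["max_days_by_severity"][severity]
    return first_seen + timedelta(days=limit_days)


first_seen = datetime(2025, 1, 10, 9, 0, tzinfo=timezone.utc)  # invented example date
medium_deadline = non_compliant_from(first_seen, "medium", AWS_PROD_PARAMS)
critical_deadline = non_compliant_from(first_seen, "critical", AWS_PROD_PARAMS)
print(medium_deadline.isoformat())
print(critical_deadline.isoformat())
```

For a critical vulnerability the deadline equals the first-seen time, which is another way of saying it is never compliant in aws-prod.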
The Rego policy lives in snyk-vuln-compliance.rego.
It starts by aliasing the data.params fields, giving them shorter names for use
throughout the policy.
package policy
import rego.v1
max_days_by_severity := data.params.max_days_by_severity
max_ignore_expiry_days := data.params.max_ignore_expiry_days
...
Compliance must be driven via a positive assertion
vuln_of reads the JSON from the attestation named snyk, the individual data attestation for one vulnerability:
vuln_of(trail) := trail.compliance_status.attestations_statuses["snyk"].attestation_data
age_days finds the age of a vulnerability in days:
seconds_per_day := 60 * 60 * 24
age_days(vuln) := (vuln.now_ts - vuln.first_seen_ts) / seconds_per_day
Then the core functionality:
- allow defaults to false
- allow is only set to true via a positive assertion through trail_is_compliant.
- a trail is compliant when:
  - there is no .snyk ignore entry and the vulnerability age is within the per-severity Rego threshold, or
  - there is an active .snyk ignore entry (not expired, not too far in the future)
...
default allow := false
# Use < so that critical (max=0) is non-compliant on day zero
age_within_limit(vuln) if {
vuln.ignore_expires_exists == false
age_days(vuln) < max_days_by_severity[vuln.severity]
}
ignore_is_active(vuln) if {
vuln.ignore_expires_exists == true
vuln.ignore_expires_ts >= vuln.now_ts
vuln.ignore_expires_ts <= vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}
# Case 1: no .snyk ignore entry -- age determines compliance
# Case 2: .snyk ignore entry exists and is active (not expired and expiry date not too far in the future) -- compliant regardless of age
trail_is_compliant(trail) if age_within_limit(vuln_of(trail))
trail_is_compliant(trail) if ignore_is_active(vuln_of(trail))
allow if trail_is_compliant(input.trail)
...
Never produce a false-positive compliant
The Rego evaluation must never incorrectly produce a false-positive compliant result. This can happen quite easily unless we understand OPA’s undefined behaviour failure mode, and is why compliance must be driven via a positive assertion. In a compliance-path rule (one that can make allow true), an undefined reference is dangerous if the rule is negated. For example:
ignore_too_far_ahead(vuln) if {
vuln.ignore_expires_exists == true
vuln.ignore_expires_ts > vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}
ignore_is_active(vuln) if {
vuln.ignore_expires_exists == true
vuln.ignore_expires_ts >= vuln.now_ts
not ignore_too_far_ahead(vuln) # dangerous
}
A missing max_ignore_expiry_days param would silently produce false compliance:
- max_ignore_expiry_days * seconds_per_day is undefined
- ignore_too_far_ahead fails to fire and is treated as false
- not false is true
- the guard passes vacuously
- ignore_is_active fires
- trail_is_compliant fires
- allow is incorrectly true
The safe pattern is a positive assertion in place of the negation:
ignore_is_active(vuln) if {
vuln.ignore_expires_exists == true
vuln.ignore_expires_ts >= vuln.now_ts
vuln.ignore_expires_ts <= vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}
Now, if max_ignore_expiry_days is absent from the params file:
- the third condition fails to evaluate
- ignore_is_active also fails
- allow defaults to false, the correct fail-safe outcome
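The difference between the two patterns can be illustrated with a loose Python analogy. This is not how OPA evaluates Rego; it simply models a rule as being either True or undefined (represented here by None), and models not as "true when the operand is undefined". All function names mirror the Rego rules, with `_negated` and `_positive` suffixes added for the comparison.

```python
SECONDS_PER_DAY = 60 * 60 * 24
UNDEFINED = None  # stand-in for OPA's "undefined"; a rule is True or UNDEFINED


def limit_ts(vuln, params):
    """Latest acceptable expiry; undefined when the param is missing."""
    max_days = params.get("max_ignore_expiry_days")
    if max_days is None:
        return UNDEFINED
    return vuln["now_ts"] + max_days * SECONDS_PER_DAY


def ignore_too_far_ahead(vuln, params):
    limit = limit_ts(vuln, params)
    if limit is UNDEFINED:
        return UNDEFINED  # the rule body fails, so the rule is undefined
    return True if vuln["ignore_expires_ts"] > limit else UNDEFINED


def negate(value):
    """Model of Rego's `not`: true exactly when the operand is undefined or false."""
    return True if (value is UNDEFINED or value is False) else UNDEFINED


def ignore_is_active_negated(vuln, params):
    """The dangerous pattern: relies on `not ignore_too_far_ahead`."""
    ok = (vuln["ignore_expires_exists"]
          and vuln["ignore_expires_ts"] >= vuln["now_ts"]
          and negate(ignore_too_far_ahead(vuln, params)))
    return True if ok else UNDEFINED


def ignore_is_active_positive(vuln, params):
    """The safe pattern: asserts the upper bound directly."""
    limit = limit_ts(vuln, params)
    ok = (vuln["ignore_expires_exists"]
          and vuln["ignore_expires_ts"] >= vuln["now_ts"]
          and limit is not UNDEFINED
          and vuln["ignore_expires_ts"] <= limit)
    return True if ok else UNDEFINED


now = 1000 * SECONDS_PER_DAY
vuln = {"now_ts": now, "ignore_expires_exists": True,
        "ignore_expires_ts": now + 365 * SECONDS_PER_DAY}  # a year ahead

broken_params = {}  # max_ignore_expiry_days is missing
print(ignore_is_active_negated(vuln, broken_params))   # wrongly active
print(ignore_is_active_positive(vuln, broken_params))  # stays undefined
```

With the param missing, the negated version reports an ignore entry a year in the future as active, while the positive version stays undefined and so cannot make allow true.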
Violations provide diagnostics only
Violations must not drive the allow decision. The absence of violations must not drive trail_is_compliant.
In a violations rule, an undefined reference causes the rule body to fail silently: no diagnostic message is produced. This is the safe failure mode – a lost message, not a lost check. (See OPA issue #1857.)
Case 1 violation: no .snyk ignore entry and vulnerability age exceeds the threshold for its severity
violations contains msg if {
vuln := vuln_of(input.trail)
vuln.ignore_expires_exists == false
not age_within_limit(vuln)
msg := sprintf(
"trail '%v': %v severity vuln age %d days exceeds %d day limit for severity %v",
[input.trail.name, vuln.full_id, age_days(vuln), max_days_by_severity[vuln.severity], vuln.severity],
)
}
Case 2 violation: .snyk ignore entry exists but is not active
inactive_ignore_msg(trail) := msg if {
vuln := vuln_of(trail)
ignore_has_expired(vuln)
msg := sprintf(
"trail '%v': %v snyk ignore entry expired at %v",
[trail.name, vuln.full_id, vuln.ignore_expires],
)
}
inactive_ignore_msg(trail) := msg if {
vuln := vuln_of(trail)
ignore_too_far_ahead(vuln)
msg := sprintf(
"trail '%v': %v snyk ignore entry expiry %v is more than %d days ahead",
[trail.name, vuln.full_id, vuln.ignore_expires, max_ignore_expiry_days],
)
}
violations contains msg if {
vuln := vuln_of(input.trail)
vuln.ignore_expires_exists == true
not ignore_is_active(vuln)
msg := inactive_ignore_msg(input.trail)
}
Seeing it in action
The core artifact_snyk_test.yml workflow implements a zero trust snyk attestation. The Rego and params files control the policy.
- it does a snyk container test on a specified artifact.
- it writes a single artifact-level aggregate attestation, to the workflow’s trail.
- the attestation is compliant if and only if kosli evaluate trail returns zero for every one of the per-vulnerability attestations it aggregates.
Here is an example of an attestation called runner.snyk-container-scan.
Each vuln_url_pass_NNN or vuln_url_fail_NNN annotation on this attestation is a link to its individual per-vulnerability trail. Here is the trail for 001 from above, a high-severity Go vulnerability found in runner:
Let’s look at this in detail, in the YAML of three different GitHub workflows.
1. Build workflows
Every cyber-dojo microservice repo has a main.yml workflow that builds a Docker image. After the image is built, the snyk-container-scan job calls the core artifact_snyk_test.yml reusable workflow. Here is an example for the runner repo:
runner/.github/workflows/main.yml#L165
snyk-container-scan:
  needs: [build-image]
  uses: cyber-dojo/snyk-scanning/.github/workflows/artifact_snyk_test.yml@main
  with:
    artifact_name: ${{ needs.build-image.outputs.tagged_image_name }}
    kosli_flow: ${{ vars.KOSLI_FLOW }}
    kosli_trail: ${{ github.sha }}
    kosli_attestation_name: runner.snyk-container-scan
  secrets:
    snyk_token: ${{ secrets.SNYK_TOKEN }}
    kosli_api_token: ${{ secrets.KOSLI_API_TOKEN }}
- The artifact_name is the just-built image we want to scan.
- The flow is named after the repo (eg runner-ci for the runner repo).
- The trail name is the git commit.
- The attestation name is runner.snyk-container-scan
- The snyk_token gives permission to run the scan
- The kosli_api_token gives permission to write the attestation
The workflow’s subsequent sdlc-control-gate job “gates” the deployment of the runner artifact to its aws-beta environment, using kosli assert artifact
runner/.github/workflows/main.yml#L315
sdlc-control-gate:
  if: github.ref == 'refs/heads/main'
  runs-on: ubuntu-latest
  needs:
    - build-image
    - snyk-container-scan
  ...
  env:
    KOSLI_FINGERPRINT: ${{ needs.build-image.outputs.digest }}
  steps:
    ...
    - name: Kosli SDLC gate to short-circuit the workflow
      run:
        kosli assert artifact
          --environment="${KOSLI_AWS_BETA}"
The environment policy of aws-beta includes a compliant snyk attestation:
- if the artifact has a non-compliant snyk attestation the kosli assert artifact command will exit with non-zero, we will not reach the deployment job, and the image will not be deployed to aws-beta.
- if the artifact has a compliant snyk attestation, and meets all other aws-beta policy requirements, the kosli assert artifact command will exit with zero, we will reach the deployment job, and the image will be deployed to aws-beta
2. Promotion workflows
The aws-prod-co-promotion repo holds workflows for promoting one or more cyber-dojo artifacts from aws-beta to aws-prod. The promote_all.yml workflow runs the core artifact_snyk_test.yml workflow for each artifact (found using kosli get snapshot), this time applying the stricter aws-prod Rego params:
aws-prod-co-promotion/.github/workflows/promote_all.yml#L91
snyk-scan:
  needs: [setup, find-promotions]
  strategy:
    matrix:
      include: ${{ fromJSON(needs.find-promotions.outputs.promotions) }}
  uses: cyber-dojo/snyk-scanning/.github/workflows/artifact_snyk_test.yml@main
  with:
    artifact_name: ${{ matrix.incoming_image_name }}
    kosli_flow: ${{ vars.KOSLI_FLOW }}
    kosli_trail: ${{ needs.setup.outputs.kosli_trail }}
    kosli_attestation_name: ${{ matrix.incoming_repo_name }}.snyk-scan
    kosli_env: ${{ vars.KOSLI_AWS_PROD }}
    repo_name: ${{ matrix.incoming_repo_name }}
    raw_snyk_policy_url: https://raw.githubusercontent.com/cyber-dojo/${{ matrix.incoming_repo_name }}/${{ matrix.incoming_commit_sha }}/.snyk
  secrets:
    snyk_token: ${{ secrets.SNYK_TOKEN }}
    kosli_api_token: ${{ secrets.KOSLI_API_TOKEN }}
Note:
- The snyk-scan jobs run in parallel for all artifacts being promoted.
- Each “aggregate” attestation (one per artifact) is written to a flow representing this workflow’s promotion process: vars.KOSLI_FLOW == production-promotion.
- kosli_trail is promote-all-${{ github.run_number }}.
- kosli_env (which defaults to aws-beta) is set to aws-prod, to pick up the stricter thresholds in rego.params.aws-prod.json.
- raw_snyk_policy_url controls the identity of the .snyk policy file (which defaults to the .snyk file in the current commit). Here it is built from matrix.incoming_commit_sha, the exact commit SHA that built the artifact being promoted, not HEAD.
Once again, the workflow’s subsequent sdlc-control-gate job “gates” the deployment of the artifacts to the aws-prod environment, using kosli assert artifact
aws-prod-co-promotion/.github/workflows/promote_all.yml#L135
sdlc-control-gate:
  if: ${{ needs.find-promotions.outputs.promotions != '[]' }}
  needs:
    - setup
    - find-promotions
    - snyk-scan
  runs-on: ubuntu-latest
  strategy:
    matrix:
      include: ${{ fromJSON(needs.find-promotions.outputs.promotions) }}
  env:
    KOSLI_TRAIL: ${{ needs.setup.outputs.kosli_trail }}
    ...
  steps:
    ...
    - name: Assert Artifact is compliant for aws-prod
      run: |
        ...
        kosli assert artifact \
          --fingerprint "${{ matrix.incoming_fingerprint }}" \
          --environment "${KOSLI_AWS_PROD}"
The environment policy of aws-prod also includes a compliant snyk attestation:
- if any artifact has a non-compliant snyk attestation the kosli assert artifact command will exit with non-zero, we will not reach the deployment job, and no images will be promoted to aws-prod.
- if all artifacts have a compliant snyk attestation, and meet all other aws-prod policy requirements, the kosli assert artifact commands will exit with zero, we will reach the final deployment job, and all images will be deployed to aws-prod
3. Continuous environment scanning workflows
An image can pass its Snyk scan at build or promotion time but become vulnerable later. CVE databases are updated continuously, and new vulnerabilities are regularly discovered in packages that were considered safe when the image was built or deployed.
Two scheduled workflows, aws-beta.yml and aws-prod.yml, run once a day and scan every artifact currently running in each environment. They also trigger automatically when the Rego file or the relevant params file changes.
They call env_snyk_test.yml, which finds all artifacts currently running in the environment, again using kosli get snapshot, and fans out to artifact_snyk_test.yml via a matrix strategy for each artifact:
find-artifacts:
  ...
  steps:
    - name: Generate JSON for each Artifact in KOSLI_ENV
      id: set-artifacts
      run: |
        artifacts="$(make artifacts | jq --raw-output --compact-output .)"
        echo "artifacts=${artifacts}" >> ${GITHUB_OUTPUT}
artifact-snyk-test:
  needs: find-artifacts
  strategy:
    matrix:
      include: ${{fromJSON(needs.find-artifacts.outputs.artifacts)}}
  uses: ./.github/workflows/artifact_snyk_test.yml
  with:
    artifact_name: ${{matrix.artifact_name}}
    kosli_flow: ${{inputs.kosli_flow}}
    kosli_trail: ${{matrix.repo_name}}-${{matrix.artifact_fingerprint}}
    kosli_attestation_name: ${{matrix.repo_name}}.snyk-container-scan
    kosli_env: ${{inputs.kosli_env}}
    raw_snyk_policy_url: ${{matrix.raw_snyk_policy_url}}
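The matrix consumed by this job is a JSON array with one object per running artifact. A rough Python sketch of building such an array follows. It is not the real conversion script: the input keys (image_name, fingerprint, repo_name, commit_sha) are hypothetical and do not reflect the actual kosli get snapshot schema, and only the GitHub raw URL form shown earlier is handled. The output keys are the ones the workflow reads from matrix.

```python
import json


def raw_snyk_policy_url(org, repo_name, commit_sha):
    """URL of the .snyk file at the exact commit that built the artifact."""
    return f"https://raw.githubusercontent.com/{org}/{repo_name}/{commit_sha}/.snyk"


def matrix_entry(artifact, org="cyber-dojo"):
    """Map one running artifact (hypothetical input keys) to one matrix object."""
    repo = artifact["repo_name"]
    return {
        "artifact_name": artifact["image_name"],
        "artifact_fingerprint": artifact["fingerprint"],
        "repo_name": repo,
        "raw_snyk_policy_url": raw_snyk_policy_url(org, repo, artifact["commit_sha"]),
    }


# Invented sample input, for illustration only.
running = [
    {"image_name": "example.registry/runner:abc1234", "fingerprint": "f" * 64,
     "repo_name": "runner", "commit_sha": "abc1234"},
]
matrix = [matrix_entry(a) for a in running]
print(json.dumps(matrix, separators=(",", ":")))
```

Printing compact JSON on a single line matters because the value is written to GITHUB_OUTPUT and later parsed with fromJSON.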
The attestations are written to a dedicated flow, called snyk-{env}-per-artifact, which holds one trail per artifact fingerprint, named {repo_name}-{fingerprint}. Scanning the same artifact twice produces a new attestation on the same trail. This records the full scan history for each artifact fingerprint as it runs in the environment.
As always, each attestation is made against the artifact fingerprint. Any non-compliant attestation will cause the environment to become non-compliant.
We need to know how many days until the next vulnerability will cause non-compliance, so that we can decide when to act and try to avoid non-compliance. A workflow job calls a simple Python script, find_expiring_vulns.py, to find this information by reading the vuln-*.json files produced during the current scan run, and sends a message to a Slack channel. If the workflow fails we also detect that and send an error message to the Slack channel.
The Slack message includes two links: one to a how-to-respond guide, and one to the GitHub workflow log, which includes step summaries of all vulnerabilities per artifact/repo, plus a table showing the vulnerabilities across all artifacts by severity.
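The "days until non-compliance" calculation could be sketched in Python along these lines. This is an assumption-laden illustration rather than the contents of find_expiring_vulns.py: the field names follow the Rego policy, and the sample vulnerabilities are invented.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def days_remaining(vuln, params):
    """Days until a currently-compliant vulnerability turns non-compliant.

    Returns None when the vulnerability is already non-compliant.
    """
    now = vuln["now_ts"]
    if vuln["ignore_expires_exists"]:
        expires = vuln["ignore_expires_ts"]
        latest = now + params["max_ignore_expiry_days"] * SECONDS_PER_DAY
        if not (now <= expires <= latest):
            return None
        return (expires - now) / SECONDS_PER_DAY
    limit = params["max_days_by_severity"][vuln["severity"]]
    age = (now - vuln["first_seen_ts"]) / SECONDS_PER_DAY
    return limit - age if age < limit else None


def next_deadline(vulns, params):
    """Return (days_remaining, full_id) for the soonest deadline, or None."""
    candidates = [
        (remaining, v["full_id"])
        for v in vulns
        if (remaining := days_remaining(v, params)) is not None
    ]
    return min(candidates, default=None)


params = {
    "max_days_by_severity": {"critical": 0, "high": 2, "medium": 4, "low": 10},
    "max_ignore_expiry_days": 30,
}
now = 100 * SECONDS_PER_DAY
vulns = [
    {"full_id": "SNYK-A", "severity": "high", "now_ts": now,
     "ignore_expires_exists": False, "first_seen_ts": now - 1 * SECONDS_PER_DAY},
    {"full_id": "SNYK-B", "severity": "low", "now_ts": now,
     "ignore_expires_exists": True, "ignore_expires_ts": now + 3 * SECONDS_PER_DAY},
    {"full_id": "SNYK-C", "severity": "critical", "now_ts": now,
     "ignore_expires_exists": False, "first_seen_ts": now},
]
print(next_deadline(vulns, params))
```

Already non-compliant vulnerabilities are excluded, since the point of the message is to warn about deadlines that can still be met.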
Testing
The tests live in the tests/ dir:
- test_rego_rules.sh tests the Rego policy directly using kosli evaluate input, constructing JSON input for each case and asserting on allow and violations. It covers all four compliance cases against both the aws-beta and aws-prod params files. It also has tests that guard against OPA’s undefined behaviour.
- test_rego_params.sh guards the invariant that every aws-prod limit is at most the equivalent aws-beta limit. It checks all four max_days_by_severity values (critical, high, medium, low) and max_ignore_expiry_days.
- test_combine_snyk.sh exercises bin/combine_snyk.py end-to-end with real SARIF fixture files and .snyk YAML fixtures. Test cases: zero vulnerabilities, one medium vulnerability without an ignore entry, one medium vulnerability with an active ignore entry, and two vulnerabilities.
- test_artifacts.sh covers bin/artifacts.py, which converts the kosli get snapshot JSON into a GitHub Actions matrix. It tests GitHub and GitLab repository URL formats.
- test_find_expiring_vulns_logic.py covers bin/find_expiring_vulns.py, which identifies currently-compliant vulnerabilities that are approaching their deadline. It tests the two functions that drive the decision: dot_snyk_result (returns a result when a .snyk ignore entry has a future expiry date) and rego_result (returns a result when vulnerability age is still within the per-severity limit).
- test_print_expiring_vulns_summary.sh covers bin/print_expiring_vulns_summary.py, which formats the JSON output from find_expiring_vulns.py as a Markdown step summary with one table per Snyk severity level (low, medium, high, critical), each sorted by days_remaining ascending.
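The params invariant (every aws-prod limit is at most the equivalent aws-beta limit) is simple enough to sketch in Python. The aws-prod values below are the ones shown earlier; the aws-beta values are hypothetical placeholders, since the real beta file is not reproduced here.

```python
SEVERITIES = ("critical", "high", "medium", "low")


def strictness_problems(beta, prod):
    """Return a list of messages for every limit where prod is looser than beta."""
    problems = []
    for severity in SEVERITIES:
        beta_limit = beta["max_days_by_severity"][severity]
        prod_limit = prod["max_days_by_severity"][severity]
        if prod_limit > beta_limit:
            problems.append(f"{severity}: prod {prod_limit} > beta {beta_limit}")
    if prod["max_ignore_expiry_days"] > beta["max_ignore_expiry_days"]:
        problems.append("max_ignore_expiry_days: prod is looser than beta")
    return problems


prod = {
    "max_days_by_severity": {"critical": 0, "high": 2, "medium": 4, "low": 10},
    "max_ignore_expiry_days": 30,
}
beta = {  # hypothetical values, for illustration only
    "max_days_by_severity": {"critical": 1, "high": 4, "medium": 8, "low": 20},
    "max_ignore_expiry_days": 60,
}
print(strictness_problems(beta, prod))
```

An empty list means the invariant holds; any entry names the limit that would let something into aws-prod that aws-beta would have rejected.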
Summary
We started with three problems: ignored vulnerabilities were silently filtered out of the SARIF output before Kosli ever saw them, so there was no visibility into what was being suppressed or why; a single all-or-nothing snyk attestation meant any new CVE blocked deployment regardless of severity; and with one attestation covering all vulnerabilities, there was no way to track individual CVEs, their age, or their compliance status over time.
Three workflows, CI build, promotion, and continuous scanning, now share the same artifact_snyk_test.yml workflow, with different inputs for kosli_env, kosli_flow, kosli_trail, and kosli_attestation_name. The Rego policy is the single source of truth for compliance rules. The per-environment params files control the thresholds.
kosli evaluate trail(s) is what ties it all together and makes an aggregate attestation architecture viable: one data attestation per vulnerability, evaluated individually, aggregated into a single artifact-level result.
New CVEs no longer block the development workflow the moment they appear, causing bursts of Snyk whack-a-mole. Nor do they necessarily cause an environment to turn non-compliant immediately. Both are controlled by a grace period determined by policy.