Snyk vulnerability compliance with kosli evaluate trail

Kosli recently released kosli evaluate trail, a command that evaluates selected attestations in a Kosli trail against a Rego policy file. We used it to build a complete and useful solution for tracking Snyk container vulnerabilities in cyber-dojo (an open-source, browser-based online tool for practising TDD, which Kosli uses for demos). You’ll read about what we built, why we built it, how we tested it, and specifically:

how it’s used in build workflows, in promotion workflows, and also in workflows that run “live” scans on already deployed artifacts
how it runs with zero-trust against a policy defined in Rego and params files

The Problems
Design overview
The Rego policy
Seeing it in action
Testing
Summary

The Problems

When you run snyk container test, any vulnerability with an ignore entry in the .snyk policy file is filtered out of the SARIF output before Kosli ever sees it. That filtering is silent. From Kosli’s point of view, the vulnerability does not exist.

You lose visibility. You cannot see which vulnerabilities (CVEs) exist but are being ignored, or whether their ignore entries have expired.
kosli attest snyk creates a non-compliant attestation for any new vulnerability not in the .snyk file, regardless of any other consideration, such as its severity. Suppose we want to treat new low-severity vulnerabilities and new critical-severity vulnerabilities differently?
kosli attest snyk produces a single attestation covering all vulnerabilities for an artifact. You cannot evaluate individual vulnerabilities in isolation, track when each first appeared, or mark one as compliant while another is not.

What we want is:

visibility into all vulnerabilities
compliance controlled by explicit rules based on severity and age (for example), not by whether a vulnerability happened to appear in a .snyk file before Kosli saw the SARIF output.
workflows that are easy to manage, and do not block us during frequent bursts of new low-severity CVEs.

Design overview

The core artifact_snyk_test.yml reusable workflow runs a snyk container test and writes one “low-level” data attestation, called snyk, for each CVE found.

Each of these attestations lives in its own per-vulnerability trail. We use kosli evaluate trail to evaluate these trails independently, and aggregate their results into a single artifact-level attestation: compliant only if every individual kosli evaluate trail passes.

Aggregate attestation diagram

kosli evaluate trail is what makes this architecture practical. Without per-trail evaluation, you cannot reason about individual vulnerabilities in isolation and this entire design collapses back into a single pass/fail judgment for the artifact as a whole.
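
The aggregation rule can be sketched in Python as follows. This is only an illustration: evaluate is a hypothetical callable standing in for one kosli evaluate trail invocation and returning its exit code, and the function name and result shape are assumptions, not the workflow's actual code.

```python
from typing import Callable, Iterable


def aggregate(trail_names: Iterable[str], evaluate: Callable[[str], int]) -> dict:
    """Evaluate every per-vulnerability trail and combine the results.

    `evaluate` is a hypothetical stand-in for running `kosli evaluate trail`
    on one trail; it returns the exit code (0 means the policy allowed it).
    """
    passed, failed = [], []
    for name in trail_names:
        # Every trail is evaluated, even after a failure, so the aggregate
        # records all failing vulnerabilities rather than only the first.
        (passed if evaluate(name) == 0 else failed).append(name)
    # Compliant only if no individual evaluation failed.
    return {"compliant": not failed, "passed": passed, "failed": failed}
```

Note that in this sketch an artifact with no vulnerabilities at all is compliant, because there is nothing to fail.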

The snyk scan and the .snyk policy file

artifact_snyk_test.yml runs the Snyk scan without the .snyk policy file, so all vulnerabilities appear in the SARIF output.

But the SARIF output alone is not enough. The Rego policy needs to know whether the artifact’s .snyk file has an active ignore entry for each vulnerability, and if so, whether that entry has expired. Without that information baked into the attestation data, the policy cannot distinguish between a vulnerability that is genuinely unaddressed and one that has been deliberately accepted with an expiry date.

A simple Python script merges the two sources:

the SARIF to extract vulnerability IDs and severities
the .snyk YAML to add any ignore expiry dates. The .snyk file is fetched at scan time from the artifact’s specific commit SHA, not from HEAD.
any ignore entries in the .snyk file that have no matching vulnerability in the SARIF output are reported in the GitHub step summary, flagging entries that can be cleaned up.

The result is a JSON array with one object per vulnerability.
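
A minimal sketch of such a merge, under several assumptions: the vulnerability IDs and severities have already been extracted from the SARIF into plain dicts, the .snyk YAML has already been parsed into a dict with string expiry dates, and the output field names are borrowed from the Rego policy shown later in this post. The function names, the choice of the latest expiry when several exist, and treating an ignore entry without an expiry as "no ignore entry" are all illustrative choices, not the behaviour of the real script.

```python
from datetime import datetime, timezone


def to_ts(expires: str) -> int:
    """Convert an ISO-8601 expiry (e.g. 2030-01-01T00:00:00.000Z) to epoch seconds."""
    parsed = datetime.fromisoformat(expires.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unspecified
    return int(parsed.timestamp())


def combine(vulns: list, dot_snyk: dict):
    """Merge SARIF-derived vulnerabilities with .snyk ignore expiry dates.

    vulns:    [{"full_id": ..., "severity": ...}, ...] (already extracted from SARIF)
    dot_snyk: parsed .snyk file, shaped like {"ignore": {ID: [{path: {"expires": ...}}]}}
    Returns (merged, unmatched_ignore_ids).
    """
    ignores = dot_snyk.get("ignore") or {}
    merged = []
    for vuln in vulns:
        entry = dict(vuln, ignore_expires_exists=False)
        expiries = [
            rule["expires"]
            for path_rules in ignores.get(vuln["full_id"], [])
            for rule in path_rules.values()
            if "expires" in rule
        ]
        if expiries:
            latest = max(expiries, key=to_ts)  # assumption: the latest expiry wins
            entry.update(
                ignore_expires_exists=True,
                ignore_expires=latest,
                ignore_expires_ts=to_ts(latest),
            )
        merged.append(entry)
    # Ignore entries with no matching vulnerability are candidates for clean-up.
    unmatched = sorted(set(ignores) - {v["full_id"] for v in vulns})
    return merged, unmatched
```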

Write one data attestation per vulnerability

For each object in the JSON array we make a data attestation:

the flow is called snyk-{env}-per-vuln where env is either aws-beta or aws-prod
the trail name follows the pattern: {repo_name}-{severity}-{CVE_ID}.

For example, runner-high-SNYK-GOLANG-GOLANGORGXCRYPTOSSHAGENT-14059804 means that the Snyk vulnerability SNYK-GOLANG-GOLANGORGXCRYPTOSSHAGENT-14059804, whose severity is high, was found in the artifact built from the runner repository.
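
The naming scheme is simple enough to express as two small helpers. This is a sketch only; the function names are hypothetical and the validation of env is an added assumption.

```python
def flow_name(env: str) -> str:
    """Build the per-vulnerability flow name for an environment."""
    if env not in ("aws-beta", "aws-prod"):
        raise ValueError(f"unknown environment: {env}")
    return f"snyk-{env}-per-vuln"


def trail_name(repo_name: str, severity: str, vuln_id: str) -> str:
    """Build the per-vulnerability trail name: {repo_name}-{severity}-{CVE_ID}."""
    return f"{repo_name}-{severity}-{vuln_id}"
```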

Evaluate each data attestation against a Rego policy

We run kosli evaluate trail against each vulnerability trail, evaluating its data attestation and applying a Rego policy that decides compliance. The Rego policy has four cases:

Active ignore entry: the .snyk file contains an ignore entry with an expiry date that has not yet been reached and is not more than max_ignore_expiry_days in the future. Compliant.
Expired ignore entry: the expiry date in the .snyk file is in the past. Non-compliant.
Ignore entry too far ahead: the expiry date is more than max_ignore_expiry_days in the future. Non-compliant.
No ignore entry: the .snyk file has no ignore entry for this vulnerability, so there is no expiry date to check. Compliance instead depends on the age of the vulnerability, against a per-severity threshold. The thresholds differ between the aws-beta and aws-prod runtime environments.

The trail is created the first time the vulnerability is seen (per repo). Its created_at timestamp becomes the “first seen” date. On every subsequent scan, the same trail is reused and its creation date is preserved. No separate database is needed.
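
To make the four cases concrete, here is a Python mirror of the decision. It is not the policy itself (that is written in Rego); the function name classify and the returned case labels are illustrative, and the field names are taken from the Rego policy shown later in this post.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def classify(vuln: dict, params: dict):
    """Return (compliant, case) for one vulnerability.

    A missing field or param raises KeyError, so this sketch can never
    report compliance by accident when data is absent.
    """
    now = vuln["now_ts"]
    if vuln["ignore_expires_exists"]:
        expires = vuln["ignore_expires_ts"]
        horizon = now + params["max_ignore_expiry_days"] * SECONDS_PER_DAY
        if expires < now:
            return False, "expired ignore entry"
        if expires > horizon:
            return False, "ignore entry too far ahead"
        return True, "active ignore entry"
    # No ignore entry: the age since first seen decides compliance.
    age_days = (now - vuln["first_seen_ts"]) / SECONDS_PER_DAY
    limit = params["max_days_by_severity"][vuln["severity"]]
    # Strictly less than, so a limit of 0 is never satisfied.
    return age_days < limit, "no ignore entry"
```

With a medium limit of 4 days, a medium-severity vulnerability first seen 3 days ago is compliant, and the same vulnerability is non-compliant once it is 4 days old.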

The Rego policy

Rego (https://www.openpolicyagent.org/docs/latest/policy-language/) is the policy language of the Open Policy Agent (https://www.openpolicyagent.org/) (OPA) project. You write rules in a .rego file that express what it means for some input data to be compliant. OPA evaluates the rules against the input and returns a result – in our case, allow (a boolean) and violations (a set of diagnostic strings). Rules can reference external data via data.params, which lets you separate the policy logic from the thresholds it enforces. You supply a params file (plain JSON) at evaluation time, and OPA makes its fields available to the rules as data.params.*. This lets us use one .rego file and two params files to enforce different thresholds for aws-beta and aws-prod.

Environment-specific params files

The params file containing the thresholds for aws-prod is:

{
    "max_days_by_severity": {
        "critical": 0,
        "high":     2,
        "medium":   4,
        "low":      10
    },
    "max_ignore_expiry_days": 30
}

rego.params.aws-prod.json

For example, the medium threshold is 4 days, meaning a medium-severity vulnerability that is new for that repo (such as, deep breath, SNYK-GOLANG-GITHUBCOMSIGSTORETIMESTAMPAUTHORITYV2PKGVERIFICATION-16134930) in aws-prod will cause non-compliance in 4 days unless, for that repo, you:

deploy a new artifact without the vulnerability, to aws-prod, or
add an ignore entry for the vulnerability to the .snyk file

Note:

attestations are immutable and are recorded against the fingerprint of the artifact. It is impossible to fix the vulnerability on the actual running artifact.
critical: 0 means the check is age_days(vuln) < 0, which is never true. Any critical vulnerability in (or being deployed to) aws-prod is non-compliant.

The Rego policy lives in snyk-vuln-compliance.rego.

It starts by aliasing the data.params fields, giving them shorter names for use throughout the policy.

package policy

import rego.v1

max_days_by_severity    := data.params.max_days_by_severity
max_ignore_expiry_days  := data.params.max_ignore_expiry_days
...

Compliance must be driven via a positive assertion

vuln_of reads the JSON from the attestation named snyk, the individual data attestation for one vulnerability:

vuln_of(trail) := trail.compliance_status.attestations_statuses["snyk"].attestation_data

age_days finds the age of a vulnerability in days:

seconds_per_day := 60 * 60 * 24
age_days(vuln) := (vuln.now_ts - vuln.first_seen_ts) / seconds_per_day

Then the core functionality:

allow defaults to false
allow is only set to true via a positive assertion through trail_is_compliant.
a trail is compliant when:
- there is no .snyk ignore entry and the vulnerability age is within the per-severity Rego threshold, or
- there is an active .snyk ignore entry (not expired, not too far in the future)

...
default allow := false

# Use < so that critical (max=0) is non-compliant on day zero
age_within_limit(vuln) if {
    vuln.ignore_expires_exists == false
    age_days(vuln) < max_days_by_severity[vuln.severity]
}

ignore_is_active(vuln) if {
    vuln.ignore_expires_exists == true
    vuln.ignore_expires_ts >= vuln.now_ts
    vuln.ignore_expires_ts <= vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}

# Case 1: no .snyk ignore entry -- age determines compliance
# Case 2: .snyk ignore entry exists and is active (not expired and expiry date not  too far in the future) -- compliant regardless of age

trail_is_compliant(trail) if age_within_limit(vuln_of(trail))
trail_is_compliant(trail) if ignore_is_active(vuln_of(trail))

allow if trail_is_compliant(input.trail)
...

Never produce a false-positive compliant

The Rego evaluation must never produce a false-positive compliant result. This can happen quite easily unless we understand OPA’s undefined behaviour failure mode, which is why compliance must be driven via a positive assertion. In a compliance-path rule (one that can make allow true), an undefined reference is dangerous if the rule is negated. For example:

ignore_too_far_ahead(vuln) if {
    vuln.ignore_expires_exists == true
    vuln.ignore_expires_ts > vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}

ignore_is_active(vuln) if {
    vuln.ignore_expires_exists == true
    vuln.ignore_expires_ts >= vuln.now_ts
    not ignore_too_far_ahead(vuln)      # dangerous
}

A missing max_ignore_expiry_days param would silently produce false compliance:

max_ignore_expiry_days * seconds_per_day is undefined
ignore_too_far_ahead fails to fire and is treated as false
not false is true
the guard passes vacuously
ignore_is_active fires
trail_is_compliant fires
allow is incorrectly true

The safe pattern is a positive assertion in place of the negation:

ignore_is_active(vuln) if {
    vuln.ignore_expires_exists == true
    vuln.ignore_expires_ts >= vuln.now_ts
    vuln.ignore_expires_ts <= vuln.now_ts + (max_ignore_expiry_days * seconds_per_day)
}

Now, if max_ignore_expiry_days is absent from the params file:

the third condition fails to evaluate
ignore_is_active also fails
allow defaults to false, the correct fail-safe outcome
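
The difference between the two patterns can be modelled in Python. This is an analogy, not OPA itself: an undefined reference is modelled as an exception, and the helper fires (a hypothetical name) models "a rule that does not evaluate simply does not fire". The ignore_expires_exists check is omitted to keep the sketch short.

```python
SECONDS_PER_DAY = 60 * 60 * 24


class Undefined(Exception):
    """Models a Rego reference that has no value."""


def param(params: dict, key: str):
    if key not in params:
        raise Undefined(key)
    return params[key]


def fires(rule, *args) -> bool:
    """A rule fires only if its body is defined and true."""
    try:
        return bool(rule(*args))
    except Undefined:
        return False  # undefined body: the rule silently does not fire


def ignore_too_far_ahead(vuln, params):
    limit = vuln["now_ts"] + param(params, "max_ignore_expiry_days") * SECONDS_PER_DAY
    return vuln["ignore_expires_ts"] > limit


def ignore_is_active_negated(vuln, params):
    # Dangerous: `not` turns "did not fire" (including undefined) into true.
    return vuln["ignore_expires_ts"] >= vuln["now_ts"] and not fires(
        ignore_too_far_ahead, vuln, params
    )


def ignore_is_active_positive(vuln, params):
    # Safe: the undefined param propagates, so this rule itself does not fire.
    limit = vuln["now_ts"] + param(params, "max_ignore_expiry_days") * SECONDS_PER_DAY
    return vuln["now_ts"] <= vuln["ignore_expires_ts"] <= limit
```

With an empty params dict and an expiry 1000 days ahead, fires(ignore_is_active_negated, ...) is true while fires(ignore_is_active_positive, ...) is false, which is the fail-safe outcome.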

Violations provide diagnostics only

Violations must not drive the allow decision. The absence of violations must not drive trail_is_compliant.

In a violations rule, an undefined reference causes the rule body to fail silently: no diagnostic message is produced. This is the safe failure mode – a lost message, not a lost check. (See OPA issue #1857.)

Case 1 violation: no .snyk ignore entry and vulnerability age exceeds the threshold for its severity

violations contains msg if {
    vuln := vuln_of(input.trail)
    vuln.ignore_expires_exists == false
    not age_within_limit(vuln)
    msg := sprintf(
        "trail '%v': %v severity vuln age %d days exceeds %d day limit for severity %v",
        [input.trail.name, vuln.full_id, age_days(vuln), max_days_by_severity[vuln.severity], vuln.severity],
    )
}

Case 2 violation: .snyk ignore entry exists but is not active

inactive_ignore_msg(trail) := msg if {
    vuln := vuln_of(trail)
    ignore_has_expired(vuln)
    msg := sprintf(
        "trail '%v': %v snyk ignore entry expired at %v",
        [trail.name, vuln.full_id, vuln.ignore_expires],
    )
}

inactive_ignore_msg(trail) := msg if {
    vuln := vuln_of(trail)
    ignore_too_far_ahead(vuln)
    msg := sprintf(
        "trail '%v': %v snyk ignore entry expiry %v is more than %d days ahead",
        [trail.name, vuln.full_id, vuln.ignore_expires, max_ignore_expiry_days],
    )
}

violations contains msg if {
    vuln := vuln_of(input.trail)
    vuln.ignore_expires_exists == true
    not ignore_is_active(vuln)
    msg := inactive_ignore_msg(input.trail)
}

Seeing it in action

The core artifact_snyk_test.yml workflow implements a zero trust snyk attestation. The Rego and params files control the policy.

it does a snyk container test on a specified artifact.
it writes a single artifact-level aggregate attestation, to the workflow’s trail.
the attestation is compliant if and only if kosli evaluate trail exits with zero for every per-vulnerability trail being aggregated.

Here is an example of an attestation called runner.snyk-container-scan.

runner-ci trail showing runner.snyk-container-scan attestation

https://app.kosli.com/cyber-dojo/flows/runner-ci/trails/a2ffba5a5debbc8f4f199cf5a88e5899c7d6547e?attestation_id=93067716-e6cf-43e8-9baa-e980e903

Each vuln_url_pass_NNN or vuln_url_fail_NNN annotation on this attestation is a link to its individual per-vulnerability trail. Here is the trail for 001 from above, a high-severity Go vulnerability found in runner:

Per-vulnerability trail runner-high-SNYK-GOLANG-GITHUBCOMAWSAWSSDKGOV2SERVICECLOUDWATCHLOGS-16316406 in snyk-aws-beta-per-vuln

https://app.kosli.com/cyber-dojo/flows/snyk-vulns-aws-beta/trails/runner-high-SNYK-GOLANG-GITHUBCOMAWSAWSSDKGOV2SERVICECLOUDWATCHLOGS-16316406?attestation_id=ab13eb4e-5df2-4615-aafe-b4cd5d5a

Let’s look at this in detail, in the YAML of three different GitHub workflows.

1. Build workflows

Every cyber-dojo microservice repo has a main.yml workflow that builds a Docker image. After the image is built, the snyk-container-scan job calls the core artifact_snyk_test.yml reusable workflow. Here is an example for the runner repo:

runner/.github/workflows/main.yml#L165

snyk-container-scan:
  needs: [build-image]
  uses: cyber-dojo/snyk-scanning/.github/workflows/artifact_snyk_test.yml@main
  with:
    artifact_name: ${{ needs.build-image.outputs.tagged_image_name }}
    kosli_flow: ${{ vars.KOSLI_FLOW }}
    kosli_trail: ${{ github.sha }}
    kosli_attestation_name: runner.snyk-container-scan
  secrets:
    snyk_token: ${{ secrets.SNYK_TOKEN }}
    kosli_api_token: ${{ secrets.KOSLI_API_TOKEN }}

The artifact_name is the just-built image we want to scan.
The flow is named after the repo (eg runner-ci for the runner repo).
The trail name is the git commit.
The attestation name is runner.snyk-container-scan
The snyk_token gives permission to run the scan
The kosli_api_token gives permission to write the attestation

The workflow’s subsequent sdlc-control-gate job “gates” the deployment of the runner artifact to its aws-beta environment, using kosli assert artifact

runner/.github/workflows/main.yml#L315

  sdlc-control-gate:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    needs:
      - build-image
      - snyk-container-scan
      ...
    env:
      KOSLI_FINGERPRINT: ${{ needs.build-image.outputs.digest }}
    steps:
      ...
      - name: Kosli SDLC gate to short-circuit the workflow
        run:
          kosli assert artifact
            --environment="${KOSLI_AWS_BETA}"

The environment policy of aws-beta includes a compliant snyk attestation:

if the artifact has a non-compliant snyk attestation the kosli assert artifact command will exit with non-zero, we will not reach the deployment job, and the image will not be deployed to aws-beta.
if the artifact has a compliant snyk attestation, and meets all other aws-beta policy requirements, the kosli assert artifact command will exit with zero, we will reach the deployment job, and the image will be deployed to aws-beta

2. Promotion workflows

The aws-prod-co-promotion repo holds workflows for promoting one or more cyber-dojo artifacts from aws-beta to aws-prod. The promote_all.yml workflow runs the core artifact_snyk_test.yml workflow for each artifact (found using kosli get snapshot), this time applying the stricter aws-prod Rego params:

aws-prod-co-promotion/.github/workflows/promote_all.yml#L91

snyk-scan:
  needs: [setup, find-promotions]
  strategy:
    matrix:
      include: ${{ fromJSON(needs.find-promotions.outputs.promotions) }}
  uses: cyber-dojo/snyk-scanning/.github/workflows/artifact_snyk_test.yml@main
  with:
    artifact_name:          ${{ matrix.incoming_image_name }}
    kosli_flow:             ${{ vars.KOSLI_FLOW }}
    kosli_trail:            ${{ needs.setup.outputs.kosli_trail }}
    kosli_attestation_name: ${{ matrix.incoming_repo_name }}.snyk-scan
    kosli_env:              ${{ vars.KOSLI_AWS_PROD }}
    repo_name:              ${{ matrix.incoming_repo_name }}
    raw_snyk_policy_url:    https://raw.githubusercontent.com/cyber-dojo/${{ matrix.incoming_repo_name }}/${{ matrix.incoming_commit_sha }}/.snyk
  secrets:
    snyk_token:      ${{ secrets.SNYK_TOKEN }}
    kosli_api_token: ${{ secrets.KOSLI_API_TOKEN }}

Note:

The snyk-scan jobs run in parallel for all artifacts being promoted.
Each “aggregate” attestation (one per artifact) is written to a flow representing this workflow’s promotion process: vars.KOSLI_FLOW == production-promotion.
kosli_trail is promote-all-${{ github.run_number }}.
kosli_env (which defaults to aws-beta) is set to aws-prod, to pick up the stricter thresholds in rego.params.aws-prod.json.
raw_snyk_policy_url identifies the .snyk policy file to use (it defaults to the .snyk file in the current commit). Here it is built from matrix.incoming_commit_sha, the exact commit SHA that built the artifact being promoted, not HEAD.

Once again, the workflow’s subsequent sdlc-control-gate job “gates” the deployment of the artifacts to the aws-prod environment, using kosli assert artifact

aws-prod-co-promotion/.github/workflows/promote_all.yml#L135

  sdlc-control-gate:
    if: ${{ needs.find-promotions.outputs.promotions != '[]' }}
    needs:
      - setup
      - find-promotions
      - snyk-scan
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include: ${{ fromJSON(needs.find-promotions.outputs.promotions) }}
    env:
      KOSLI_TRAIL: ${{ needs.setup.outputs.kosli_trail }}
      ...
    steps:
      ...
      - name: Assert Artifact is compliant for aws-prod
        run: |
          ...
          kosli assert artifact \
            --fingerprint "${{ matrix.incoming_fingerprint }}" \
            --environment "${KOSLI_AWS_PROD}"

The environment policy of aws-prod also includes a compliant snyk attestation:

if any artifact has a non-compliant snyk attestation the kosli assert artifact command will exit with non-zero, we will not reach the deployment job, and no images will be promoted to aws-prod.
if all artifacts have a compliant snyk attestation, and meet all other aws-prod policy requirements, the kosli assert artifact commands will exit with zero, we will reach the final deployment job, and all images will be deployed to aws-prod

3. Continuous environment scanning workflows

An image can pass its Snyk scan at build or promotion time but become vulnerable later. CVE databases are updated continuously, and new vulnerabilities are regularly discovered in packages that were considered safe when the image was built or deployed.

Two scheduled workflows, aws-beta.yml and aws-prod.yml, run once a day and scan every artifact currently running in each environment. They also trigger automatically when the Rego file or the relevant params file changes.

They call env_snyk_test.yml, which finds all artifacts currently running in the environment, again using kosli get snapshot, and fans out to artifact_snyk_test.yml via a matrix strategy for each artifact:

find-artifacts:
  ...
  steps:
    - name: Generate JSON for each Artifact in KOSLI_ENV
      id: set-artifacts
      run: |
        artifacts="$(make artifacts | jq --raw-output --compact-output .)"
        echo "artifacts=${artifacts}" >> ${GITHUB_OUTPUT}        

artifact-snyk-test:
  needs: find-artifacts
  strategy:
    matrix:
      include: ${{fromJSON(needs.find-artifacts.outputs.artifacts)}}
  uses: ./.github/workflows/artifact_snyk_test.yml
  with:
    artifact_name:          ${{matrix.artifact_name}}
    kosli_flow:             ${{inputs.kosli_flow}}
    kosli_trail:            ${{matrix.repo_name}}-${{matrix.artifact_fingerprint}}
    kosli_attestation_name: ${{matrix.repo_name}}.snyk-container-scan
    kosli_env:              ${{inputs.kosli_env}}
    raw_snyk_policy_url:    ${{matrix.raw_snyk_policy_url}}
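
The step that turns a list of running artifacts into a matrix could look roughly like the sketch below. The input shape (name, fingerprint, repo_url, commit_sha) is a hypothetical simplification and is not the real kosli get snapshot JSON; only GitHub URLs are handled here, and the function names are illustrative. The output keys match the matrix fields used in the YAML above.

```python
import json
from urllib.parse import urlparse


def matrix_entry(artifact: dict) -> dict:
    """Convert one (hypothetically shaped) artifact record into a matrix entry."""
    path = urlparse(artifact["repo_url"]).path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    org, repo_name = path.split("/")[:2]
    sha = artifact["commit_sha"]
    return {
        "artifact_name": artifact["name"],
        "artifact_fingerprint": artifact["fingerprint"],
        "repo_name": repo_name,
        # Pin the .snyk file to the commit that built the artifact, not HEAD.
        "raw_snyk_policy_url": f"https://raw.githubusercontent.com/{org}/{repo_name}/{sha}/.snyk",
    }


def to_matrix(artifacts: list) -> str:
    """Return compact JSON suitable for fromJSON() in a matrix include."""
    return json.dumps([matrix_entry(a) for a in artifacts], separators=(",", ":"))
```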

The attestations are written to a dedicated flow, called snyk-{env}-per-artifact, which holds one trail per artifact fingerprint, named {repo_name}-{fingerprint}. Scanning the same artifact twice produces a new attestation on the same trail. This records the full scan history for each artifact fingerprint as it runs in the environment.

As always, each attestation is made against the artifact fingerprint. Any non-compliant attestation will cause the environment to become non-compliant.

We need to know how many days remain until the next vulnerability causes non-compliance, so that we can decide when to act and try to avoid it. A workflow job calls a simple Python script, find_expiring_vulns.py, which finds this information by reading the vuln-*.json files produced during the current scan run and sends a message to a Slack channel. If the workflow fails, we also detect that and send an error message to the Slack channel.
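
The "days remaining" calculation can be sketched as follows. This is an illustration of the idea rather than the contents of find_expiring_vulns.py: the function names are hypothetical, and the field names are those used by the Rego policy.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def days_remaining(vuln: dict, params: dict):
    """Days until a currently-compliant vulnerability turns non-compliant.

    Returns None when the vulnerability is already non-compliant.
    """
    now = vuln["now_ts"]
    if vuln["ignore_expires_exists"]:
        remaining = (vuln["ignore_expires_ts"] - now) / SECONDS_PER_DAY
        if remaining < 0 or remaining > params["max_ignore_expiry_days"]:
            return None  # expired, or expiry too far ahead
        return remaining
    age = (now - vuln["first_seen_ts"]) / SECONDS_PER_DAY
    remaining = params["max_days_by_severity"][vuln["severity"]] - age
    return remaining if remaining > 0 else None


def next_deadline(vulns: list, params: dict):
    """Return (days, full_id) for the vulnerability closest to its deadline."""
    candidates = []
    for vuln in vulns:
        days = days_remaining(vuln, params)
        if days is not None:
            candidates.append((days, vuln["full_id"]))
    return min(candidates, default=None)
```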

slack message showing when next vulnerability will cause non-compliance

The Slack message includes two links: one to a how-to-respond guide, and one to the GitHub workflow log, which includes step summaries of all vulnerabilities per artifact/repo, plus a table showing the vulnerabilities across all artifacts by severity.

GitHub workflow log summary showing snyk vulnerabilities per artifact

GitHub workflow log summary showing snyk vulnerabilities per severity

Testing

The tests live in the tests/ dir:

test_rego_rules.sh tests the Rego policy directly using kosli evaluate input, constructing JSON input for each case and asserting on allow and violations. It covers all four compliance cases against both the aws-beta and aws-prod params files. It also has tests that guard against OPA’s undefined behaviour.
test_rego_params.sh guards the invariant that every aws-prod limit is at most the equivalent aws-beta limit. It checks all four max_days_by_severity values (critical, high, medium, low) and max_ignore_expiry_days.
test_combine_snyk.sh exercises bin/combine_snyk.py end-to-end with real SARIF fixture files and .snyk YAML fixtures. Test cases: zero vulnerabilities, one medium vulnerability without an ignore entry, one medium vulnerability with an active ignore entry, and two vulnerabilities.
test_artifacts.sh covers bin/artifacts.py, which converts the kosli get snapshot JSON into a GitHub Actions matrix. It tests GitHub and GitLab repository URL formats.
test_find_expiring_vulns_logic.py covers bin/find_expiring_vulns.py, which identifies currently-compliant vulnerabilities that are approaching their deadline. It tests the two functions that drive the decision: dot_snyk_result (returns a result when a .snyk ignore entry has a future expiry date) and rego_result (returns a result when vulnerability age is still within the per-severity limit).
test_print_expiring_vulns_summary.sh covers bin/print_expiring_vulns_summary.py, which formats the JSON output from find_expiring_vulns.py as a Markdown step summary with one table per Snyk severity level (low, medium, high, critical), each sorted by days_remaining ascending.
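
As an illustration of the params invariant guarded by test_rego_params.sh, the check could be written in Python like this. The function name is hypothetical, and any aws-beta values you pass in are your own; this post only lists the aws-prod thresholds.

```python
SEVERITIES = ("critical", "high", "medium", "low")


def prod_exceeds_beta(prod: dict, beta: dict) -> list:
    """Return a description of every aws-prod limit that is looser than aws-beta.

    An empty list means the invariant holds: each aws-prod limit is at most
    the equivalent aws-beta limit.
    """
    problems = []
    for severity in SEVERITIES:
        p = prod["max_days_by_severity"][severity]
        b = beta["max_days_by_severity"][severity]
        if p > b:
            problems.append(f"max_days_by_severity.{severity}: prod {p} > beta {b}")
    p = prod["max_ignore_expiry_days"]
    b = beta["max_ignore_expiry_days"]
    if p > b:
        problems.append(f"max_ignore_expiry_days: prod {p} > beta {b}")
    return problems
```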

Summary

We started with three problems: ignored vulnerabilities were silently filtered out of the SARIF output before Kosli ever saw them, so there was no visibility into what was being suppressed or why; a single all-or-nothing snyk attestation meant any new CVE blocked deployment regardless of severity; and with one attestation covering all vulnerabilities, there was no way to track individual CVEs, their age, or their compliance status over time.

Three workflows, CI build, promotion, and continuous scanning, now share the same artifact_snyk_test.yml workflow, with different inputs for kosli_env, kosli_flow, kosli_trail, and kosli_attestation_name. The Rego policy is the single source of truth for compliance rules. The per-environment params files control the thresholds.

kosli evaluate trail(s) is what ties it all together and makes an aggregate attestation architecture viable: one data attestation per vulnerability, evaluated individually, aggregated into a single artifact-level result.

New CVEs no longer block the development workflow the moment they appear, causing bursts of Snyk whack-a-mole. Neither do they necessarily cause an environment to immediately turn non-compliant. Both are controlled by a grace period determined by policy.