Taint-Tracking Workflows to Prevent Injection from Generated Code

AI-generated code and runtime inputs increase the risk surface for injection and path traversal. This article gives practical taint-tracking workflows you can add to CI, concrete rule and configuration patterns for CodeQL and Semgrep, and remediation/triage steps for findings that commonly arise from generated code.

1) Decide your scope and threat model

Define taint sources (user input, agent/tool return values, request bodies, LLM outputs) and sinks (eval, subprocess calls, SQL execution, filesystem APIs, template renderers). Treat LLM outputs that are interpolated or passed to tool functions as tainted by default.

2) CI placement and ordering

Run lightweight language-level type checks first (mypy/tsc) to reduce noise, then run taint-enabled analyzers in parallel where possible. Prefer running Semgrep (fast rule-based taint) on every PR and CodeQL (deeper interprocedural taint) on merge/pre-release branches or nightly builds.

3) Semgrep: fast taint rules for PR feedback

Configuration highlights:

– Use mode: taint in rule YAML and declare pattern-sources, pattern-sinks, optional pattern-propagators and pattern-sanitizers.

– Treat agent/LLM API calls and tool wrappers as sources. Example source pattern (Python):

    pattern-sources:
      - pattern: llm.generate(...)

– Define sinks for dangerous APIs (e.g., subprocess.run($C), cursor.execute($Q), template engines). Add sanitizers that your codebase uses (parameterized queries, shlex.quote, prepared statements).

– If you need interprocedural traces across files, enable Semgrep Pro interfile options in CI; otherwise intrafile interprocedural analysis (--pro-intrafile) gives useful cross-function traces inside a file.

– Tune for generated-code patterns: flag interpolation into format strings, f-strings, .format(), or shell concatenation where the format contains tainted metavariables.
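The patterns above can be exercised with a small Python target file of the kind Semgrep's rule-testing mode consumes. This is only a sketch under stated assumptions: the rule ID llm-output-to-shell is hypothetical, StubLLM stands in for a real LLM client, and the rule is assumed to declare llm.generate(...) as a source, subprocess.run(..., shell=True) as a sink, and shlex.quote(...) as a sanitizer. The "# ruleid:" and "# ok:" comments mark the lines such a rule would be expected to flag or leave alone.

```python
import shlex
import subprocess
import sys


class StubLLM:
    """Hypothetical stand-in for a real LLM client; returns canned hostile text."""

    def generate(self, prompt: str) -> str:
        return "report.txt; echo injected"


llm = StubLLM()

# Portable child process that prints its first argument verbatim.
ECHO_ARG = [sys.executable, "-c", "import sys; print(sys.argv[1])"]


def unsafe_fstring(prompt: str) -> None:
    name = llm.generate(prompt)  # taint source
    # ruleid: llm-output-to-shell
    subprocess.run(f"cat {name}", shell=True)  # tainted f-string reaches a shell


def unsafe_format(prompt: str) -> None:
    name = llm.generate(prompt)
    cmd = "cat {}".format(name)  # taint propagates through .format()
    # ruleid: llm-output-to-shell
    subprocess.run(cmd, shell=True)


def ok_quoted(prompt: str) -> None:
    name = llm.generate(prompt)
    # ok: llm-output-to-shell
    subprocess.run("cat " + shlex.quote(name), shell=True)  # sanitizer applied


def ok_argument_list(prompt: str) -> str:
    name = llm.generate(prompt)
    # No shell is involved: the text is passed as a single argv entry.
    # ok: llm-output-to-shell
    result = subprocess.run(ECHO_ARG + [name], capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Only ok_argument_list is safe to call as written; it returns the hostile text unchanged, which shows that an argument list hands the value to the child process as data and not as shell syntax. If your rule treats every subprocess.run call as a sink, the last annotation would need to change accordingly.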

4) CodeQL: deep, interprocedural taint tracking

Key practices:

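– Use the current module-based CodeQL data-flow API: implement DataFlow::ConfigSig (the isSource and isSink predicates, plus isBarrier for sanitizers) and instantiate it with TaintTracking::Global<...>. The older class-based TaintTracking::Configuration is deprecated. Make LLM-related APIs explicit sources (e.g., a wrapper function that returns generated text).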

– Create custom flow states for high-risk paths: require that flows pass through particular checks or sanitizers by encoding them as state transitions in a DataFlow::StateConfigSig module, instantiated with TaintTracking::GlobalWithState<...>.

– Run CodeQL on merge or nightly; use its variant-analysis capabilities to find multiple occurrences of the same fragile pattern.

5) Rule and sanitizer examples to reduce false positives

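– Treat the bound-parameter argument of parameterized DB calls (the params in cursor.execute(query, params)) as safe for SQL-injection sinks. The query string itself remains a sink, because a tainted query is still injectable even when parameters are bound.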

– Treat canonical escaping helpers (shlex.quote, html.escape, prepared-statement APIs) as sanitizers; add project-specific wrappers if used.

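– For path traversal, consider sanitizers that canonicalize a path and then verify it (os.path.realpath followed by a check that the result lies under an allowed base directory; os.path.abspath alone does not resolve symlinks), or safe APIs that only open files beneath a sandboxed base directory. Declare those as sanitizers in taint configs.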
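A path sanitizer of the kind described above might look like the following sketch. The function name resolve_under is hypothetical, and the approach (resolve symlinks, then compare against the base with os.path.commonpath) is one reasonable choice, not the only one. A project-specific helper like this is what you would then declare as a sanitizer in your taint configuration.

```python
import os


def resolve_under(base_dir: str, untrusted: str) -> str:
    """Return the real path of `untrusted` inside `base_dir`, or raise ValueError."""
    base = os.path.realpath(base_dir)
    # An absolute `untrusted` replaces `base` in the join; the check below catches that.
    candidate = os.path.realpath(os.path.join(base, untrusted))
    # commonpath compares whole path components, so "/srv/data-evil" does not
    # pass as a child of "/srv/data" the way a plain startswith() test would.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes base directory: {untrusted!r}")
    return candidate
```

Callers open the returned path, never the original input, so the value that reaches the filesystem API is always the checked one.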

6) Typical findings from AI-generated code and triage patterns

– Finding: tainted string reaches subprocess or eval. Triage: verify whether the value flows from an LLM/agent source; if so, check for missing escaping or improper concatenation. Remediation: pass an argument list (subprocess.run([cmd, arg])) instead of a command string with shell=True, validate/whitelist inputs, or sandbox execution.

– Finding: tainted value used in SQL or template. Triage: confirm whether parameterized queries or template auto-escaping are actually in use. Remediation: convert to parameter binding, escape or validate against a whitelist, or move templating to safe renderers.

– Finding: path input reaches filesystem APIs. Triage: check for path normalization and directory-escape checks. Remediation: canonicalize and verify that path resides under an allowed base directory; reject suspicious segments (“..”) and enforce whitelists.
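For the SQL finding, the difference between the flagged pattern and its remediation can be shown with the standard-library sqlite3 module. The table, function names, and payload below are illustrative assumptions, not taken from any particular codebase.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Tainted text is spliced into the SQL string: the pattern a taint rule should flag.
    return conn.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # The query text is constant; the tainted value travels only as a bound parameter.
    return conn.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Hypothetical hostile value, e.g. text returned by an LLM or agent tool.
payload = "nobody' OR '1'='1"

leaked = find_user_unsafe(conn, payload)   # the OR clause matches every row
blocked = find_user_safe(conn, payload)    # the payload is compared as a literal name
```

With the interpolated query the payload rewrites the WHERE clause and every row comes back; with parameter binding the same payload matches nothing.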

7) CI workflow and alert handling

– Block PR merges on high-confidence taint-to-sink findings by default; report low-confidence findings, and findings in generated test fixtures, as warnings.

– Include dataflow traces in the issue template: source location, propagation steps, sink location, and suggested sanitizer. Store reproducible minimal examples when possible to speed triage.
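Both Semgrep and CodeQL can emit SARIF, so a merge gate can be sketched as a small script over that output. The sketch below assumes each result carries an explicit level field and that "error" is what your rules use for high-confidence flows. The function names and the tests/fixtures/ prefix are hypothetical.

```python
def _where(location: dict) -> str:
    """Format a SARIF location object as 'uri:line'."""
    phys = location.get("physicalLocation", {})
    uri = phys.get("artifactLocation", {}).get("uri", "?")
    line = phys.get("region", {}).get("startLine", 0)
    return f"{uri}:{line}"


def blocking_findings(sarif: dict, ignored_prefixes: tuple = ("tests/fixtures/",)) -> list:
    """Collect error-level results outside ignored paths, with their dataflow steps."""
    findings = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            if result.get("level") != "error" or not result.get("locations"):
                continue
            sink = _where(result["locations"][0])
            if sink.startswith(ignored_prefixes):
                continue  # generated fixtures are reported elsewhere as warnings
            # Flatten codeFlows -> threadFlows -> locations into a source-to-sink trace.
            trace = [
                _where(step.get("location", {}))
                for flow in result.get("codeFlows", [])
                for thread in flow.get("threadFlows", [])
                for step in thread.get("locations", [])
            ]
            findings.append({"rule": result.get("ruleId", "?"), "sink": sink, "trace": trace})
    return findings
```

A CI step could load the SARIF file with json.load, fail the job when the returned list is non-empty, and paste each finding's trace into the issue template.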

8) Continuous tuning

– Periodically add rules targeting repeated mistakes from your LLM prompts (e.g., unsafe helper code patterns). Maintain a suppressions/false-positive list separate from the rule set so suppressions require justification and a reviewer.

– Use variant analysis (CodeQL) or aggregated Semgrep dashboards to find classes of repeated defects introduced by generated code and update prompt templates or code-generation adapters to avoid them.
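The requirement that suppressions carry a justification and a reviewer can be enforced mechanically. The following sketch assumes a hypothetical Suppression record and an arbitrary minimum justification length of 20 characters; both are illustrative choices, not a format that Semgrep or CodeQL defines.

```python
from dataclasses import dataclass

MIN_JUSTIFICATION_CHARS = 20  # arbitrary threshold chosen for this sketch


@dataclass(frozen=True)
class Suppression:
    rule_id: str
    path: str
    justification: str
    reviewer: str


def invalid_suppressions(entries: list) -> list:
    """Return (entry, reason) pairs for suppressions missing required metadata."""
    problems = []
    for entry in entries:
        if len(entry.justification.strip()) < MIN_JUSTIFICATION_CHARS:
            problems.append((entry, "justification too short"))
        elif not entry.reviewer.strip():
            problems.append((entry, "missing reviewer"))
    return problems
```

Running a check like this in CI over the suppressions file keeps the list reviewable without touching the rule set itself.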

9) Example quick-start checklist for CI

1. Run type checks (mypy/tsc).
2. Run Semgrep taint rules (PR).
3. On merge/nightly run CodeQL taint queries.
4. Block on high-confidence flows to dangerous sinks.
5. Triage and remediate with sanitizers or API changes.
6. Add targeted Semgrep/CodeQL rules for recurring patterns.

Implementing the steps above turns taint tracking into a focused defense against injection and path-traversal introduced by generated code, while keeping the signal-to-noise ratio manageable for developer workflows.
