USAGE.md


Usage guide

Running an experiment requires three steps:

  1. Install dependencies.
  2. Set up LLM access.
  3. Launch the experiment.

Prerequisites

Dependencies

You must install:

  1. Python 3.11
  2. pip
  3. python3.11-venv
  4. Git
  5. Docker
  6. Google Cloud SDK
  7. c++filt (must be available in PATH)
  8. clang-format (optional; used by project_src.py)

Python Dependencies

Install required dependencies in a Python virtual environment:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

LLM Access

Set up Vertex AI or OpenAI access with the following steps.

Vertex AI

Accessing Vertex AI models requires a Google Cloud Platform (GCP) project with Vertex AI enabled.

Then authenticate to GCP:

gcloud auth login
gcloud auth application-default login
gcloud auth application-default set-quota-project <your-project>

You'll also need to specify the GCP projects and locations where you have Vertex AI quota (comma delimited):

export CLOUD_ML_PROJECT_ID=<gcp-project-id>
export VERTEX_AI_LOCATIONS=us-west1,us-west4,us-east4,us-central1,northamerica-northeast1

OpenAI

OpenAI requires an API key.

Then set it as an environment variable:

export OPENAI_API_KEY='<your-api-key>'

Running experiments

To generate and evaluate the fuzz targets in a benchmark set via local experiments:

./run_all_experiments.py \
    --model=<model-name> \
    --benchmarks-directory='./benchmark-sets/comparison' \
    [--ai-binary=<llm-access-binary>] \
    [--template-directory=prompts/custom_template] \
    [--work-dir=results-dir] \
    [...]
# E.g., generate fuzz targets for TinyXML-2 with default template and fuzz for 30 seconds.
# ./run_all_experiments.py -y ./benchmark-sets/comparison/tinyxml2.yaml

where the <model-name> can be:

  1. vertex_ai_code-bison or vertex_ai_code-bison-32k for the Code Bison models on Vertex AI.
  2. vertex_ai_gemini-pro for Gemini Pro on Vertex AI.
  3. gpt-3.5-turbo or gpt-4 for OpenAI.

Experiments can also be run on Google Cloud using Google Cloud Build. You can do this by passing --cloud <experiment-name> --cloud-experiment-bucket <bucket>, where <bucket> is the name of a Google Cloud Storage bucket in your Google Cloud project.

Benchmarks

We currently offer two sets of benchmarks:

  1. comparison: A small selection of OSS-Fuzz C/C++ projects.
  2. all: All benchmarks across all OSS-Fuzz C/C++ projects.

Visualizing Results

Once finished, the framework will output experiment results like this:

================================================================================
*<project-name>, <function-name>*
build success rate: <build-rate>, crash rate: <crash-rate>, max coverage: <max-coverage>, max line coverage diff: <max-coverage-diff>
max coverage sample: <results-dir>/<benchmark-dir>/fixed_targets/<LLM-generated-fuzz-target>
max coverage diff sample: <results-dir>/<benchmark-dir>/fixed_targets/<LLM-generated-fuzz-target>

where <build-rate> is the fraction of LLM-generated fuzz targets that compile (e.g., 0.5 if 4 out of 8 generated targets build), <crash-rate> is the fraction that crash at run time, <max-coverage> is the maximum line coverage achieved by any target, and <max-coverage-diff> is the maximum new line coverage of the LLM-generated targets over the existing human-written targets in OSS-Fuzz.
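These summary rates are simple ratios, which can be sketched as follows (illustrative Python only; the function names are made up for this sketch and are not the framework's actual code):

```python
# Illustrative computation of the summary rates (hypothetical helpers;
# not the framework's actual implementation).
def build_success_rate(num_building: int, num_generated: int) -> float:
    """Fraction of LLM-generated fuzz targets that compile."""
    return num_building / num_generated

def crash_rate(num_crashing: int, num_generated: int) -> float:
    """Fraction of generated fuzz targets that crash at run time."""
    return num_crashing / num_generated

# E.g., 4 of 8 generated targets build and 1 of 8 crashes:
print(build_success_rate(4, 8))  # 0.5
print(crash_rate(1, 8))          # 0.125
```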

Note that <max-coverage> and <max-coverage-diff> are computed based on the code linked against the fuzz target, not the whole project. For example:

================================================================================
*tinyxml2, tinyxml2::XMLDocument::Print*
build success rate: 1.0, crash rate: 0.125, max coverage: 0.29099427381572096, max line coverage diff: 0.11301753077209996
max coverage sample: <result-dir>/output-tinyxml2-tinyxml2-xmldocument-print/fixed_targets/08.cpp
max coverage diff sample: <result-dir>/output-tinyxml2-tinyxml2-xmldocument-print/fixed_targets/08.cpp

Results report

To visualize these results via a web UI, with more details on the exact prompts used, samples generated, and other logs, run:

python -m report.web <results-dir> <port>

Where <results-dir> is the directory passed to --work-dir in your experiments (default value ./results).

Then navigate to http://localhost:<port> to view the result in a table.

Detailed workflows

Configure and use the framework in the following five steps:

  1. Configure benchmark
  2. Set up prompt template
  3. Generate fuzz target
  4. Fix compilation error
  5. Evaluate fuzz target

Configure Benchmark

Prepare a benchmark YAML that specifies the function to test; here is an example. Follow the link above to automatically generate one for a C/C++ project in OSS-Fuzz. Note that the project under test needs to be integrated into OSS-Fuzz to build.
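As a rough illustration of what such a file might contain (the field names below are a guess for this sketch; consult the linked example for the authoritative schema):

```yaml
# Illustrative structure only; see the example benchmark YAML in the
# repository for the real field names.
project: tinyxml2
language: c++
functions:
  - name: tinyxml2::XMLDocument::Print
    signature: void tinyxml2::XMLDocument::Print(tinyxml2::XMLPrinter *)
```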

Setup Prompt Templates

Prepare prompt templates. The LLM prompt will be constructed from the files in this directory. It starts with a priming that defines the main goal and important notices, followed by some example problems and solutions. Each example problem is in the same format as the final problem (i.e., a function signature to fuzz), and its solution is the corresponding human-written fuzz target for a different function from the same project or another project. The prompt can also include more information about the function (e.g., its usage, source code, or parameter type definitions) and model-specific notes (e.g., common pitfalls to avoid).

You can pass an alternative template directory via --template-directory. The new template directory does not have to include all files: the framework falls back to the files in template_xml/ for any that are missing. The default prompt is structured as follows:

<Priming>
<Model-specific notes>
<Examples>
<Final question + Function information>
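The fallback behavior can be pictured as follows (a minimal sketch; the section file names and helper functions are hypothetical, not the framework's actual code):

```python
from pathlib import Path

# Sketch of the template-fallback logic described above: use a file from
# the custom template directory when it exists, else the default one.
def load_template(name: str, custom_dir: Path, default_dir: Path) -> str:
    """Return the custom template if present, else the default."""
    custom = custom_dir / name
    if custom.exists():
        return custom.read_text()
    return (default_dir / name).read_text()

def build_prompt(custom_dir: Path, default_dir: Path) -> str:
    """Concatenate the prompt sections in the order shown above."""
    sections = ["priming.txt", "notes.txt", "examples.txt", "question.txt"]
    return "\n".join(load_template(s, custom_dir, default_dir)
                     for s in sections)
```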

Generate Fuzz Target

The script run_all_experiments.py generates fuzz targets via the LLM using the prompt constructed above and measures their code coverage. All experiment data is saved into the --work-dir.

Fix Compilation Error

When a fuzz target fails to build, the framework automatically makes up to five attempts to fix it before giving up. Each attempt asks the LLM to fix the fuzz target based on the build error from OSS-Fuzz, parses the source code from the response, and re-compiles it.
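The retry loop amounts to the following (a sketch with stubbed compile and LLM calls; the names are invented for illustration and are not the framework's actual code):

```python
# Sketch of the build-fix loop described above. `compiles` and
# `ask_llm_to_fix` are hypothetical callables standing in for the
# OSS-Fuzz build and the LLM fix request.
MAX_FIX_ATTEMPTS = 5

def fix_until_builds(target: str, compiles, ask_llm_to_fix):
    """Retry up to MAX_FIX_ATTEMPTS fixes; return (target, built?)."""
    for _ in range(MAX_FIX_ATTEMPTS):
        if compiles(target):
            return target, True
        # Re-prompt the LLM with the build failure and parse the
        # fixed source code from its response.
        target = ask_llm_to_fix(target)
    return target, compiles(target)
```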

Evaluate Fuzz Target

If the fuzz target compiles successfully, the framework fuzzes it with libFuzzer and measures its line coverage. The fuzzing timeout is specified by the --run-timeout flag. Its line coverage is also compared against existing human-written fuzz targets from OSS-Fuzz in production.
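Conceptually, the coverage-diff comparison counts lines hit by the LLM-generated target but by no existing target, over the lines linked into the binary (a hypothetical helper for illustration; not the framework's implementation):

```python
# Illustrative sketch of the line-coverage-diff metric described above.
def line_coverage_diff(llm_covered: set, human_covered: set,
                       total_lines: int) -> float:
    """Fraction of linked lines hit by the LLM-generated target but
    by no existing human-written target."""
    return len(llm_covered - human_covered) / total_lines
```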

Development

Contribution process

Development environment

Auto Format / Lint

You can add a Git pre-push hook to auto-format and lint your code:

./helper/add_pre-push_hook

Or run the formatter/linter manually:

.github/helper/presubmit

Updating Dependencies

We use https://github.com/jazzband/pip-tools to manage our Python dependencies.

# Edit requirements.in, then:
pip install pip-tools         # provides pip-compile
pip-compile requirements.in   # regenerates requirements.txt
pip install -r requirements.txt