Llama-Lawyer Experiment

This is a collection of tools and config files we use to fine-tune, benchmark and deploy Llama-3 as a text classifier for Cavil.

Background

Cavil uses a pattern-matching system to identify potential legal text in source code. This process is based on identifying hot zones of legal keywords (snippets) and produces around 80% false positives. Historically, these false positives had to be sorted out by humans, so a few years ago we started using a Character-level Convolutional Network to automate much of this process with machine learning. Today we train the model on 150,000 samples and reach about 96% accuracy. The model has to be re-trained regularly though to maintain that level.

Large language models such as Llama-3 present an opportunity for a text classifier that can reach a similar level of accuracy, but that already has a deep enough understanding of human language to require much less re-training.

Preparation

You need:

  1. A checkout of this repo.
  2. A copy of the Meta-Llama-3-8B-Instruct base model (one way to download it is sketched below).
  3. The not yet open sourced 150,000 samples of training data from the SUSE production instance of Cavil. Sorry!
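
The download sketch below assumes the huggingface_hub CLI is available and that Meta's license for the model has been accepted on Hugging Face; any other method that places the original checkpoint under /tmp/Meta-Llama-3-8B-Instruct works just as well:

# Download the base model to the path used by the commands below
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir /tmp/Meta-Llama-3-8B-Instruct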

Process

# Install dependencies
python -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt
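
# Optional: fine-tuning requires a CUDA-capable GPU, so confirm PyTorch
# can see one (assumes requirements.txt pulls in torch)
./.venv/bin/python -c 'import torch; print(torch.cuda.is_available())'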

# Convert full LegalDB training data to alpaca format (ready for upload to HF)
./.venv/bin/python convert.py -i legaldb-ml-data -o data.json -f alpaca
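
# Optional: spot-check the conversion; alpaca records use "instruction",
# "input" and "output" fields (the exact prompt text comes from convert.py)
./.venv/bin/python -m json.tool data.json | head -20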

# Convert subset of LegalDB training data to datasets format (1000 samples of each type for testing)
./.venv/bin/python convert.py -i legaldb-ml-data -o legaldb-ml-data-small.jsonl -f datasets -l 1000
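
# Optional: sanity-check the subset; the datasets format is JSON Lines,
# so assuming two sample types and -l 1000 this should print about 2000
wc -l legaldb-ml-data-small.jsonl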

# Test to get a baseline accuracy (plain Llama-3-8B already reaches about 75%)
./.venv/bin/python test.py -i legaldb-ml-data-small.jsonl -m /tmp/Meta-Llama-3-8B-Instruct

# Fine-tune Llama-3 with torchtune and LegalDB training data (takes about 8 hours on an RTX 4090)
./.venv/bin/tune run lora_finetune_single_device --config torchtune.yaml dataset.source=kraih/legaldb-training-full-0.1
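
# Optional: torchtune accepts further key=value overrides for quick
# experiments (field names are assumptions and must match torchtune.yaml)
./.venv/bin/tune run lora_finetune_single_device --config torchtune.yaml dataset.source=kraih/legaldb-training-full-0.1 epochs=1 batch_size=2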

# HACK: Convert Llama-3 checkpoint to a format transformers will accept
# (see https://github.com/pytorch/torchtune/issues/832 for more)
cp /tmp/Meta-Llama-3-8B-Instruct/meta_model_0.pt /tmp/Meta-Llama-3-8B-Instruct/original/consolidated.00.pth
./.venv/bin/python convert_model.py --input_dir /tmp/Meta-Llama-3-8B-Instruct --output_dir /tmp/Meta-Llama-3-8B-Instruct-Cavil-hf
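
# Optional: confirm transformers can load the converted checkpoint
# (loads the full model into memory, so this takes a moment)
./.venv/bin/python -c 'from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained("/tmp/Meta-Llama-3-8B-Instruct-Cavil-hf")'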

# Test again to verify improved accuracy (should be about 96% now)
./.venv/bin/python test.py -i legaldb-ml-data-small.jsonl -m /tmp/Meta-Llama-3-8B-Instruct-Cavil-hf

# Start server for use with Cavil
./.venv/bin/python server.py -p 5000 -m /tmp/Meta-Llama-3-8B-Instruct-Cavil-hf

# Verify server works properly, expected result: {"license":true, "confidence":"92.39"}
curl -X POST --data '# MIT License' http://127.0.0.1:5000
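
# Also check the negative case; a non-legal snippet should yield something
# like {"license":false, "confidence":"..."} (exact confidence will vary)
curl -X POST --data 'int main() { return 0; }' http://127.0.0.1:5000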
