Llama 3 on Vertex AI returning bogus responses

Hi,

I have a llama3-70b-001 model deployed to Vertex AI via the Model Garden. I want to get predictions via the REST API from a Node.js application.

Here's the request I am making:

 

const response = await fetch(`https://${region}-aiplatform.googleapis.com/v1/projects/${project}/locations/${region}/endpoints/${endpoint}:predict`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    instances: [
      {
        prompt: 'You are a career advisor. Give me 10 tips for a good CV.',
      },
    ],
    parameters: {
      max_output_tokens: maxTokens,
      temperature,
    },
  }),
  cache: 'no-store',
});
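
In case it matters: token is an OAuth access token with the cloud-platform scope. I'm getting it roughly like this with google-auth-library (assuming Application Default Credentials are set up):

import { GoogleAuth } from 'google-auth-library';

// Uses Application Default Credentials (e.g. gcloud auth application-default login)
const auth = new GoogleAuth({
  scopes: 'https://www.googleapis.com/auth/cloud-platform',
});
const token = await auth.getAccessToken();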

 

Here's the response I am getting.

 

{
  predictions: [
    'Prompt:\n' +
      'You are a career advisor. Give me 10 tips for a good CV.\n' +
      'Output:\n' +
      ' _Use the phrases in the box_.\n' +
      '\\begin{tabular}{l'
  ],
  deployedModelId: <redacted>,
  model: <redacted>,
  modelDisplayName: 'llama3-70b-001',
  modelVersionId: '1'
}

 

I have a few questions:

  • Why is the response cut off? I am passing a large max tokens value.
  • Why is the response seemingly unrelated to the question?
  • Why does the response repeat the prompt?
  • Why do I get a wildly different response every time, even though I pass temperature 0?

I have tried with llama-3-70b-chat-001 as well, with similar results. The documentation on how to pass parameters to specific models is lacking, or at least I couldn't find it.

Thanks!

 

 


Can you try using special tokens?

 

<|start_header_id|>system<|end_header_id|>
{System}
<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{User}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

 

 See this for reference. (URL Removed by Staff)

That didn't work:

const userPrompt = 'Give me 10 tips for making a great CV';
const systemPrompt = 'You are a helpful assistant.';
const response = await fetch(`https://us-west4-aiplatform.googleapis.com/v1/projects/${project}/locations/us-west4/endpoints/596894076394012672:predict`, {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    instances: [
      {
        prompt: `<|start_header_id|>system<|end_header_id|>
            ${systemPrompt}
            <|eot_id|>
            <|start_header_id|>user<|end_header_id|>
            ${userPrompt}
            <|eot_id|><|start_header_id|>assistant<|end_header_id|>`,
      },
    ],
    parameters: {
      // max_output_tokens: maxTokens,
      temperature,
    },
  }),
  cache: 'no-store',
});

Returns:

{
  predictions: [
    'Prompt:\n' +
      '<|start_header_id|>system<|end_header_id|>\n' +
      '            \n' +
      '            <|eot_id|>\n' +
      '            <|start_header_id|>user<|end_header_id|>\n' +
      '            \n' +
      '            <|eot_id|><|start_header_id|>assistant<|end_header_id|>\n' +
      'Output:\n' +
      ' that are easy to aver fit absorbance and have an amount of heat, presents'
  ],
  ...
}

Did string interpolation happen? It looks like userPrompt and systemPrompt are missing from the echoed prompt.
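
Also, if the stray whitespace is part of the problem, building the prompt by plain concatenation instead of an indented template literal keeps those spaces out of the string. A quick sketch mirroring the template above (buildLlama3Prompt is just a made-up helper name, untested):

const buildLlama3Prompt = (system, user) =>
  '<|start_header_id|>system<|end_header_id|>\n' +
  `${system}\n` +
  '<|eot_id|>\n' +
  '<|start_header_id|>user<|end_header_id|>\n' +
  `${user}\n` +
  '<|eot_id|><|start_header_id|>assistant<|end_header_id|>';

// then in the request body: prompt: buildLlama3Prompt(systemPrompt, userPrompt)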

I am interpolating userPrompt and systemPrompt correctly when I construct the request. However, I am probably not passing the body/instances shape that Llama 3 on Vertex expects. Do you have a real working example of making a REST request to Llama 3 on Vertex?

I'm having the same issue. Can anyone share an example object that works? I deployed the llama3-8b-chat001 model on Vertex, but the answers I get are totally random. Please give us an example object where the system and user prompts are set properly.

I got it to work after finding this example: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/model_garden/....

The request body looks something like this, according to the example:

instances = [
    {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "raw_response": raw_response,
    },
]

Setting "raw_response" to true only give you the generated output.