Taking Gen AI apps from prototype to production

afida · 11-20-2023 05:43 PM

Generative AI (Gen AI) applications are becoming increasingly popular across a wide range of domains and fields, and with new models, tools, and frameworks, you can quickly build prototypes to demonstrate functionality.

Moving these to production can be challenging, however, and these challenges can range from keeping models up-to-date and scaling to supporting responsible, ethical AI. This blog post discusses some of the considerations and recommendations of transitioning Gen AI applications to production.

Prompt Engineering and LLM Tuning

Prompt engineering is the process of designing prompts that can effectively guide the output of large language models (LLMs). LLMs are powerful AI systems that can generate text, translate languages, and answer questions in a comprehensive and informative way. However, their output is highly dependent on the quality of the prompts they are given. Prompt engineering is therefore important for Gen AI applications because it allows developers to tune the behavior of LLMs without having to modify the models themselves. Application developers and/or prompt engineering should understand the best practices for prompt engineering. In addition, Retrieval Augmented Generation (RAG) has become a prevalent mechanism to extend prompt engineering to the next level. An example of this can be reviewed here.

LLM tuning is the process of adjusting the parameters of an LLM to improve its performance on a specific task. This can be done by tuning the model on a dataset of examples, or by using techniques such as prompt engineering.

LLM tuning is important for Gen AI applications because it can be used to improve the performance of LLMs on tasks that they are not already well-suited for. For example, an LLM that has been fine-tuned on a dataset of medical articles may be able to generate more accurate and informative answers to medical questions. Similar to RAG, described above, ReACT is a pattern to augment LLM limitations of being frozen in time and being unable to query or modify external data. Google Vertex AI is providing ReACT support through an extension framework.

In general, prompt engineering is a more flexible and efficient approach than LLM tuning. However, LLM tuning may be necessary for tasks that require very high levels of performance. Google Vertex AI simplifies the overall tasks involved with LLM tuning.

Evaluation

During the Gen AI lifecycle, starting with use case identification, prompt engineering, and/or LLM tuning, it is critical to continuously evaluate the accuracy of responses being generated. Such evaluation can be conducted using a variety of techniques and methodologies, including:

Human evaluation: Assessing the quality of LLM outputs based on criteria such as accuracy, fluency, coherence, and relevance. This can be done by asking humans to rate the outputs on a scale, or by providing feedback on specific aspects of the output. This approach is sometimes referred to as Reinforcement Learning with Human Feedback (RLHF) and is natively supported by Google Cloud Vertex AI.
Automatic evaluation: Measuring the performance of LLMs on specific tasks. Some common metrics include BLEU score, perplexity , and F1 score . Gen AI platforms such as Google Vertex AI are starting to support out-of-the-box support for implementing such automated evaluations.
Adversarial evaluation: Crafting inputs that are designed to be difficult for LLMs to handle. This can be used to identify weaknesses in the model and to improve its robustness. Some examples of such inputs are through input prompts and include bias amplification, misleading instructions, false premises, ambiguous prompts, and emotional manipulation. .
Benchmarking: Evaluating LLMs on a set of standard tasks and comparing their performance to other models. This can be used to track the progress of LLM development and to identify the leading models in the field.

By using a combination of the techniques and methodologies described above, researchers and developers can gain a comprehensive understanding of the accuracy of LLMs on a variety of tasks. This information can then be used to improve the performance of LLMs and to develop new and innovative applications for these powerful language models.

Scalability

Scalability is a key attribute of the Gen AI application lifecycle. From model tuning to model serving, scalable infrastructure ensures that infrastructure bottlenecks will not impede user experience with the application. This level of immediate rescaling is not feasible with traditional, on-site servers, which is why developers rely so heavily on the automated scalability of cloud servers when building Gen AI applications. Some fundamental scalability aspects include:

Infrastructure: Large datasets and model fine tuning can require large compute, memory, storage, and networking. Therefore, it is important to ensure that you have infrastructure in place that can scale with your applications’ needs. Managed infrastructure like Google Cloud’s platform, GPUs, and TPUs offer cost-efficient solutions for scalability in an automated fashion.
Model serving: During application usage, model serving provides inference capabilities. A scalable serving design should address inference latency, security, and cost considerations. Having the automated up and down scaling provided by cloud providers can generally help address such requirements.

Observability, operations, and change management

There are unique challenges to implementing observability for Gen AI apps, whether for accuracy, safety, bias, or other aspects where continuous observability is essential to ensure desired application behavior. There are solutions starting to emerge to aid with observability needs of Gen AI applications. The following table provides examples of some metrics that you should consider when designing an observability solution for Gen AI.

Performance	Accuracy: The regularity which the model generates correct or desired outputs BLEU score: The quality of machine-translated text ROUGE score: The quality of automatically generated summaries Perplexity: The frequency that the language model predicts the next word in a sequence Latency: The time it takes for the model to generate an output Throughput: The number of outputs the model can generate per second
Usage	CPU utilization: The percentage of CPU used by the model Memory utilization: The percentage of memory used by the model GPU utilization: The percentage of GPU used by the model
Model	Feature drift: How much the distribution of feature values in production has changed from the distribution of feature values in the training data Model drift: How much the performance of the model has changed over time Explainability: How well the model can explain its outputs Fairness: How fair the model is to different groups of users
Other	Traffic: The number of requests received by the model Availability: The percentage of time that the model is available to serve requests Cost: The cost of running the model

In addition, traditional operational best practices of MLOps such as IaC-based automations, redundancy, elasticity, testing, artifact, and source management should all extend to Gen AI applications to include LLM-related infrastructure, data, and security requirements.

Finally, you also especially need to extend change management to custom, fine-tuned LLM. This change management ensures tracking of changes across foundational and tuned LLM lifecycles.

Security and social responsibility

While Gen AI offers significant potential, to make apps production ready, you have to prioritize security and compliance while ensuring the application follows responsible AI best practices. This includes familiarizing yourself with potential threats, and having a plan to mitigate them. Some of such mitigation techniques include:

Data anonymization: This involves removing or altering personally identifiable information (PII) from the data that is used to train and fine-tune the LLM
Differential privacy: This mathematical technique adds noise to the data that is used to train and fine-tune the LLM, making it more difficult to identify individuals from the data
Secure multi-party computation (SMPC): This technique allows multiple parties to train an LLM on their combined data without revealing their individual data to each other
Homomorphic encryption: This encryption allows computations to be performed on encrypted data without requiring decryption. This can be used to protect the privacy of the data that is used to train and fine-tune the LLM, as well as the privacy of the LLM's responses

In addition to these mitigation techniques, there are potential attack vectors to consider. Gen AI models can be vulnerable to adversarial attacks, so it’s important to implement security measures to protect your model from attack techniques such as prompt injections, model inversion, and backdoors.

Likewise, you should prepare a plan to implement responsible AI. Improperly trained AI models can cause direct or indirect harm to society by introducing bias, particularly against protected groups or certain sections of society. To mitigate this risk, it is important to protect against bias by leveraging diverse datasets, running adversarial training to identify and remove bias from the model, and implementing bias and safety filters to reduce the risk of the app generating harmful or offensive content. Use features like Grounding to establish lineage to facts.

In conclusion, taking a Gen AI application to production is a complex task, but it is essential for realizing the full potential of Gen AI. By considering the approaches discussed above across various phases you will be better equipped to develop cutting-edge Gen AI applications.

Want to learn more? Check out the full Gen AI Bootcamp workshop on-demand now!

Have questions? Please leave a comment below.

Authored by:

Adnan Fida, Technical Lead, Gen AI Specialist @afida
Maruthi Tumuluri, Gen AI Specialist @Venk