Generative AI (Gen AI) applications are becoming increasingly popular across a wide range of domains and fields, and with new models, tools, and frameworks, you can quickly build prototypes to demonstrate functionality.
Moving these to production can be challenging, however, and these challenges can range from keeping models up-to-date and scaling to supporting responsible, ethical AI. This blog post discusses some of the considerations and recommendations of transitioning Gen AI applications to production.
Prompt engineering is the process of designing prompts that can effectively guide the output of large language models (LLMs). LLMs are powerful AI systems that can generate text, translate languages, and answer questions in a comprehensive and informative way. However, their output is highly dependent on the quality of the prompts they are given. Prompt engineering is therefore important for Gen AI applications because it allows developers to tune the behavior of LLMs without having to modify the models themselves. Application developers and/or prompt engineering should understand the best practices for prompt engineering. In addition, Retrieval Augmented Generation (RAG) has become a prevalent mechanism to extend prompt engineering to the next level. An example of this can be reviewed here.
LLM tuning is the process of adjusting the parameters of an LLM to improve its performance on a specific task. This can be done by tuning the model on a dataset of examples, or by using techniques such as prompt engineering.
LLM tuning is important for Gen AI applications because it can be used to improve the performance of LLMs on tasks that they are not already well-suited for. For example, an LLM that has been fine-tuned on a dataset of medical articles may be able to generate more accurate and informative answers to medical questions. Similar to RAG, described above, ReACT is a pattern to augment LLM limitations of being frozen in time and being unable to query or modify external data. Google Vertex AI is providing ReACT support through an extension framework.
In general, prompt engineering is a more flexible and efficient approach than LLM tuning. However, LLM tuning may be necessary for tasks that require very high levels of performance. Google Vertex AI simplifies the overall tasks involved with LLM tuning.
During the Gen AI lifecycle, starting with use case identification, prompt engineering, and/or LLM tuning, it is critical to continuously evaluate the accuracy of responses being generated. Such evaluation can be conducted using a variety of techniques and methodologies, including:
By using a combination of the techniques and methodologies described above, researchers and developers can gain a comprehensive understanding of the accuracy of LLMs on a variety of tasks. This information can then be used to improve the performance of LLMs and to develop new and innovative applications for these powerful language models.
Scalability is a key attribute of the Gen AI application lifecycle. From model tuning to model serving, scalable infrastructure ensures that infrastructure bottlenecks will not impede user experience with the application. This level of immediate rescaling is not feasible with traditional, on-site servers, which is why developers rely so heavily on the automated scalability of cloud servers when building Gen AI applications. Some fundamental scalability aspects include:
There are unique challenges to implementing observability for Gen AI apps, whether for accuracy, safety, bias, or other aspects where continuous observability is essential to ensure desired application behavior. There are solutions starting to emerge to aid with observability needs of Gen AI applications. The following table provides examples of some metrics that you should consider when designing an observability solution for Gen AI.
Performance |
Accuracy: The regularity which the model generates correct or desired outputs BLEU score: The quality of machine-translated text ROUGE score: The quality of automatically generated summaries Perplexity: The frequency that the language model predicts the next word in a sequence Latency: The time it takes for the model to generate an output Throughput: The number of outputs the model can generate per second |
Usage |
CPU utilization: The percentage of CPU used by the model Memory utilization: The percentage of memory used by the model GPU utilization: The percentage of GPU used by the model |
Model |
Feature drift: How much the distribution of feature values in production has changed from the distribution of feature values in the training data Model drift: How much the performance of the model has changed over time Explainability: How well the model can explain its outputs Fairness: How fair the model is to different groups of users |
Other |
Traffic: The number of requests received by the model Availability: The percentage of time that the model is available to serve requests Cost: The cost of running the model |
In addition, traditional operational best practices of MLOps such as IaC-based automations, redundancy, elasticity, testing, artifact, and source management should all extend to Gen AI applications to include LLM-related infrastructure, data, and security requirements.
Finally, you also especially need to extend change management to custom, fine-tuned LLM. This change management ensures tracking of changes across foundational and tuned LLM lifecycles.
While Gen AI offers significant potential, to make apps production ready, you have to prioritize security and compliance while ensuring the application follows responsible AI best practices. This includes familiarizing yourself with potential threats, and having a plan to mitigate them. Some of such mitigation techniques include:
In addition to these mitigation techniques, there are potential attack vectors to consider. Gen AI models can be vulnerable to adversarial attacks, so it’s important to implement security measures to protect your model from attack techniques such as prompt injections, model inversion, and backdoors.
Likewise, you should prepare a plan to implement responsible AI. Improperly trained AI models can cause direct or indirect harm to society by introducing bias, particularly against protected groups or certain sections of society. To mitigate this risk, it is important to protect against bias by leveraging diverse datasets, running adversarial training to identify and remove bias from the model, and implementing bias and safety filters to reduce the risk of the app generating harmful or offensive content. Use features like Grounding to establish lineage to facts.
In conclusion, taking a Gen AI application to production is a complex task, but it is essential for realizing the full potential of Gen AI. By considering the approaches discussed above across various phases you will be better equipped to develop cutting-edge Gen AI applications.
Want to learn more? Check out the full Gen AI Bootcamp workshop on-demand now!
Have questions? Please leave a comment below.
Authored by: