RAGAS for dummies
Paper: https://arxiv.org/pdf/2309.15217.pdf
Acronym: Retrieval Augmented Generation Assessment
Definition: RAGAS is a framework for evaluating RAG systems.
If you’re not familiar with what a RAG system is, first read this post.
There are many reasons to evaluate:
- Accuracy and Performance Measurement
- Model Improvement
- Validation and Verification
- Comparison with Other Models
- Ensuring Reliability and Robustness
- Avoiding Overfitting
- Compliance with Standards
- Building Trust
- Resource Allocation
- Ethical Considerations
The best way I can sum it up is with Andrew Ng's old-but-gold phrase: "garbage in, garbage out".
Put into context, you won’t be able to improve unless you’re able to measure.
“Thank you, but why can’t I simply use accuracy?”
Accuracy, precision, recall, and F1 may be enough for classical ML models, but with the paradigm shift brought on by LLMs and their remarkable fluency and reasoning capabilities, they are no longer adequate.
RAGAS uses the following information:
- Query
- The query's additional context (the retrieved passages)
- Generated answer
- Ground truth (a labeled dataset with validated questions and answers)
RAGAS uses the above four inputs to give us actionable metrics.
For now we will focus only on the ones which do not require ground truth.
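To make that concrete, here is what a single evaluation record might look like. The field names below are illustrative, not necessarily the exact schema RAGAS expects (check its documentation for that).

```python
# One evaluation record with the four inputs RAGAS works with.
# Ground truth is optional for the metrics covered below.
record = {
    "question": "What time is it?",
    "contexts": ["The station clock currently shows 15:00."],
    "answer": "The time is 15:00 and the sky is blue.",
    "ground_truth": "It is 15:00.",
}
```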
Metrics
Context Relevancy
How:
- Extracts the sentences from the context that are required to answer the question.
- Computes the score = # of extracted relevant sentences / total # of sentences in the context.
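A toy sketch of that computation, not the actual RAGAS implementation: the sentence extraction step is really an LLM prompt, represented here as an injected callable, and the sentence splitter is deliberately naive.

```python
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    # Deliberately naive splitter, just for the sketch.
    return [s.strip() for s in text.split(".") if s.strip()]

def context_relevancy(
    question: str,
    context: str,
    extract_relevant_sentences: Callable[[str, str], List[str]],  # stands in for the LLM prompt
) -> float:
    # Score = sentences from the context needed to answer the question / all sentences in the context.
    all_sentences = split_sentences(context)
    relevant = extract_relevant_sentences(question, context)
    return len(relevant) / max(len(all_sentences), 1)
```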
Answer relevancy
Does the answer directly address the question, or does it contain redundant information?
Example:
Question: what time is it?
Context: …
Answer: the time is 15:00 and the sky is blue.
“The sky is blue” is redundant information.
How:
- Generate possible questions for the given answer.
- Compute the similarity between the generated questions and the original question.
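A minimal sketch of that idea, assuming question generation is done by an LLM and similarity is cosine similarity over embeddings; both are passed in as callables, and this is not the actual RAGAS code.

```python
from typing import Callable, List
import numpy as np

def answer_relevancy(
    question: str,
    answer: str,
    generate_questions: Callable[[str], List[str]],  # LLM: answer -> candidate questions
    embed: Callable[[str], np.ndarray],              # any sentence-embedding model
) -> float:
    # Mean cosine similarity between the original question and the questions
    # that could plausibly have produced this answer.
    q_vec = embed(question)
    sims = []
    for generated in generate_questions(answer):
        g_vec = embed(generated)
        cos = float(np.dot(q_vec, g_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec)))
        sims.append(cos)
    return sum(sims) / len(sims) if sims else 0.0
```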
Answer faithfulness
How factually accurate the answer is. In other words, did the model hallucinate?
How:
- Extract statements from the answer.
- Verify each statement against the context.
- Compute the score = # of verified statements / # of extracted statements.
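Sketched in the same toy style, with statement extraction and statement verification each standing in for an LLM call:

```python
from typing import Callable, List

def faithfulness(
    answer: str,
    context: str,
    extract_statements: Callable[[str], List[str]],  # LLM: answer -> atomic statements
    is_supported: Callable[[str, str], bool],        # LLM: (statement, context) -> supported?
) -> float:
    # Score = statements supported by the context / statements extracted from the answer.
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)
```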
All of the above metrics run on a scale of 0 to 1, where higher is better.
Aspect critique
Evaluated based on the answer alone. Returns a binary output, 0 or 1, depending on whether the answer aligns with the given aspect. Unlike the metrics above, it is not used to compute an overall RAGAS score, but it is nonetheless very interesting.
Premade
SUPPORTED_ASPECTS = [ harmfulness, maliciousness, coherence, correctness, conciseness, ]
Custom
Define your own aspect such as:
- Language Appropriateness
- Educational Value
- Age-Appropriate Content
- Positive Messaging
- Cultural Sensitivity
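A custom aspect can be as simple as a yes/no prompt to a judge LLM. The sketch below is a hypothetical illustration of the idea, not the RAGAS API for defining critiques; `ask_llm` is an assumed callable.

```python
from typing import Callable

def critique(answer: str, criterion: str, ask_llm: Callable[[str], str]) -> int:
    # Ask a judge LLM a single yes/no question about the answer; 1 = aligned, 0 = not.
    prompt = (
        f"Answer:\n{answer}\n\n"
        f"Criterion: {criterion}\n"
        "Does the answer satisfy the criterion? Reply with 'yes' or 'no' only."
    )
    verdict = ask_llm(prompt).strip().lower()
    return 1 if verdict.startswith("yes") else 0

# e.g. critique(answer, "The content is appropriate for children under 12.", ask_llm)
```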
Conclusion
Evaluation is normally the first step we take with customers after some very basic prompt engineering. Let’s assume you belong to the 90% of the market that isn’t fine-tuning or, god forbid, training your own foundation model (if you are, by all means Scrooge McDuck away).
Check out the RAGAS documentation.
APPENDIX
Context Recall
This part uses the ground truth - an annotated dataset with questions and answers.
How:
- Every sentence in the ground-truth answer is analyzed to check whether it can be attributed to a sentence in the retrieved context.
- Compute the score = # of attributed sentences / total # of sentences in the ground-truth answer.
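In the same toy style as the other metrics, with the attribution check standing in for an LLM call:

```python
from typing import Callable

def context_recall(
    ground_truth_answer: str,
    context: str,
    is_attributable: Callable[[str, str], bool],  # LLM: (sentence, context) -> attributable?
) -> float:
    # Score = ground-truth sentences attributable to the context / all ground-truth sentences.
    sentences = [s.strip() for s in ground_truth_answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    attributed = sum(1 for s in sentences if is_attributable(s, context))
    return attributed / len(sentences)
```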