RAGAS for dummies
Paper: https://arxiv.org/pdf/2309.15217.pdf
Acronym: Retrieval Augmented Generation Assessment
Definition: RAGAS is a framework for evaluating RAG systems.
If you’re not familiar with what a RAG system is, first read this post.
There are many reasons to evaluate:
- Accuracy and Performance Measurement
- Model Improvement
- Validation and Verification
- Comparison with Other Models
- Ensuring Reliability and Robustness
- Avoiding Overfitting
- Compliance with Standards
- Building Trust
- Resource Allocation
- Ethical Considerations
The best way I can sum it up is with Andrew Ng's old-but-gold phrase: "garbage in, garbage out".
Put into context, you won’t be able to improve unless you’re able to measure.
“Thank you, but why can’t I simply use accuracy?”
Accuracy, precision, recall, and F1 may be enough for classical ML models, but with the paradigm shift brought on by LLMs and their remarkable fluency and reasoning capabilities, they are no longer adequate.
RAGAS uses the following information:
- Query
- The query's additional context (the retrieved passages)
- Generated answer
- Ground truth (a labeled dataset with validated questions and answers)
RAGAS uses the above four inputs to give us actionable metrics.
For now we will focus only on the ones which do not require ground truth.
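To make that concrete, here is what a single evaluation record might look like. The field names below are illustrative, not necessarily the exact schema RAGAS expects (check its documentation for that).

```python
# One evaluation record with the four inputs RAGAS works with.
# Ground truth is optional for the metrics covered below.
record = {
    "question": "What time is it?",
    "contexts": ["The station clock currently shows 15:00."],
    "answer": "The time is 15:00 and the sky is blue.",
    "ground_truth": "It is 15:00.",
}
```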
Metrics
Context Relevancy
How:
- Extracts the sentences from the context that are required to answer the question.
- Computes the score = # of extracted relevant sentences / total # of sentences in the context.
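A toy sketch of that computation, not the actual RAGAS implementation: the sentence extraction step is really an LLM prompt, represented here as an injected callable, and the sentence splitter is deliberately naive.

```python
from typing import Callable, List

def split_sentences(text: str) -> List[str]:
    # Deliberately naive splitter, just for the sketch.
    return [s.strip() for s in text.split(".") if s.strip()]

def context_relevancy(
    question: str,
    context: str,
    extract_relevant_sentences: Callable[[str, str], List[str]],  # stands in for the LLM prompt
) -> float:
    # Score = sentences from the context needed to answer the question / all sentences in the context.
    all_sentences = split_sentences(context)
    relevant = extract_relevant_sentences(question, context)
    return len(relevant) / max(len(all_sentences), 1)
```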
Answer relevancy
Does the answer directly address the question, or does it contain redundant information?
Example:
Question: what time is it?
Context: …
Answer: the time is 15:00 and the sky is blue.
“The sky is blue” is redundant information.
How:
- Generate possible questions for the given answer.
- Compute the similarity between the generated questions and the original question.
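A minimal sketch of that idea, assuming question generation is done by an LLM and similarity is cosine similarity over embeddings; both are passed in as callables, and this is not the actual RAGAS code.

```python
from typing import Callable, List
import numpy as np

def answer_relevancy(
    question: str,
    answer: str,
    generate_questions: Callable[[str], List[str]],  # LLM: answer -> candidate questions
    embed: Callable[[str], np.ndarray],              # any sentence-embedding model
) -> float:
    # Mean cosine similarity between the original question and the questions
    # that could plausibly have produced this answer.
    q_vec = embed(question)
    sims = []
    for generated in generate_questions(answer):
        g_vec = embed(generated)
        cos = float(np.dot(q_vec, g_vec) / (np.linalg.norm(q_vec) * np.linalg.norm(g_vec)))
        sims.append(cos)
    return sum(sims) / len(sims) if sims else 0.0
```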
Answer faithfulness
How factually accurate the answer is. In other words, did the model hallucinate?
How:
- Extract statements from the answer.
- Verify each statement against the context.
- Compute the score = # of verified statements / # of extracted statements.
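Sketched in the same toy style, with statement extraction and statement verification each standing in for an LLM call:

```python
from typing import Callable, List

def faithfulness(
    answer: str,
    context: str,
    extract_statements: Callable[[str], List[str]],  # LLM: answer -> atomic statements
    is_supported: Callable[[str, str], bool],        # LLM: (statement, context) -> supported?
) -> float:
    # Score = statements supported by the context / statements extracted from the answer.
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(1 for s in statements if is_supported(s, context))
    return supported / len(statements)
```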
All of the above metrics run on a scale of 0 to 1, where higher is better.
Aspect critique
Evaluated based on the answer alone. Returns a binary output, 0 or 1, depending on whether the answer aligns with the given aspect. Unlike the metrics above, it is not used to compute an overall RAGAS score, but it is nonetheless very interesting.
Premade
SUPPORTED_ASPECTS = [ harmfulness, maliciousness, coherence, correctness, conciseness, ]
Custom
Define your own aspect such as:
- Language Appropriateness
- Educational Value
- Age-Appropriate Content
- Positive Messaging
- Cultural Sensitivity
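A custom aspect can be as simple as a yes/no prompt to a judge LLM. The sketch below is a hypothetical illustration of the idea, not the RAGAS API for defining critiques; `ask_llm` is an assumed callable.

```python
from typing import Callable

def critique(answer: str, criterion: str, ask_llm: Callable[[str], str]) -> int:
    # Ask a judge LLM a single yes/no question about the answer; 1 = aligned, 0 = not.
    prompt = (
        f"Answer:\n{answer}\n\n"
        f"Criterion: {criterion}\n"
        "Does the answer satisfy the criterion? Reply with 'yes' or 'no' only."
    )
    verdict = ask_llm(prompt).strip().lower()
    return 1 if verdict.startswith("yes") else 0

# e.g. critique(answer, "The content is appropriate for children under 12.", ask_llm)
```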
Conclusion
Evaluation is normally the first step we take with customers after some very basic prompt engineering. Let’s assume you belong to the 90% of the market that isn’t fine-tuning or, god forbid, training your own foundation model (if you are, by all means Scrooge McDuck away).
Check out the RAGAS documentation.
APPENDIX
Context Recall
This part uses the ground truth - an annotated dataset with questions and answers.
How:
- Every sentence in the ground-truth answer is analyzed to check whether it can be attributed to a sentence in the retrieved context.
- Compute the score = # of attributed sentences / total # of sentences in the ground-truth answer.
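In the same toy style as the other metrics, with the attribution check standing in for an LLM call:

```python
from typing import Callable

def context_recall(
    ground_truth_answer: str,
    context: str,
    is_attributable: Callable[[str, str], bool],  # LLM: (sentence, context) -> attributable?
) -> float:
    # Score = ground-truth sentences attributable to the context / all ground-truth sentences.
    sentences = [s.strip() for s in ground_truth_answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    attributed = sum(1 for s in sentences if is_attributable(s, context))
    return attributed / len(sentences)
```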