A list of works on evaluation of visual generation models, including evaluation metrics, models, and systems
Updated Jun 12, 2024
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
📰 Must-read papers and blogs on LLM-based Long Context Modeling 🔥
Python SDK for running evaluations on LLM-generated responses
Test your prompts, agents, and RAGs. Use LLM evals to improve your app's quality and catch problems. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
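The entry above describes a tool driven by declarative configs. As a rough illustration of that idea, here is a minimal sketch of such a config; the exact field names and provider identifiers are assumptions and may differ across tool versions:

```yaml
# Hypothetical declarative eval config: prompts, providers, and test cases
# with assertions. Field names are illustrative, not authoritative.
prompts:
  - "Summarize the following text in one sentence: {{text}}"

providers:
  - openai:gpt-4o-mini   # provider identifier is an assumption

tests:
  - vars:
      text: "The quick brown fox jumps over the lazy dog."
    assert:
      - type: contains
        value: "fox"
```

A config like this lets the same test suite run against multiple providers from the command line or in CI/CD, which is the comparison workflow the description refers to.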
Programming Language Selector based on language metadata and user-specified values.
The official evaluation suite and dynamic data release for MixEval.
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs, aiming to explore the technical boundaries of generative AI.
☁️ 🚀 📊 📈 Evaluating state of the art in AI
🪢 Open source LLM engineering platform: Observability, metrics, evals, prompt management, playground, datasets. Integrates with LlamaIndex, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23
LangSmith Client SDK Implementations
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
Documentation for LangSmith
A fairly robust mathematics parsing engine for C++ projects.
A task generation and model evaluation system.
Python client for Kolena's machine learning testing platform
The RAG Experiment Accelerator is a versatile tool that streamlines experiments and evaluations using Azure Cognitive Search and the RAG pattern.
CyclOps for clinical ML evaluation & monitoring workshop