Medical AI benchmark toolkit

From clinical question to validated benchmark

ARISEKit is a toolkit that helps clinicians, researchers, and hospital teams design, plan, and build rigorous medical AI tools — without starting from a blank file. A clinician with an idea (“can ChatGPT tell me which of my patients need ACL repair?”) and a de-identified dataset should be able to finish with a runnable, MAST-compatible benchmark, plus the human-rater plan needed to validate it.


What ARISE Kit covers

Three pieces of the stack, one workflow

The model supplies the reasoning, the FHIR MCP layer gives the agent eyes and hands on patient data, and ARISEKit evaluates the output.

01. Design

Author the agent

Start from a clinical or admin archetype and write the agent Skill that defines how your agent behaves, all without leaving the browser.

02. Tools

Connect to FHIR data

Wire MCP-style tools so your agent can read and act on patient and operational data. The catalog is pre-built for workshop speed; the patterns transfer to production.
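As an illustration of the tool pattern, here is a minimal sketch of a read-only FHIR tool. It assumes a FHIR R4 Bundle already loaded as JSON in memory rather than a live FHIR server. The tool name get_active_conditions and its descriptor are hypothetical and not taken from the ARISE Kit catalog; the descriptor only mirrors the general name / description / inputSchema shape that MCP tool definitions use.

```python
from typing import Any

# Hypothetical tool descriptor in the general shape of an MCP tool definition:
# a name, a description the model reads, and a JSON Schema for the arguments.
TOOL_SPEC = {
    "name": "get_active_conditions",
    "description": "List active Condition resources for one patient.",
    "inputSchema": {
        "type": "object",
        "properties": {"patient_id": {"type": "string"}},
        "required": ["patient_id"],
    },
}


def get_active_conditions(bundle: dict[str, Any], patient_id: str) -> list[str]:
    """Return labels of active Conditions whose subject is Patient/<patient_id>.

    Assumption: `bundle` is a FHIR R4 Bundle already parsed from JSON; a
    production tool would query a FHIR server instead.
    """
    subject_ref = f"Patient/{patient_id}"
    results: list[str] = []
    for entry in bundle.get("entry", []):
        res = entry.get("resource", {})
        if res.get("resourceType") != "Condition":
            continue
        if res.get("subject", {}).get("reference") != subject_ref:
            continue
        # clinicalStatus is a CodeableConcept; keep only "active" conditions.
        status_codes = {
            c.get("code") for c in res.get("clinicalStatus", {}).get("coding", [])
        }
        if "active" not in status_codes:
            continue
        code = res.get("code", {})
        # Prefer the free-text label, else the first coding that has a display.
        label = code.get("text") or next(
            (c["display"] for c in code.get("coding", []) if c.get("display")),
            "unknown condition",
        )
        results.append(label)
    return results


CLINICAL = "http://terminology.hl7.org/CodeSystem/condition-clinical"
example_bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Condition",
                      "subject": {"reference": "Patient/p1"},
                      "clinicalStatus": {"coding": [{"system": CLINICAL, "code": "active"}]},
                      "code": {"text": "Anterior cruciate ligament tear"}}},
        {"resource": {"resourceType": "Condition",
                      "subject": {"reference": "Patient/p1"},
                      "clinicalStatus": {"coding": [{"system": CLINICAL, "code": "resolved"}]},
                      "code": {"text": "Ankle sprain"}}},
    ],
}

print(get_active_conditions(example_bundle, "p1"))  # ['Anterior cruciate ligament tear']
```

Keeping the tool narrow (one patient, one resource type, read-only) makes its behavior easy to test before the agent is ever allowed to act on the data.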

03. Evaluation

Score before you ship

Turn rubrics, test cases, and a judge model into a quantitative eval. Know whether your agent is actually good, not just impressive in a demo.
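One way to picture this step is a weighted rubric: each criterion has a weight, a judge decides whether an output meets it, and the per-case scores are averaged. The sketch below assumes pass/fail judgments and uses a keyword-matching stand-in (keyword_judge) where a real setup would call a judge model; the criteria, weights, and cases are invented for illustration and are not ARISE Kit's rubric format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class Criterion:
    name: str
    description: str
    weight: float


@dataclass
class EvalCase:
    prompt: str
    agent_output: str


# A judge takes (criterion, case) and returns True if the output meets it.
Judge = Callable[[Criterion, EvalCase], bool]


def score_case(case: EvalCase, rubric: list[Criterion], judge: Judge) -> float:
    """Weighted fraction of rubric criteria the judge marks as met (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if judge(c, case))
    return earned / total


def run_eval(cases: list[EvalCase], rubric: list[Criterion], judge: Judge) -> float:
    """Mean rubric score across all test cases."""
    return mean(score_case(case, rubric, judge) for case in cases)


# Hypothetical stand-in for a judge-model call: a criterion passes if its
# required term appears in the agent output.
REQUIRED_TERMS = {"cites_imaging": "MRI", "recommends_referral": "orthopedic"}


def keyword_judge(criterion: Criterion, case: EvalCase) -> bool:
    return REQUIRED_TERMS[criterion.name].lower() in case.agent_output.lower()


rubric = [
    Criterion("cites_imaging", "Grounds the answer in imaging findings", 1.0),
    Criterion("recommends_referral", "Recommends specialist referral", 2.0),
]
cases = [
    EvalCase("Knee pain after pivot", "MRI confirms a complete ACL tear; refer to orthopedic surgery."),
    EvalCase("Knee pain after pivot", "Likely ACL injury; rest and ice."),
    EvalCase("Knee pain after pivot", "Recommend orthopedic follow-up."),
]

print(round(run_eval(cases, rubric, keyword_judge), 3))  # 0.556
```

Swapping keyword_judge for a function that prompts a judge model leaves score_case and run_eval unchanged, which keeps the scoring arithmetic separate from the judging itself.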

Hands-on workshop

Ready to build something?

The ARISE Kit workshop walks you through the full loop in 90–120 minutes: design, tooling, eval, and take-home artifacts. No local setup required.
