Evaluates model ability to build complete React applications with authentication, databases, and backend functionality from real-world user prompts. In addition to frontend agent behavior, the evaluation captures use of backend tools such as write_file, edit_file, batch_create_file, read_file, web_search, web_fetch, grep_search, deploy, bash, video_fetch, image_generation, image_fetch, propose_plan, supabase_create_table, supabase_insert, supabase_query, supabase_update, supabase_delete, supabase_auth_create_user, supabase_create_bucket, and supabase_create_rls_policy.
Evaluates model ability across a composite of Website, UI Component, Game Development, Data Visualization, and 3D tasks, each produced as a single-file HTML output. Results aggregate real-world user preferences across these categories to provide an overall view of coding performance.
Evaluates model ability to build multi-file games through an agentic coding workflow. Real-world users compare the final playable outputs, while the evaluation captures agent traces, tool calls, user re-prompts, failures, and retries.
Evaluates model ability to build native Android applications in Kotlin from real-world user prompts. Applications are run in an Android emulator to faithfully represent on-device behavior, and real-world users compare the resulting experiences while agent traces, tool calls, and user re-prompts are captured.
Evaluates model ability to build functional cross-platform mobile applications using React Native from real-world user prompts. Real-world users compare the rendered applications, while the evaluation captures model code outputs, agent traces, tool calls, and user re-prompts.