This article was fact checked & last verified by Daniel Fazekas.
Using AI (Computer User Agent) models in automated testing for easier and quicker tests
One of Scriptide's staff software engineers, Botond Kovács, built a proof-of-concept (POC) TypeScript framework that combines OpenAI's Computer User Agent with traditional end-to-end testing frameworks like Playwright and Puppeteer. The POC makes automated software tests easier and quicker to write: it treats the application under test as a black box instead of relying on test IDs, so the original application needs no code changes.

Written by
Botond Kovács
Last updated
MAR 02, 2026
Topics
#dev
Length
6 min read

AI-based End-To-End Test Automation
Quick Demonstration
Meet The Model
We use OpenAI's computer-use-preview model, which also powers OpenAI's product called Operator and ChatGPT's new Agent Mode. This model is designed to interact with graphical user interfaces.
Writing Test Cases
- clickOn(description): instructs the tester to click on something on the screen, identified by the given description.
- fillIn(fields): instructs the tester to fill in the given fields, for example on a form.
- expectOnScreen(description): instructs the tester to verify that something is true about what is currently visible on the screen.
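To make the three functions above more concrete, here is a minimal sketch of what a test scenario written against them might look like. The `Tester` interface, the `RecordingTester` stand-in, and the shape of the `fields` argument (a plain object mapping field descriptions to values) are assumptions made for illustration; the actual POC may define them differently.

```typescript
// Hypothetical shape of the high-level API; names mirror the functions above,
// but the parameter types are assumptions.
interface Tester {
  clickOn(description: string): Promise<void>;
  fillIn(fields: Record<string, string>): Promise<void>;
  expectOnScreen(description: string): Promise<void>;
}

// Stand-in implementation that only records the requested steps.
// A real implementation would hand each step to the AI model.
class RecordingTester implements Tester {
  readonly steps: string[] = [];

  async clickOn(description: string): Promise<void> {
    this.steps.push(`click: ${description}`);
  }

  async fillIn(fields: Record<string, string>): Promise<void> {
    for (const [name, value] of Object.entries(fields)) {
      this.steps.push(`fill: ${name} = ${value}`);
    }
  }

  async expectOnScreen(description: string): Promise<void> {
    this.steps.push(`expect: ${description}`);
  }
}

// A sign-up scenario described purely in terms of what a user sees,
// with no test IDs or selectors.
async function signUpScenario(tester: Tester): Promise<void> {
  await tester.clickOn("The sign up button");
  await tester.fillIn({
    "Email address": "jane@example.com",
    "Password": "a valid password",
  });
  await tester.expectOnScreen("A message confirming that the account was created");
}
```

Because the scenario only depends on the `Tester` interface, the same test body could run against a recording stub like this one or against an AI-backed implementation.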
Let's look at how clickOn works under the hood.
Making The AI Click
This is done with the computer-use-preview model, by implementing a so-called Computer User Agent Loop:
- Take a screenshot of the current state of the application
- Send the screenshot to the model, along with any other messages
- The model decides to either:
  - Use the computer to perform some input action (e.g. clicking on coordinates, typing text, etc.)
  - Respond with a message, indicating that it has completed the task, or cannot continue on its own
- We perform the input actions requested by the model, and wait a little so all inputs are processed
- We repeat the loop until the model responds with a message
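The loop described above can be sketched roughly as follows. This is not the POC's code: `Model`, `Computer`, and `ModelTurn` are hypothetical stand-ins (they are neither OpenAI's nor Puppeteer's real API), and the `maxRounds` safeguard is an addition of this sketch so that a confused model cannot loop forever.

```typescript
// Hypothetical types: what the model can answer in one round.
type ModelTurn =
  | { kind: "action"; action: string } // e.g. a click or some typed text
  | { kind: "message"; text: string }; // task completed, or cannot continue

// Hypothetical wrapper around the model API.
interface Model {
  next(instruction: string, screenshot: string, history: ModelTurn[]): Promise<ModelTurn>;
}

// Hypothetical wrapper around the browser automation library.
interface Computer {
  screenshot(): Promise<string>;
  perform(action: string): Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runCuaLoop(
  instruction: string,
  model: Model,
  computer: Computer,
  settleMs = 500, // how long to wait so all inputs are processed
  maxRounds = 20, // assumption: safety limit, not part of the description above
): Promise<string> {
  const history: ModelTurn[] = [];
  for (let round = 0; round < maxRounds; round++) {
    // 1. Take a screenshot of the current state of the application.
    const screenshot = await computer.screenshot();
    // 2. Send it to the model, along with the earlier turns.
    const turn = await model.next(instruction, screenshot, history);
    history.push(turn);
    // 3a. A message ends the loop.
    if (turn.kind === "message") return turn.text;
    // 3b. Otherwise perform the requested input and let the UI settle.
    await computer.perform(turn.action);
    await sleep(settleMs);
  }
  throw new Error(`Model did not finish within ${maxRounds} rounds`);
}
```

Keeping the model and the browser behind small interfaces also makes the loop itself testable with fakes, without spending tokens.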
In the prompt for the clickOn action, {{targetDescription}} is replaced with the actual description of the element to click on, for example "The sign up button".
How It Works

Making It Useful
We use mocha as the test runner and puppeteer as the browser automation library. We built a small framework around these tools, which provides the high-level API we wanted. Test cases are written in .test.js files.
Translating To Other Languages
- We want to write tests once, but have them work in any language understood by the AI.
- We want to write tests in either our language or the language of the customer to communicate requirements of the system behavior precisely.
To support this, the framework has a generateDataField function. This function:
- Checks the language of the page, using the lang attribute of the HTML document
- Instructs the AI to generate data for a field from a given description. Regardless of the language of the description, the data must be generated in the language of the page.
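As an illustration of the two steps above, the instruction for the model could be assembled along these lines. `buildDataFieldPrompt` is a hypothetical helper, not the POC's actual generateDataField; the prompt wording and the "en" fallback are assumptions.

```typescript
// Hypothetical helper: builds the instruction sent to the model for one field.
// With Puppeteer, pageLang could be read with something like
//   await page.evaluate(() => document.documentElement.lang)
function buildDataFieldPrompt(fieldDescription: string, pageLang: string): string {
  // Assumption: fall back to English when the page declares no language.
  const lang = pageLang.trim() || "en";
  return [
    `Generate a realistic value for this form field: "${fieldDescription}".`,
    "The field description may be written in any language.",
    `Write the value in the language of the page (language tag: ${lang}).`,
    "Respond with the value only.",
  ].join("\n");
}
```

For example, `buildDataFieldPrompt("Full name", "ja")` asks for a Japanese value even though the description is in English, which is what lets one test body run against several localized versions of the same page.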
aliases.ja.ts and aliases.hu.ts files, which basically do this:
Limitations
- The computer-use-preview model sometimes struggles to understand non-English interfaces. We tested it extensively with Japanese interfaces, and while it often works, its unreliability creates false positives and false negatives that must be manually reviewed.
- The model does not support structured outputs yet, and sometimes fails to follow response format instructions. Deciding whether a step has passed or failed can be difficult. We mitigated this by using a second model to review CUA outputs, adding cost and complexity.
- The model works best for larger interface elements, but struggles with actions requiring pixel-perfect precision. It fails notably with small icon-only buttons in video games.
- Due to latency, certain real-world behavior cannot be reliably tested. For example, auto-disappearing toast messages are sometimes missed, as they vanish before the model can react.
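One way to picture the pass/fail problem from the list above: a tolerant parser accepts only unambiguous answers and defers everything else to a reviewer. The sketch below assumes the model was asked to answer with PASS or FAIL; `parseVerdict`, `decideStep`, and the `review` callback (standing in for the second model) are hypothetical names, not part of the POC.

```typescript
type Verdict = "pass" | "fail" | "unclear";

// Accept only responses that clearly say one thing; anything else is "unclear".
function parseVerdict(response: string): Verdict {
  const saysPass = /\bPASS(ED)?\b/i.test(response);
  const saysFail = /\bFAIL(ED)?\b/i.test(response);
  if (saysPass === saysFail) return "unclear"; // neither keyword, or both
  return saysPass ? "pass" : "fail";
}

// Only pay for the second opinion when the first answer is unusable.
async function decideStep(
  cuaResponse: string,
  review: (cuaResponse: string) => Promise<Verdict>,
): Promise<Verdict> {
  const verdict = parseVerdict(cuaResponse);
  return verdict === "unclear" ? review(cuaResponse) : verdict;
}
```

Keyword matching like this is deliberately conservative: it will not catch a confidently wrong answer, so it complements a reviewing model and does not replace one.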
Future Work
- One technique we are currently experimenting with is to try a non-CUA model that understands the DOM under test before invoking the CUA. If that model cannot perform the step (for example because test IDs are missing), we invoke the CUA as a fallback. This reduces the number of CUA invocations, and thus the cost and time of running the tests.
- We are also experimenting with multiple techniques to more reliably detect false positives and false negatives. Other AI models can be used to review the evidence collected during the CUA loop (screenshots, browser logs, actions, etc.), and decide whether the original conclusion of the CUA was correct or not.
- Another way of reducing the cost of the CUA loop is to implement better conversation state management. Realistically, we only need to keep the last few rounds, and can discard earlier messages. This would limit the context size required for any step, and thus reduce the token count.
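The conversation trimming idea from the last point can be sketched in a few lines. `trimHistory` is a hypothetical helper; it assumes the conversation consists of one initial instruction followed by self-contained rounds, and that the instruction must always be kept so the model still knows its task.

```typescript
// Keep the initial instruction plus only the most recent rounds.
// T is whatever a message or round looks like in the real conversation.
function trimHistory<T>(instruction: T, rounds: T[], keepLast: number): T[] {
  // slice(-0) would return the whole array, so zero is handled explicitly.
  const kept = keepLast > 0 ? rounds.slice(-keepLast) : [];
  return [instruction, ...kept];
}
```

With `keepLast` set to 2, a conversation of an instruction and four rounds shrinks to the instruction and the last two rounds. If the model API expects every action to be paired with its result, a round should be dropped only as a whole unit.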