To what extent will AI fundamentally change the way we work? What about the change management process? Will AI augment or replace testers? How do we introduce an AI initiative and ensure we are taking the testing team along with us on the journey?

Although biology often inspires human innovation, it rarely leads to a direct implementation. Birds taught humans that flying is possible and inspired human creativity for centuries. But the design of today's planes and helicopters does not have much in common with their biological role models.

As humans learn and apply principles, we adapt them to our needs. Instead of creating mechanical legs for our vehicles that can climb over obstacles, we removed the obstacles and paved the way for wheeled transportation, which happens to be both faster and more efficient.

The same will be true for our AI efforts in testing: they will hardly be a faithful recreation of human testing efforts. To better understand where AI could be applied in the overall testing process, we need to break down the individual tasks and challenges of a tester.

Just as a motor is no direct replacement for a muscle, we need to understand the underlying motivation for each task and how it interplays with the overall testing goals, so that we can envision how the process could be improved and altered while the goals are still being served. So, in the following, we are talking about goals, not the actual tasks of human testers.

On a very coarse level, testing is applied in two situations:

  • Testing new software and functionality
  • Testing existing software and functionality

Testing new functionality

New functionality requires thoughtful testing. We must make sure the new functionality makes sense, adheres to UX design principles, is safe and secure, is performant and just generally works as intended. More formally, the ISO 25010 standard defines eight main characteristics of product quality, which we will address individually:

  • Functionality (Completeness, Correctness, Appropriateness)
  • Performance (Time behavior, Resource utilization, Capacity)
  • Compatibility (Co-existence, Interoperability)
  • Usability (Operability, Learnability, User error protection, User interface aesthetics, Accessibility, Recognizability)
  • Reliability (Maturity, Availability, Fault tolerance, Recoverability)
  • Security (Confidentiality, Integrity, Non-repudiation, Accountability, Authenticity)
  • Maintainability (Modularity, Reusability, Analysability, Modifiability, Testability)
  • Portability (Adaptability, Installability, Replaceability)

Asserting correct and complete functionality is basically an AI-complete problem, meaning that the AI needs to be at least as intelligent as a human to do it. For example, searching for a first and last name on a social network like Facebook should return all people with the specified name. Doing the same on a privacy-sensitive site like Ashley Madison would be a severe problem.

Whether any given functionality is correct or faulty generally lies in the eye of the beholder. This is called the oracle problem, because we would need an oracle of Delphi to tell us whether a certain displayed behavior is correct. That means that, for the foreseeable future, we cannot use AI to test for the correctness of software functionality.

Performance criteria, on the other hand, can usually be specified in a simple and very general manner: for example, a page should load in no more than 2 seconds, and after pressing a button, feedback should arrive within 500 milliseconds. So AI could test for performance, and indeed, products doing that are already available.
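To make this concrete, here is a minimal sketch in Java of how such a general performance criterion can be turned into an automated check. The loadPage helper and the URL are invented placeholders; any test harness or AI-driven executor could measure and enforce the budget in the same way.

    import java.time.Duration;
    import java.time.Instant;

    public class PerformanceCheck {

        // Invented helper: loads the page and blocks until rendering has finished.
        static void loadPage(String url) { /* e.g. delegate to a browser driver */ }

        public static void main(String[] args) {
            Instant start = Instant.now();
            loadPage("https://example.com/dashboard"); // placeholder URL
            Duration elapsed = Duration.between(start, Instant.now());

            // The human-specified, general criterion: the page must load within 2 seconds.
            if (elapsed.compareTo(Duration.ofSeconds(2)) > 0) {
                throw new AssertionError("Page took " + elapsed.toMillis() + " ms, budget is 2000 ms");
            }
        }
    }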

Compatibility can have many different meanings. Some widespread instances of compatibility testing, like cross-browser, cross-device or cross-OS testing, which focus mainly on design and functionality, can easily be automated. And again, we are already seeing products for that. Other compatibility issues are much more subtle, technically oriented or specific. Developing specialized AI for those cases is often prohibitively expensive.

Usability is still hard for current AI systems to analyze, although this may become a promising avenue in the future. Interestingly enough, improving the usability of a piece of software may also improve the ability of AI systems to understand and test it, which creates a further incentive to do so.

Even without AI, there already exists software that analyzes some aspects of the reliability of a software system, such as fault tolerance and recoverability. AI will only improve such analysis and yield better results. Other aspects like maturity and availability are more connected to the long-term usage and operation of such systems and are generally hard to test for—even for humans.

Also for security, there already exists software that tests for some aspects, using existing and well-known attack scenarios. Apart from such standard attacks, security in general is very hard to test for. Security analysts are usually highly paid professionals who are very well-versed in their field and ingeniously combine various specific aspects of the system to find new weaknesses and loopholes. If business functionality is hard to test with AI, security (apart from known attacks) is the royal discipline that will be tackled last.

Maintainability and portability are usually more internal aspects of the software system, very relevant to the development and operation of the system, but hardly tested for.

The ISO 25010 standard also defines 5 characteristics for quality in use:

  • Effectiveness
  • Efficiency
  • Satisfaction (Usefulness, Trust, Pleasure, Comfort)
  • Freedom from risk (Economic, health and safety and environmental risk mitigation)
  • Context coverage (Context completeness and Flexibility)

As is obvious, these characteristics all relate to the outcome of human interaction with the software. As such, they are highly personal and can hardly be quantified and tested for in a systematic manner.

It is also clear that, although the aforementioned characteristics are all important for a software product, they hardly account for the same amount of testing effort in the field. Numbers are hard to come by, but it seems clear that testing for correct and complete functionality makes up the lion's share of the effort. Unfortunately, this is also the aspect where, due to the oracle problem, we said that we could not employ AI to help us.

But not so fast: A huge part of testing for correct functionality of software is not done on new software, but on existing software. Maybe this could somehow remedy the problem?

Testing existing functionality

Software is very unlike many things we encounter in the non-digital world. If, for example, we repair the front light of a car, we do not need to test the horn afterwards. But because software has so many invisible and unknown interdependencies, a change to one part of a software system can have unforeseen and unintended side effects on basically any other part of the system.

Therefore, it is necessary to retest already tested and approved functionality, even if it was not changed, just to make sure that it did indeed not change. This form of testing is called regression testing, and it makes up a significant amount of the overall testing effort.

Now the very interesting aspect of regression testing is that it is about already tested and approved functionality, which means that instead of testing for correctness, we can focus on testing for changes. Following this train of thought, regression testing is not so much a form of testing as a specific form of change control. Developers already routinely use change control in the form of version control systems. The problem is that these systems only govern static artifacts, such as source code and configuration files.

Software as it is encountered by users and subjected to testing, however, is a dynamic thing, living in the memory of the computer. The program code and configuration are the starting point for creating this dynamic state of the software. But many more ingredients, such as the specifics of the underlying hardware and operating system, the input data and the user interaction, form that dynamic software state.

While the source code and configuration is analogous to the original blueprint of a building, the dynamic state is comparable to the actual building. The concrete characteristics of that building depend on many more aspects, like building materials, painting, furniture, decoration and houseplants, all of which are not part of the blueprint, yet all are completely relevant to the user experience of the building. The same is true for the dynamic state of the software.

To remedy the fact that the encountered essence of the software, the dynamic state, is not governed by the version control system, the current state of affairs is to create and maintain automated regression tests. These tests codify the dynamic state of the software and, as such, turn it into static artifacts, which can be governed by existing version control systems. The problem, however, is that most existing regression test systems are modeled after the very successful JUnit.

Part of this heritage is the checking mechanism. It consists of individual checks (called asserts), each of which checks a single fact at a time. These facts are considered to be hard (and unchanging) truths. As such, these tests are currently created and maintained manually, which takes a lot of effort, and they are not well geared towards detecting and accommodating changes.
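To illustrate this, here is a minimal JUnit 5 sketch; the Invoice class and its numbers are invented purely for illustration. Each assert pins down exactly one fact, and every intended change in behavior forces a human to edit these hard-coded truths by hand.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class InvoiceTest {

        // Minimal domain class, invented for this example only.
        record Invoice(double net, double vatRate) {
            double vatAmount() { return net * vatRate; }
            double total()     { return net + vatAmount(); }
        }

        @Test
        void totalIncludesVat() {
            Invoice invoice = new Invoice(100.00, 0.19);

            // Each assert checks one single fact and treats it as an unchanging truth.
            assertEquals(19.00, invoice.vatAmount(), 0.001);
            assertEquals(119.00, invoice.total(), 0.001);
        }
    }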

However, there are alternatives to this approach. These go by names like Golden Master testing, characterization testing or snapshot-based testing, and they are just now coming into fashion. Not only are these tests much easier to create, they are also easier to maintain, as detected changes can simply be applied to the underlying test if they are intended. Additionally, it has turned out that these tests remedy some of the other long-standing issues of regression testing.
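As a rough sketch of the underlying idea (not of any particular tool), a Golden Master check can be as simple as recording the output on the first run and comparing against it on every later run. The file layout and method names below are assumptions made for illustration; approving an intended change amounts to replacing the stored file.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class GoldenMasterCheck {

        // Compares the current output against a stored, approved snapshot.
        // If no snapshot exists yet, the current output becomes the Golden Master.
        static void verifyAgainstGoldenMaster(String testName, String currentOutput) throws IOException {
            Path golden = Path.of("golden", testName + ".txt");
            if (!Files.exists(golden)) {
                Files.createDirectories(golden.getParent());
                Files.writeString(golden, currentOutput); // record the approved behavior
                return;
            }
            String approved = Files.readString(golden);
            if (!approved.equals(currentOutput)) {
                // A detected change: either a regression, or an intended change
                // that the tester approves by updating the golden file.
                throw new AssertionError("Behavior changed for " + testName);
            }
        }
    }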

Using this testing paradigm, an AI could thus create such Golden Master tests for an existing (and approved) version of the software. After the software is changed, these tests would show the human tester any changes in functionality (or the absence thereof). The tester would then only need to review new functionality or detected changes to existing functionality.

In many cases, this alone brings huge savings in effort and a tremendous decrease in risk. The reason why this works for AI is simply that it circumvents the oracle problem. The AI does not need to decide whether a specific functionality is correct; it merely needs to execute the software and record its behavior.

Having solved the main challenge that today keeps AI from testing software, we can now turn to some remaining challenges. These are additional challenges that we would face even if we could somehow magically solve the oracle problem. One is that the AI needs to understand how to execute the software. That is, given a (possibly empty) history of previous actions and the current state of the software, the AI needs to decide which user action to perform next.

Formulated like that, the problem is very comparable to playing a game like chess or Go. In fact, we already have AIs that play computer games and have to solve the exact same problem. So we have a clear path for how to accomplish the task. The only difference lies in formulating a suitable reward function.

For computer games, such a reward function is rather easy: "increase the number of points". For executing different use cases of business software, this could be something like "increase code coverage" or a similar metric. Supplying recordings of typical usage scenarios for the AI to learn from would overcome initial challenges like guessing a correct username/password combination or finding valid values for dates, emails or other, more obscure input data (think of SAP transaction codes).
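A minimal sketch of such a coverage-driven reward function might look as follows; the CoverageReward class and the branch identifiers are invented for illustration. The reward plays the same role as "increase the number of points" does for a game-playing AI.

    import java.util.HashSet;
    import java.util.Set;

    public class CoverageReward {

        // Branches of the system under test that have been exercised so far.
        private final Set<String> coveredBranches = new HashSet<>();

        // Reward for one executed action: the number of branches it covered for the first time.
        double reward(Set<String> branchesHitByAction) {
            long newlyCovered = branchesHitByAction.stream()
                    .filter(coveredBranches::add) // Set.add returns true only for unseen branches
                    .count();
            return newlyCovered;
        }
    }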

In the process of generating such recordings, the AI could already test for performance and some aspects of reliability and security, as mentioned above. Any technical errors it encounters (where the oracle is the simple fact that such errors should never occur) it could report, making separate smoke testing obsolete as well. Note that, as mentioned above, improvements to the usability of the software will probably boost the performance of AI in testing as well.
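Such technical errors form an implicit oracle that requires no knowledge of the business logic. A small illustrative sketch, with invented checks, could look like this:

    public class TechnicalErrorOracle {

        // Some failures are wrong regardless of what the software is supposed to do.
        static void checkResponse(int httpStatus, String pageSource) {
            // Server-side crashes should simply never occur.
            if (httpStatus >= 500) {
                throw new AssertionError("Server error: HTTP " + httpStatus);
            }
            // Stack traces leaking into the UI are likewise unambiguous defects.
            if (pageSource.contains("Exception") || pageSource.contains("Stack trace")) {
                throw new AssertionError("Unhandled exception surfaced in the UI");
            }
        }
    }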

It is noteworthy that we have long had an automated testing approach that is, in principle, capable of achieving the same results. It is called monkey testing. This approach is named after the infinite monkey theorem, which states that a monkey on a typewriter, hitting random keys for eternity, will eventually write all the works of Shakespeare. The reasoning is simple: given eternity, it will produce all possible combinations of characters.

One such (long) combination will be the works of Shakespeare, together with every possible variation thereof. Monkey testing simply applies this theorem to testing by generating random inputs on the GUI. Systems for this already exist. Using AI, we simply increase the efficiency and get valuable results in reasonable time rather than in eternity.
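A bare-bones GUI monkey only takes a few lines. The UiDriver interface below is an invented abstraction over whatever drives the GUI; a fixed random seed keeps the run reproducible, and AI-guided exploration would replace the purely random choices with informed ones.

    import java.util.List;
    import java.util.Random;

    public class GuiMonkey {

        // Invented abstraction over the GUI under test.
        interface UiDriver {
            List<String> visibleElements();   // ids of currently clickable elements
            void click(String elementId);
            void type(String text);
        }

        static void run(UiDriver ui, int steps, long seed) {
            Random random = new Random(seed); // fixed seed makes the run reproducible
            for (int i = 0; i < steps; i++) {
                List<String> elements = ui.visibleElements();
                if (!elements.isEmpty() && random.nextBoolean()) {
                    ui.click(elements.get(random.nextInt(elements.size())));
                } else {
                    ui.type(Integer.toString(random.nextInt(1000)));
                }
                // Any crash or technical error encountered here is a finding in itself.
            }
        }
    }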

A new testing process

Given the insights from the previous sections, a new testing process can be envisioned that looks like the following: A new piece of software is created. Human testers make sure that this software is correct and complete, usable and secure. Note that the first two tasks could just as well be assigned to the role of business analysts.

The software is then given to an AI, which is trained on recordings of typical usage scenarios and thus knows how to execute the software. The AI executes the software and records a sufficient number of different input/output scenarios as a Golden Master, which allows changes in the next version of the software to be detected. Other quality aspects are taken care of by the AI as well: it tests for performance, known security attacks and fault tolerance.

Using the feedback from the AI, from testers and business analysts, and from actual users, the developers improve the software in the next sprint. A subset of the Golden Master tests could be executed nightly or after every commit, providing early feedback for developers. After the next version is created, the full set of Golden Master tests is executed, showing every change in behavior and allowing for both easy approval of those changes and stable GUI tests.
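One possible, purely illustrative way to split the suite between fast per-commit feedback and the full pre-release run, with invented names and a simple time budget, might look like this:

    import java.util.ArrayList;
    import java.util.List;

    public class GoldenMasterSuite {

        // Invented record describing one Golden Master test and its typical runtime.
        record GoldenMasterTest(String name, long runtimeMillis, boolean touchesChangedCode) { }

        // Fast feedback after a commit: only tests touching changed code, within a time budget.
        static List<GoldenMasterTest> perCommitSubset(List<GoldenMasterTest> all, long budgetMillis) {
            List<GoldenMasterTest> selected = new ArrayList<>();
            long used = 0;
            for (GoldenMasterTest test : all) {
                if (test.touchesChangedCode() && used + test.runtimeMillis() <= budgetMillis) {
                    selected.add(test);
                    used += test.runtimeMillis();
                }
            }
            return selected;
        }
        // Before a release, the full list is executed and every reported change is reviewed and approved.
    }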

This will also increase the test coverage and dramatically reduce the risk of undetected changes. Testers are then free to focus on new functionality and changes to the behavior of the software. Note that this also allows for much better tracking (who approved which change?) and easier certification of the software.

This process will free testers from repetitive and mundane tasks, such as manual regression testing. It will thus augment testers, not replace them. What we are talking about here is, essentially, codeless and autonomous test automation: two buzzwords that have haunted the realm of test automation tools for years but turned out to be promises that tool vendors failed to deliver on. It also means that we are freeing many testers from a career choice they do not want to make, namely veering into test automation. Applying AI to testing in this way, testers have much to gain and practically nothing to lose.

Long-term perspective

The proposed process changes are such that they can be achieved with AI's current capabilities. Researchers expect that these capabilities will only improve and broaden over time. Once AI has gained human or super-human capabilities, there will be practically no task it cannot perform, whether that of a tester, a developer or a manager. But it is still unclear when that mark will be reached. And on the path to these capabilities, there are many more interesting milestones.

One ongoing discussion is whether AI threatens the jobs of testers. Following the above train of thought will yield near-complete automated tests, together with the capability to generate more such tests on demand. That essentially cracks the problem of impact analysis: finding out which parts of the software any given change affects.

Solving this problem allows us to apply AI to the adaptation and generation of source code. Think of automatically generating patches for bugs, automatically removing performance bottlenecks, or automatically improving the quality of source code by restructuring it, e.g. into shorter methods and classes.

No major capability comes with a big bang. We had driver assistance systems that helped us stay in lane, adapt the headlights or keep our distance long before we had full-fledged autonomous driving. The same will be true for the development and testing of software.

Having AI generate or improve small parts of the code will be the first step towards generating simple methods, then modules and eventually whole systems. And when that happens, the oracle problem will still be unsolved. So even with these approaches, someone will still need to make sure that the generated functionality is correct and complete.

Whether this role is then called developer, business analyst or tester is beyond my guess. But in my view, those who currently call themselves developers should probably be more worried about the long-term prospects of their jobs than those who call themselves testers.