Testing an automation system, a robot or a complex machine for reliability is a long, expensive, and challenging part of the product development lifecycle that can take months, or even years. And the later a flaw in the design is discovered, the harder, and more expensive, it will be to fix – as those automotive manufacturers forced to recall their vehicles, or those aircraft manufacturers forced to ground their planes, can attest.
Yet there is a vast middle ground between exhaustive testing during product development at one extreme and, at the other, the avoidance of innovation and the building-in of redundant systems that we find in the automotive and aerospace sectors respectively. For the majority of commercial and domestic products, a balance must be struck between very limited and exhaustive testing to reach the right level of reliability. Design engineers have to decide how good is good enough when testing designs, given that both time and money are limited.
The challenge is particularly acute for disruptive industries that produce highly complex machines in relatively low volumes and cannot afford to build years-long testing into their development process, such as advanced scientific equipment, surgical robotics or additive manufacturing systems. Increasingly, product design for reliability in these industries will be informed by simulations and statistical approaches to testing.
In this blog:
- Using simulations to design and then test the operation of complex machines
- Exploiting statistics to test physical hardware
- Statistical methods for testing software
Using simulations to design and then test the operation of complex machines
With computing costs reducing by an order of magnitude every decade or so [1,2], it is easier than ever to justify the use of simulations to design products and then model performance characteristics. This will go beyond just modelling the basic operation of a system from its parts; it will enable the effects of variations in part tolerances to be analysed and the interaction between parts to be monitored in simulated conditions. Even extended life testing can be envisaged by running thousands of simulated models in parallel on powerful compute arrays.
The enabling factor for this grand vision will be a combination of more realistic physics-based simulations and libraries of well-defined parts with pre-determined reliabilities. As with all models, these will need supplementing with physical data, such as real-world values of wear and friction. Ultimately the simulations will be only as accurate as the assumptions and simplifications made, and how well the real world can be represented.
One methodology that will be used more frequently in this world of virtual testing is statistical simulation. For example, each model can be built slightly differently by simulating part tolerances and, importantly, the stack-up of those tolerances can be tested using a Monte Carlo approach. Within a day or two, such runs can generate thousands of data points, simulating the equivalent of hundreds of thousands of hours of operation, and give confidence that a design will work flawlessly or show how frequently it will fail.
By building up distributions of simulated tolerances we can tease apart the stack-up effects of mechanical tolerances. This matters because in robotic systems, and many other mechanical systems, the stack-up of mechanical tolerances can exceed the maximum operational window and become a major source of failures. A technique that tells us this will happen before we commit to a design is highly valuable. The same technique can also reveal which parts of a system will be highly accurate, and repeatably so. A future of statistical simulations is one that enables confidence in a design before any hardware prototype has been committed to production.
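As a minimal sketch of a Monte Carlo tolerance stack-up, the Python below draws each part dimension at random and counts how often the assembled length falls outside an allowed window. The part dimensions, the window, the function name simulate_stack_up and the choice to treat each tolerance as three standard deviations of a normal distribution are all illustrative assumptions, not values from a real design.

```python
import random

def simulate_stack_up(nominals, tolerances, window, trials=100_000, seed=0):
    """Estimate how often a stack of parts falls outside an allowed window.

    Each dimension is drawn from a normal distribution centred on its nominal
    value, with the +/- tolerance treated as three standard deviations
    (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    low, high = window
    failures = 0
    for _ in range(trials):
        # Build one virtual assembly by sampling every part once.
        total = sum(rng.gauss(nom, tol / 3.0)
                    for nom, tol in zip(nominals, tolerances))
        if not (low <= total <= high):
            failures += 1
    return failures / trials

# Hypothetical four-part assembly, dimensions in millimetres.
nominals = [10.0, 25.0, 5.0, 40.0]
tolerances = [0.05, 0.10, 0.02, 0.10]
window = (79.90, 80.10)  # allowed overall length

failure_rate = simulate_stack_up(nominals, tolerances, window)
print(f"Estimated out-of-window rate: {failure_rate:.2%}")
```

In this made-up case every part is within its own tolerance, yet a few percent of assemblies still fall outside the window, which is the kind of stack-up effect the simulation is meant to expose before hardware is built.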
Exploiting statistics to test physical hardware
Statistical testing is useful in real-life testing of physical hardware too. One example of a real-life implementation of a statistical approach is deliberately forcing a physical system to operate at the margins of its expected operating conditions, where failure modes occur sporadically. A suitably designed set-up can identify, within a week, failure modes that would be seen rarely, if ever, during a normal test run; these one-in-100,000 failure events might otherwise appear only during multi-leg testing of tens of physical systems working in parallel, full-time, over many months.
This is not the same as extended life testing, but it allows the boundaries of operation to be explored and mapped in more detail. Some failures have the character of a cliff-edge in mechanical error, where jams occur; others are more gradual and result in more unexpected failure modes. Generally, failures observed during statistical testing of hardware are totally unexpected by the original design team. By finding where, and how, failures occur, the degree of operational margin can be quantified and, where necessary, fixes can be identified quickly.
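To see why a one-in-100,000 event is so hard to catch in a normal test run, the sketch below estimates how many cycles are needed to observe at least one failure with a given confidence. It assumes independent cycles with a constant failure probability; the function name cycles_for_detection and the 1-in-1,000 rate used for operation at the margin are hypothetical.

```python
import math

def cycles_for_detection(failure_probability, confidence=0.95):
    """Cycles needed to see at least one failure with the given confidence.

    Solves 1 - (1 - p)**n >= confidence for n, assuming every cycle is
    independent and fails with the same probability p.
    """
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - failure_probability))

# One-in-100,000 failure under normal operating conditions.
nominal_cycles = cycles_for_detection(1e-5)

# Hypothetical: the same failure mode occurring once in 1,000 cycles
# when the system is deliberately run at the edge of its operating window.
margin_cycles = cycles_for_detection(1e-3)

print(f"Nominal conditions: {nominal_cycles:,} cycles")
print(f"At the margin:      {margin_cycles:,} cycles")
```

Under these assumptions roughly 300,000 cycles are needed at nominal conditions, against about 3,000 at the margin, which is why pushing a system towards its limits can compress months of testing into days.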
Statistical methods for testing software
Testing the behaviour of a control system that contains software is much harder because of the large number of parameters involved. It is fundamentally a combinatorial problem. Indeed, perhaps the largest challenge in reliability testing generally stems from the software in a system, rather than the well-understood hardware platform. Innovations such as the unit test and the software harness are used to test what is effectively a black box against requirements, but even these can fail if unexpected operational corner cases, or the interactions of components with each other and with the real world, are not captured. It is in these corner cases that a system enters an undefined area of operation, and that leads to failure.
But the tyranny of the combinatorial problem has a solution, and the future has to include it or we face an increasingly frustrating future of bug-filled software. The challenge of testing software is that as the number of variables increases linearly, the number of combinations to test rises disproportionately, because variables are not independent but can and will interact with each other. This relationship is the primary reason why the imperfectly tested software we use every day is, largely, littered with bugs. The solution to this testing challenge is based on the mathematics of orthogonal arrays, originally championed by Genichi Taguchi for quality processing and manufacturing, and the foundation of the larger, more mathematically sound, area of Combinatorial Testing [3].
Orthogonal arrays make it possible to radically reduce the number of design experiments in such situations. Helpfully, organisations such as the National Institute of Standards and Technology have precomputed the small number of experiments required for a given number of variables. They describe an example of a mobile phone application for which exhaustive testing would require 172,800 configurations. With combinatorial testing, a sub-set of only 9,168 experiments is required for 6-way coverage, which is about 5% of the exhaustive set and so a reduction of roughly 95% [4]. When fewer variables need to be covered in combination, such as 2-way or 3-way coverage, the reduction in testing time and cost is greater. Future software will be more reliable once combinatorial testing becomes more established as a testing methodology.
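The growth described above can be put into numbers with a short sketch. It compares the size of an exhaustive test set with the number of distinct value pairs that a 2-way (pairwise) test suite has to cover, assuming, purely for illustration, that every variable takes three values; the function names are hypothetical.

```python
from math import comb

def exhaustive_tests(num_variables, values_per_variable):
    """Every combination of every value of every variable."""
    return values_per_variable ** num_variables

def pairwise_targets(num_variables, values_per_variable):
    """Distinct value pairs that a 2-way test suite has to cover."""
    return comb(num_variables, 2) * values_per_variable ** 2

# Illustrative assumption: each variable has three possible values.
for k in (5, 10, 20):
    print(f"{k:>2} variables: "
          f"{exhaustive_tests(k, 3):>13,} exhaustive tests, "
          f"{pairwise_targets(k, 3):>5,} value pairs to cover")
```

Doubling the variables from 10 to 20 takes the exhaustive set from 59,049 tests to more than three billion, while the number of value pairs grows only from 405 to 1,710. Because a single test case covers many pairs at once, a far smaller suite can cover all of them.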
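The idea can be sketched with a simple greedy generator for 2-way (pairwise) coverage. This is not the algorithm behind the precomputed NIST arrays; it is a minimal illustration that enumerates every candidate configuration, so it is only practical for small problems. The parameter names and values are hypothetical.

```python
from itertools import combinations, product

def pairwise_suite(parameters):
    """Greedy 2-way (pairwise) test generation.

    parameters maps each name to its list of values. Returns test cases
    (dicts) that together cover every value pair of every pair of
    parameters. Greedy, so the suite is small but not guaranteed minimal.
    """
    names = list(parameters)
    # Every (parameter, value) pair combination that must appear in a test.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in parameters[a]
        for vb in parameters[b]
    }
    # All exhaustive configurations serve as candidates (small problems only).
    candidates = [dict(zip(names, values))
                  for values in product(*parameters.values())]

    def pairs_of(case):
        return {((a, case[a]), (b, case[b]))
                for a, b in combinations(names, 2)}

    suite = []
    while uncovered:
        # Pick the configuration that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda case: len(pairs_of(case) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite

# Hypothetical configuration options for a mobile application.
parameters = {
    "orientation": ["portrait", "landscape", "square"],
    "keyboard": ["none", "qwerty", "12key"],
    "touchscreen": ["finger", "stylus", "none"],
    "navigation": ["dpad", "trackball", "wheel"],
}

suite = pairwise_suite(parameters)
exhaustive = 1
for values in parameters.values():
    exhaustive *= len(values)
print(f"{len(suite)} pairwise tests instead of {exhaustive} exhaustive ones")
```

For four parameters with three values each, exhaustive testing needs 81 configurations, whereas the pairwise suite needs only a fraction of that while still exercising every two-variable interaction at least once.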
Conclusion
Testing has always been a long and tedious part of development, but it is necessary to ensure that reliable products go into production. Whatever the industrial sector, there are powerful methods to improve the reliability of all classes of automation, embracing both the hardware and software components of the design. The future should see some of these tools being used to expedite the development path, cut costs and eliminate much of the frustration.
References
02. AWS c4.8xlarge