Software Testing Best Practices
Ram Chillarege
Center for Software Engineering, IBM Research, Circa 1997
Abstract:
This report lists 28 best practices that contribute to improved software testing. They are not necessarily related to software test tools; some may have associated tools, but they are fundamentally practices. The collection represents practices that several experienced software organizations have gained from and recognize as key.
Introduction
Every time we conclude a study or task force on the subject of the software development process, one recommendation comes through loud and clear: "We need to adopt the best practices in the industry." While it appears an obvious conclusion, the glaring absence of such practices continues to astound each study team. So telling is their presence that it distinguishes the winners from the also-rans like no other factor. The search for best practices is constant. Some are known and well recognized, others debated, and several hidden. Sometimes a practice that is obvious to the observer is invisible to the practitioner, who chants "that's just the way we do things." At other times, what is well known in one community is never heard of in another. The list in this article is focused on software testing. While every attempt is made to keep the focus on testing, we know that testing does not stand alone. It is intimately dependent on the development activity and therefore draws heavily on development practices. But finally, testing is a separate process activity -- the final arbiter of validity before the user assesses the product's merit.
The collection of practices has come from many sources -- at this point indelibly blended with its long history. Some were identified merely through a recognition of what is in the literature; others through focus groups where practitioners identified what they valued. The list has been sifted and shared with an increasing number of practitioners to gain their insight, and finally culled down to a reasonable number. A long list is hard to conceptualize, let alone translate into implementation. To be actionable, we need to think in terms of steps -- a few at a time -- and avenues to tailor the choice to our own independent needs. I like to think of the practices as Basic, Foundational, and Incremental.
The Basics are exactly that. They are the training wheels you need to get started, and when you take them off, it is evident that you know how to ride. But remember: taking them off does not mean you forget how to ride. This is an important difference, all too often forgotten in software. "Yeah, we used to write functional specifications but we don't do that anymore" means you forgot how to ride, not that you no longer needed that step. The Basic practices have been around for a long time. Their value is widely recognized and documented in our software engineering literature, and their applicability is broad, regardless of product or process.
The Foundational practices are the rock in the soil that protects your efforts against the harshness of nature, be it a redesign of your architecture or enhancements to sustain unforeseen growth. They need to be put down thoughtfully, and they will make the difference in the long haul, whether you build a ranch or a skyscraper. Their added value is significant and has been established by a few leaders in the industry. Unlike the Basics, they are probably not as well known and therefore need implementation help. While there may be no textbooks on them yet, there is plenty of documentation to dig up.
The Incremental practices provide specific advantages in special conditions. Rather than broad gains across the board of testing, they are more specialized. These are the right-angle drills: when you need one, nothing else can get between narrow studs and drill a hole perfectly square. At the same time, if there were just one drill you were going to buy, it may not be your first choice. Not all of these practices are widely known or well documented, but all of them are powerful when judiciously applied.
The next sections describe each of the practices and are grouped under Basics, Foundational, and Incremental.
The Basic Practices
- Functional Specifications
- Reviews and Inspection
- Formal entry and exit criteria
- Functional test - variations
- Multi-platform testing
- Internal Betas
- Automated test execution
- Beta programs
- 'Nightly' Builds
1. Functional Specifications
Functional specifications are a key part of many development processes and came into vogue with the development of the waterfall process. While they are a development-process artifact, they are critically necessary for software functional test. A functional specification typically describes the external view of an object or a procedure, indicating the options by which a service can be invoked. Testers use it to write test cases from a black-box testing perspective. The advantage of having a functional specification is that test generation can happen in parallel with the development of the code. This is ideal along several dimensions. Firstly, it gains parallelism in execution, removing a serious serialization bottleneck in the development process: by the time the code is ready, the test cases are also ready to be run against it. Secondly, it forces a degree of clarity from the perspective of a designer and an architect that is essential for the overall efficiency of development. Thirdly, the functional specifications become documentation that can be shared with customers to gain an additional perspective on what is being developed.
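To make the parallelism concrete, here is a minimal sketch in Python. The specification, the function, and its rules are invented for illustration; the point is that the test class is derived from the specification text alone and can be written before the code exists.

```python
# Hypothetical functional specification (external view only):
#   discount(amount, member) -> float
#   Members get a 10% discount on orders of 100 or more; otherwise 0.

import unittest

def discount(amount: float, member: bool) -> float:
    """Implementation under test; testers never need to see this body."""
    return amount * 0.10 if member and amount >= 100 else 0.0

class BlackBoxDiscountTest(unittest.TestCase):
    # Each test is derived from the specification text alone.
    def test_member_at_threshold(self):
        self.assertEqual(discount(100, member=True), 10.0)

    def test_member_below_threshold(self):
        self.assertEqual(discount(99.99, member=True), 0.0)

    def test_non_member_large_order(self):
        self.assertEqual(discount(500, member=False), 0.0)

if __name__ == "__main__":
    unittest.main()
```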
2. Reviews and Inspection
Software inspection, which was invented by Mike Fagan in the mid-70s at IBM, has grown to be recognized as one of the most efficient methods of debugging code. Today, 20 years later, several books have been written on software inspection, tools are available, and consulting organizations teach the practice. It is argued that software inspection can easily provide a tenfold gain in the process of debugging software. Not much more needs to be said, since this is a fairly well-known and well-understood practice.
3. Formal Entry and Exit Criteria
The notion of formal entry and exit criteria goes back to the evolution of the waterfall development processes and a model called ETVX, again an IBM invention. The idea is that every process step, be it inspection, functional test, or software design, has precise entry and exit criteria. These are defined by the development process and are watched by management to gate the movement from one stage to another. It is arguable how precise any one of the criteria can be, and with the decreased emphasis on development process, entry and exit criteria went out of currency. However, this practice allows much more careful management of the software development process.
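A sketch of how such gates can be made machine-checkable follows; the stage, its criteria, and the status fields are illustrative assumptions, not the ETVX definitions.

```python
# A minimal sketch of machine-checkable entry/exit gates in the spirit
# of ETVX; the stage and criteria below are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, Dict

Status = Dict[str, int]

@dataclass
class Stage:
    name: str
    entry: Callable[[Status], bool]   # must hold before the stage starts
    exit: Callable[[Status], bool]    # must hold before work moves on

functional_test = Stage(
    name="functional test",
    entry=lambda s: s["spec_inspected"] and s["build_available"],
    exit=lambda s: s["tests_run"] == s["tests_planned"]
                   and s["open_severity1_defects"] == 0,
)

status = {"spec_inspected": 1, "build_available": 1,
          "tests_run": 412, "tests_planned": 412,
          "open_severity1_defects": 0}

assert functional_test.entry(status), "entry criteria not met"
assert functional_test.exit(status), "exit criteria not met; do not promote"
print("gate passed: move to the next stage")
```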
4. Functional Test - Variations
Most functional tests are written as black-box tests working off a functional specification. The test cases that are generated are usually variations on the input space coupled with visits to the output conditions. A variation refers to a specific combination of input conditions that yields a specific output condition. Writing functional tests involves writing different variations to cover as much of the state space as one deems necessary for a program. The best practice involves understanding how to write variations and gain coverage adequate to thoroughly test the function. Given that there is no measure of coverage for functional tests, the practice of writing variations involves an element of art. The practice has been in use in many locations within IBM, and we need to consolidate our knowledge to teach new function testers the art and practice.
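A small sketch of enumerating variations systematically, where each variation pairs one combination of input conditions with its expected output; the parameters and the expected-output rule here are hypothetical.

```python
# Enumerating variations: each variation is one combination of input
# conditions paired with the output condition it should yield. The
# parameters and the expectation rule are invented for illustration.

from itertools import product

file_sizes = ["empty", "small", "at_limit", "over_limit"]
permissions = ["read_only", "read_write"]
encodings = ["ascii", "utf8"]

def expected(size, perm, enc):
    # Illustrative rule: uploads over the limit are always rejected.
    return "reject" if size == "over_limit" else "accept"

variations = [
    (size, perm, enc, expected(size, perm, enc))
    for size, perm, enc in product(file_sizes, permissions, encodings)
]

print(f"{len(variations)} variations")   # 4 * 2 * 2 = 16
for v in variations[:3]:
    print(v)
```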
5. Multi-platform Testing
Many products today are designed to run on different platforms, which creates an additional burden in both designing and testing the product. When code is ported from one platform to another, modifications are sometimes made for performance reasons. The net result is that testing on multiple platforms has become a necessity for most products, and techniques to do this better, in both development and testing, are essential. This best practice should address all aspects of multi-platform development and testing.
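One common tactic is to keep a single test bucket and isolate legitimate platform differences in one table. A brief sketch, with illustrative expectations:

```python
# One test bucket for all platforms; expectations that legitimately
# differ per platform live in a single table. The entries and the
# Windows-only case are assumptions for illustration.

import os
import platform
import unittest

PATH_SEPARATOR = {"Linux": "/", "Darwin": "/", "Windows": "\\"}

class MultiPlatformTest(unittest.TestCase):
    def test_path_separator(self):
        # Same assertion everywhere; only the expectation table varies.
        self.assertEqual(os.sep, PATH_SEPARATOR[platform.system()])

    @unittest.skipUnless(platform.system() == "Windows", "Windows-only port")
    def test_windows_specific_port(self):
        pass  # exercises code that exists only in the Windows port

if __name__ == "__main__":
    unittest.main()
```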
6. Internal Betas
The idea of a Beta is to release a product to a limited number of customers and get feedback to fix problems before a larger shipment. In larger companies, such as IBM, Microsoft and Oracle, many products are used internally, which forms a good Beta audience. Techniques to conduct such an internal Beta well are essential if we are to obtain good coverage and use internal resources efficiently. This best practice has everything to do with running a Beta program, though on a smaller internal scale, so as to best leverage it and reduce the cost and expense of an external Beta.
7. Automated Test Execution
The goal of automated test execution is to minimize the manual work involved in test execution and to gain higher coverage with a larger number of test cases. Automated test execution has a significant impact both on the tool sets used for execution and on the way tests are designed. Integral to automated test environments is the test oracle, which verifies correct operation and logs failures with diagnostic information. This practice is fairly well understood in some segments of software testing and not in others. The best practice, therefore, needs to leverage what is known and then develop methods for areas where automation is not yet fully exploited.
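The bare-bones harness below shows the two essentials: unattended execution, and an oracle that verifies each result and logs failures with diagnostic information. The unit under test and the cases are invented.

```python
# A minimal automated-execution harness: run each case unattended,
# let an oracle judge the outcome, and log failures with diagnostics.

import logging

logging.basicConfig(filename="test_run.log", level=logging.INFO)

def normalize(s: str) -> str:          # unit under test (illustrative)
    return " ".join(s.split()).lower()

def oracle(actual: str, expected: str) -> bool:
    return actual == expected

CASES = [
    ("  Hello   World ", "hello world"),
    ("MiXeD\tCASE", "mixed case"),
    ("", ""),
]

failures = 0
for case_input, expected in CASES:
    actual = normalize(case_input)
    if oracle(actual, expected):
        logging.info("PASS %r", case_input)
    else:
        failures += 1
        logging.error("FAIL input=%r expected=%r actual=%r",
                      case_input, expected, actual)
print(f"{len(CASES)} cases, {failures} failures (see test_run.log)")
```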
8. Beta Programs (see Internal Betas)
9. 'Nightly' Builds
The concept of a nightly build has been in vogue for a long time. While a build is not necessarily done every single day, the concept captures frequent builds from the changes being promoted into the change control system. The advantages are, firstly, that if a major regression occurs because of recently introduced errors, it is caught quickly; secondly, that regression tests can be run in the background; and thirdly, that newer releases of the software are available to developers and testers sooner.
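Mechanically, this reduces to a scheduled driver: sync, build, run regression, report. In the sketch below the three commands are echo placeholders for a team's own source-control, build, and regression tooling.

```python
# Sketch of a nightly driver; the commands are placeholders, not a
# real build system.

import subprocess
from datetime import datetime

def step(name, cmd):
    print(f"[{datetime.now():%H:%M}] {name}")
    return subprocess.run(cmd).returncode == 0

def nightly():
    if not step("sync", ["echo", "fetch latest promoted changes"]):
        return "sync failed"
    if not step("build", ["echo", "compile and link"]):
        return "build broken -- notify last committers"
    if not step("regress", ["echo", "run regression bucket in background"]):
        return "regression failures -- triage in the morning"
    return "clean build available to developers and testers"

if __name__ == "__main__":
    print(nightly())
```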
The Foundational Practices
- User Scenarios
- Usability Testing
- In-process ODC feedback loops
- Multi-release ODC/Butterfly profiles
- Requirements for test planning
- Automated test generation
10. User Scenarios
As we integrate multiple software products and create end-user applications that invoke one or several products, the task of testing the end-user features gets complicated. One viable method of testing is to develop scenarios that exercise the functionality of the applications; we broadly call these User Scenarios. The advantage of a user scenario is that it tests the product in ways that most likely reflect customer usage, imitating what Software Reliability Engineering has long advocated under the concept of the Operational Profile. A further advantage is that it reduces the complexity of writing test cases, by testing scenarios rather than individual features of an application. However, developing user scenarios and using enough of them to get adequate coverage at a functional level continues to be a difficult task. This best practice should capture methods of recording user scenarios and developing test cases based on them. In addition, it could discuss potential diagnosis methods for when specific failure scenarios occur.
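As a sketch, a scenario can be recorded as an ordered sequence of steps and drawn in proportion to expected customer usage, in the spirit of an operational profile. The scenarios, weights, and steps here are hypothetical.

```python
# A user scenario as an ordered sequence of steps, weighted by how
# often customers perform it. The names and weights are invented.

import random

SCENARIOS = {
    # name: (relative usage weight, ordered steps)
    "browse_and_buy":  (0.70, ["login", "search", "add_to_cart", "checkout"]),
    "return_item":     (0.25, ["login", "order_history", "request_return"]),
    "bulk_admin_edit": (0.05, ["admin_login", "bulk_edit", "publish"]),
}

def run_step(step: str) -> bool:
    return True          # stand-in for driving the real application

def run_scenario(name: str) -> bool:
    return all(run_step(s) for s in SCENARIOS[name][1])

# Draw scenarios in proportion to expected customer usage.
names = list(SCENARIOS)
weights = [SCENARIOS[n][0] for n in names]
for name in random.choices(names, weights=weights, k=5):
    print(name, "PASS" if run_scenario(name) else "FAIL")
```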
11. Usability Testing
For a large number of products, usability becomes the final arbiter of quality. This is true of many desktop applications that gained market share by providing a good user experience. Usability testing needs not only to assess how usable a product is but also to provide feedback on methods to improve the user experience and thereby gain a positive quality image. The best practice for usability testing should also draw on advances in the area of the Human Computer Interface.
12. In-Process ODC Feedback Loops
Orthogonal Defect Classification (ODC) is a measurement method that uses the defect stream to provide precise measurability into the product and the process. Given the measurement, a variety of analysis techniques have been developed to assist management and decision making across a range of software engineering activities. One of the uses of ODC has been the ability to close feedback loops in a software development process, which has traditionally been a difficult task. While ODC supports a variety of other software management methods, the closing of feedback loops has been found over the past few years to be a much-needed process improvement and cost control mechanism.
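A toy feedback loop is sketched below: classify each defect by an ODC attribute such as defect type, tabulate the distribution, and react when it deviates from what the current phase should surface. The records and the trigger rule are illustrative, not prescribed ODC values.

```python
# Toy in-process feedback loop over a classified defect stream.

from collections import Counter

defects = [
    {"id": 101, "type": "function"},
    {"id": 102, "type": "assignment"},
    {"id": 103, "type": "function"},
    {"id": 104, "type": "interface"},
    {"id": 105, "type": "function"},
]

dist = Counter(d["type"] for d in defects)
total = sum(dist.values())
print({t: f"{n / total:.0%}" for t, n in dist.items()})

# Feedback signal (illustrative rule): function-type defects dominating
# late in test suggests design-level escapes worth a process correction.
if dist["function"] / total > 0.5:
    print("signal: function-type defects dominate; revisit design reviews")
```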
13. Multi-Release ODC Butterfly
A key feature of the ODC measurement is the ability to look at multiple releases of a product and develop a profile of customer usage and its impact on warranty costs and overall development efficiency. Multi-release ODC Butterfly analysis allows a product manager to make strategic development decisions that balance development cost, time to market, and quality by recognizing customer trends, usage patterns, and product performance.
14. "Requirements" for Test Planning
One of the roles of software testing is to ensure that the product meets the requirements of the clientele. Capturing the requirements therefore becomes essential, not only to guide development but also to create test plans that can be used to gauge whether the developed product is likely to meet customer needs. Often, in smaller development organizations, the task of requirements management falls prey to conjecture about what ought to be developed as opposed to what is needed in the market. Therefore, requirements management and its translation into test plans is an important step. This practice needs to be understood and executed with a holistic view to be successful.
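One concrete artifact of this practice is a traceability check from requirements to planned tests; a minimal sketch with hypothetical requirement IDs and test names:

```python
# Requirements-to-test traceability: every requirement should map to
# at least one planned test case. All identifiers are invented.

requirements = {
    "REQ-1": "export report as PDF",
    "REQ-2": "handle 1000 concurrent sessions",
    "REQ-3": "localize UI into French",
}

test_plan = {
    "test_pdf_export":       ["REQ-1"],
    "test_pdf_export_empty": ["REQ-1"],
    "test_load_1000_users":  ["REQ-2"],
}

covered = {r for reqs in test_plan.values() for r in reqs}
for req in sorted(set(requirements) - covered):
    print(f"no planned test for {req}: {requirements[req]}")
# -> no planned test for REQ-3: localize UI into French
```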
15. Automated Test Generation
Almost 30% of the testing task can be the writing of test cases. To a first order of approximation, this is a completely manual exercise and a prime candidate for savings through automation. However, the technology for automation has not advanced as rapidly as one would have hoped. While automated test generation tools exist, they often produce too large a test set, defeating the gains from automation. On the other hand, there do exist a few techniques and tools that have been recognized as good methods for automatically generating test cases. The practice needs to identify which of these methods are successful and the environments in which they are viable. There is a reasonable amount of learning in the use of these tools and methodologies, but they do pay off past the initial ramp-up.
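The size problem, and one standard mitigation, can be shown in a few lines: instead of generating every combination of parameter values, cover every pair of values. The greedy all-pairs sketch below is illustrative; production tools use stronger algorithms, and the parameters are hypothetical.

```python
# Why generated suites blow up, and one mitigation: cover every PAIR
# of parameter values instead of every combination.

from itertools import combinations, product

params = {
    "os":     ["aix", "os2", "win95", "solaris"],
    "db":     ["db2", "oracle"],
    "locale": ["en", "fr", "ja"],
    "net":    ["tokenring", "ethernet"],
}
names = list(params)

def pairs(config):
    # All (name, value) pairs covered by one test configuration.
    items = list(zip(names, config))
    return {frozenset(p) for p in combinations(items, 2)}

everything = list(product(*params.values()))       # 4*2*3*2 = 48 tests
uncovered = set().union(*(pairs(c) for c in everything))

suite = []
while uncovered:
    # Greedily take the config covering the most still-uncovered pairs.
    best = max(everything, key=lambda c: len(pairs(c) & uncovered))
    suite.append(best)
    uncovered -= pairs(best)

print(f"exhaustive: {len(everything)} tests, all-pairs: {len(suite)} tests")
```

On this small example the suite shrinks from 48 configurations to roughly a dozen while still exercising every pairwise interaction.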
The Incremental Practices
- Teaming testers with developers
- Code coverage (SWS)
- Automated environment generator (Drake)
- Testing to help ship on demand
- State task diagram (Tucson)
- Memory resource failure simulation
- Statistical testing (Tucson)
- Semi-formal methods (e.g. SDL)
- Check-in tests for code
- Minimizing regression test cases
- Instrumented versions for MTTF
- Benchmark trends
- Bug bounties
16. Teaming Testers with Developers
It has been recognized for a long time that the close coupling of testers with developers improves both the test cases and the code that is developed. An extreme case of this practice is Microsoft, where every developer is shadowed by a tester. Needless to say, one does not have to resort to such an extreme to gain the benefits of teaming. This practice should therefore identify the kinds of teaming that are beneficial and the environments in which they may be employed. Its value should be more than just concept: it should include guidance on forming the right team, while reporting the pitfalls and successes experienced.
17. Code Coverage
The concept of code coverage is based on a structural notion of the code. Code coverage is a numerical metric that measures the elements of the code that have been exercised as a consequence of testing. A host of metrics -- statement, branch, and data coverage -- are implied by the term. Today, several tools exist that assist in this measurement and additionally provide guidance on covering elements not yet exercised. This is also an area that has had considerable academic play and has been an issue of debate for a couple of decades. The practice of code coverage should therefore carry information about the tools, the methods of employing code coverage, and how to track the benefits experienced.
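To show the underlying idea stripped of tooling, here is a bare-bones statement-coverage measurement using Python's trace hook; real coverage tools do this robustly, so treat it purely as illustration.

```python
# Minimal statement coverage via the trace hook.

import sys

def triangle(a, b, c):                 # unit under test (illustrative)
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

executed = set()

def tracer(frame, event, arg):
    if event == "line" and frame.f_code.co_name == "triangle":
        executed.add(frame.f_lineno)
    return tracer

sys.settrace(tracer)
triangle(3, 3, 3)        # exercises only the equilateral path
sys.settrace(None)

first = triangle.__code__.co_firstlineno
body = set(range(first + 1, first + 6))      # the 5 body lines above
print(f"statement coverage: {len(executed & body)}/{len(body)} lines")
```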
18. Automated Environment Generator
A fairly time-consuming task is setting up the test environments in which to execute test cases. These tasks take ever greater amounts of time as we support more operating systems, more versions, and code that runs on multiple platforms. The sheer task of bringing up an environment and taking it down for a different set of test cases can dominate the calendar in system test. Tools that can automatically set up environments, run the test cases, record the results, and then automatically reconfigure to a new environment have high value. The IBM Hursley Lab has developed a tool called DRAKE that does precisely this. This best practice should therefore capture the issues, tools, and techniques associated with environment setup, teardown, and the automatic running of test cases.
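The shape of such a tool can be sketched briefly: provision an environment, run the bucket, record results, and tear down before reconfiguring. The environments and the runner below are placeholders, not DRAKE itself.

```python
# Cycling test environments automatically: setup, run, record, teardown.

from contextlib import contextmanager

ENVIRONMENTS = [
    {"os": "aix41", "db": "db2"},
    {"os": "os390", "db": "db2"},
    {"os": "winnt", "db": "oracle"},
]

@contextmanager
def provisioned(env):
    print(f"setup: {env}")            # install OS image, DB, product
    try:
        yield env
    finally:
        print(f"teardown: {env}")     # always reclaim the hardware

def run_bucket(env):
    return {"env": env, "passed": 42, "failed": 0}   # stand-in runner

results = []
for env in ENVIRONMENTS:
    with provisioned(env) as e:
        results.append(run_bucket(e))

for r in results:
    print(r)
```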
19. State Task Diagram
This practice captures the functional operations of an application or a module in the form of a state transition diagram. The advantage of doing so is that one can create test cases automatically, or create coverage metrics that are closer to the functional decomposition of the application. There are a fair number of tools that allow for capturing Markov models, which may be useful for this practice. The difficulty has usually been in extracting the functional view of a product, which may not exist in any computable or documented form, and producing the state transition diagram. One of the automated test generation tools, Test Master from Teradyne, actually uses state task diagrams for the generation of functional tests. This practice possibly has more than one application, and the keepers of the practice need to capture the tools, the methods, and their uses.
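A small sketch of the core idea: given a state transition diagram, generate test sequences that cover every transition at least once. The toy machine below (a document editor) is an illustrative assumption.

```python
# Generate test sequences from a state transition diagram so that every
# transition is exercised at least once.

TRANSITIONS = {
    ("closed", "open"): "editing",
    ("editing", "type"): "editing",
    ("editing", "save"): "saved",
    ("saved", "type"): "editing",
    ("saved", "close"): "closed",
    ("editing", "close"): "closed",   # should prompt to save
}

def paths_covering_all_transitions(start="closed"):
    """Depth-first walks from `start`; each path uses an edge once."""
    tests, seen = [], set()
    def walk(state, path, used):
        extended = False
        for (src, event), dst in TRANSITIONS.items():
            if src == state and (src, event) not in used:
                seen.add((src, event))
                walk(dst, path + [event], used | {(src, event)})
                extended = True
        if not extended and path:
            tests.append(path)
    walk(start, [], set())
    assert seen == set(TRANSITIONS), "uncovered transitions"
    return tests

paths = paths_covering_all_transitions()
print(f"{len(paths)} sequences cover all {len(TRANSITIONS)} transitions")
for t in paths[:3]:
    print(" -> ".join(t))
```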
20. Memory Resource Failure Simulation
This practice addresses a particular class of software bug, namely the loss of memory because of poor heap management or the lack of garbage collection. It is a fairly serious problem for many C programs in Unix applications, and it exists on other platforms and in other languages as well. There are commercial tools available to help simulate memory failure and check for memory leaks. The practice should be generic, developing methods and techniques for use on different platforms and in different language environments.
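The technique can be sketched independently of language: inject a failure at the Nth allocation and check that the code degrades gracefully rather than crashing. The allocator wrapper and the unit under test below are hypothetical.

```python
# Fault injection: make the Nth allocation fail and verify graceful
# degradation instead of a crash.

class FlakyAllocator:
    """Fails with MemoryError on the Nth allocation request."""
    def __init__(self, fail_on: int):
        self.count, self.fail_on = 0, fail_on

    def alloc(self, nbytes: int) -> bytearray:
        self.count += 1
        if self.count == self.fail_on:
            raise MemoryError(f"injected failure on allocation {self.count}")
        return bytearray(nbytes)

def load_records(alloc, n):            # unit under test (illustrative)
    buffers = []
    try:
        for _ in range(n):
            buffers.append(alloc.alloc(4096))
    except MemoryError:
        return None                    # graceful degradation path
    return buffers

# Sweep the failure point so every allocation site gets exercised.
for fail_on in range(1, 6):
    result = load_records(FlakyAllocator(fail_on), n=5)
    assert result is None, "must degrade gracefully, not crash"
print("all injected allocation failures handled gracefully")
```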
21. Statistical Testing
The concept of statistical testing was invented by the late Harlan Mills (the IBM Fellow who invented Clean Room software engineering). The central idea is to use software testing as a means to assess the reliability of software, as opposed to a debugging mechanism. This is quite contrary to the popular use of software testing as a debugging method, so one needs to recognize that the goals and motivations of statistical testing are fundamentally different. There are many arguments as to why this might indeed be a very valid approach; the theory is buried in the concepts of Clean Room software engineering and is worthy of a separate discussion. Statistical testing exercises the software along an operational profile and measures interfailure times, which are then used to estimate its reliability. A good development process should yield an increasing mean time between failures every time a bug is fixed. This then becomes the release criterion and the condition under which to stop software testing.
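A worked toy example of the measurement side: track interfailure times gathered while testing along an operational profile and watch the trend. The numbers are invented, and real reliability-growth models are more sophisticated than this windowed average.

```python
# Estimate reliability growth from interfailure times.

# Hours of operation between successive failures, in the order observed.
interfailure_hours = [2.1, 3.4, 2.9, 5.8, 7.2, 6.9, 11.5, 14.0, 19.7]

WINDOW = 3
for i in range(WINDOW, len(interfailure_hours) + 1):
    window = interfailure_hours[i - WINDOW:i]
    mtbf = sum(window) / WINDOW
    print(f"after failure {i}: MTBF ~ {mtbf:5.1f} h")

# A rising trend supports the release criterion; a flat or falling
# trend says: keep testing (or fix the fixes).
```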
22. Semi-Formal Methods
The origin of formal methods in software engineering dates back a couple of decades. Over the years, the field has made considerable progress in some specific areas such as protocol implementation. The key concept of a formal method is that it allows for verification of a program, as opposed to testing and debugging. The verification methods are varied: some are theorem provers, while others are simulations against which assertions can be validated. The vision of formal methods has always been that if the specification of software is succinctly captured, it could lead to automatic generation of code, requiring minimal testing. In practice there has been much debate on viability, and to a large extent the industry ignores these methods. However, one must recognize a very key contribution from IBM's Hursley Lab, where the kernel of CICS was implemented after a formal specification written in Z. A semi-formal method is one where the specifications are captured in state transition diagrams or tables that can then be used even for test generation. IBM's Zurich Research Lab has done some work in this area and very successfully used it for protocol implementations. The best practice in semi-formal methods ought to capture our experience and also identify places where such applications may be viable.
23. Check-in Tests for Code
The idea of a check-in test is to couple an automatic test program (usually a regression test) with the change control system. Microsoft has been known to employ such a system very well. This allows an automatic test run on recently changed code, so that the chances of the code breaking the build are minimized. In fact, Microsoft's change control system and build are supposedly set up such that unless the code passes the test, it does not get promoted into the next build.
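A sketch of such a promotion gate, written as a hook the change control system would call before accepting a change; the test command is a placeholder for whatever quick bucket a team designates.

```python
# Promotion gate: the check-in test bucket must pass before a change
# is promoted into the next build. The bucket location is illustrative.

import subprocess
import sys

def checkin_tests_pass() -> bool:
    # Run the quick regression bucket against the changed code.
    result = subprocess.run(
        [sys.executable, "-m", "unittest", "discover", "-s", "checkin_tests"],
    )
    return result.returncode == 0

def promote_change(change_id: str) -> bool:
    if not checkin_tests_pass():
        print(f"change {change_id}: REJECTED, check-in tests failed")
        return False
    print(f"change {change_id}: promoted into the next build")
    return True

if __name__ == "__main__":
    promote_change(sys.argv[1] if len(sys.argv) > 1 else "unknown")
```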
24. Minimizing Regression Test Cases
In organizations that have a legacy of development and products that have matured over many releases, it is not uncommon to find regression test buckets that are huge. The negative consequence of such large buckets is that they take a long time to execute. At the same time, it is often unclear which of the test cases are duplicative, providing little additional value. There are several methods to minimize the regression tests. One of them looks at the code coverage produced by each test case and distills the bucket to a minimal set that preserves coverage. One must note that this method, though attractive, does confuse a structural metric with a functional test. Nevertheless, it is one way to implement the minimization.
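The coverage-based approach is essentially a set-cover problem; a greedy sketch with invented per-test coverage data:

```python
# Coverage-based minimization as greedy set cover: keep picking the
# test that adds the most not-yet-covered lines until coverage matches
# the full bucket.

coverage = {
    "t1": {1, 2, 3, 4},
    "t2": {3, 4},            # subsumed by t1
    "t3": {5, 6},
    "t4": {1, 5},            # adds nothing once t1 and t3 are chosen
    "t5": {6, 7, 8},
}

target = set().union(*coverage.values())
chosen, remaining = [], set(target)
while remaining:
    best = max(coverage, key=lambda t: len(coverage[t] & remaining))
    chosen.append(best)
    remaining -= coverage[best]

print(f"kept {len(chosen)} of {len(coverage)} tests: {chosen}")
```

The caveat in the text applies here: identical structural coverage does not mean identical functional intent, so dropped tests deserve review before deletion.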
25. Instrumented Versions for MTTF
An opportunity that a Beta program provides is that one gets a large sample of users to test the product. If the product is instrumented so that failures are recorded and returned to the vendor, they yield an excellent source for measuring the mean time between failures of the software. There are several uses for this metric. Firstly, it can be used as a gauge to enhance the product's quality in a manner that is meaningful to a user. Secondly, it allows us to measure the mean time between failures of the same product under different customer profiles or user sets. Thirdly, it can be enhanced to additionally capture first-failure data that could benefit diagnosis and problem determination. Microsoft has claimed to do at least the first two through instrumented versions shipped in their Betas.
26. Benchmark Trends
Benchmarking is a broad concept that applies to many disciplines. In the world of software testing, we could interpret it to mean comparing the techniques and the performance of testing methods as experienced by other software developers. Today there is no avenue to regularly exchange such information and compare benchmarks. This best practice could be initiated by benchmarking across IBM labs and then advanced to include a larger pool with competitors and customers.
27. Bug Bounties
We have heard that bug bounties were used at Microsoft, and we know that they were used in IBM during the 10X days. A bug bounty is an initiative that charges the organization with a focused effort to detect software bugs, at times providing rewards as well. Experience shows that such efforts tend to identify a larger than usual number of bugs. Clearly, additional resource is necessary to fix the bugs, but the net result is a higher quality product.