Clone Doctor: Software Clone Detection and Reporting

A Tool that Aids the Tracking and Removal of Duplicate Code to Reduce Maintenance Cost

The cost of maintaining a software system is directly proportional to the size of its source code base. Large software systems typically contain 10-25% duplicated code. This redundancy is caused by the common programming practice of replicating code fragments (typically by copy and paste, i.e., "cloning") and then customizing them to handle new demands on an application. An IT organization consequently spends a corresponding share of its budget redundantly maintaining this code; a bug in one code fragment is also a bug in all of its hidden clones. If redundant code is removed, the savings can be significant, because it typically costs at least $1.00 per source line per year to keep application software in running order. If the redundant code is tracked, changes made to one version can be reliably installed in the others.

The CloneDR™, built with the DMS/Software Reengineering Toolkit, identifies not only exact but also near-miss duplicates in software systems, and can be used on a wide variety of languages. It will find clones even where different formatting, different variable names and even different code snippets are used. It detects clones in code and/or data declarations.

CloneDR Features

  • Compares files exhaustively across whole systems
  • Available for many different languages and dialects
  • Parameterized to identify exact matches or near misses within a specifiable range of difference
  • Clones can be filtered by threshold size
  • Identifies parameterization schemes for matching near-miss clones
  • Uses compiler technology rather than string matching, finding clones irrespective of changed comments, white space, line breaks, and case changes. Proven to minimize false positives.
  • Detects clones in procedural code and/or data declarations
  • Supports analysis of thousands of files/millions of lines of code, utilizing parallel computing with multiple processors
  • Runs under Windows on a uniprocessor, and makes extensive use of symmetric multiprocessor (SMP) capability when it is available
  • Eclipse integration for IBM Enterprise languages.
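
To illustrate why comparing syntax trees is more robust than comparing strings, here is a minimal sketch using Python's standard `ast` module. It is not CloneDR's implementation (CloneDR is built on DMS and its own language front ends); the helper name `structural_key` and the sample fragments are assumptions made for illustration only.

```python
import ast

def structural_key(source: str) -> str:
    """Return a textual dump of the syntax tree.

    The parser discards comments and layout, and ast.dump omits line and
    column attributes by default, so two fragments that differ only in
    comments, white space, or line breaks produce the same key.
    """
    return ast.dump(ast.parse(source))

original = """
def total(xs):
    # sum the items
    s = 0
    for x in xs:
        s += x
    return s
"""

reformatted = """
def total(xs):
    s = 0
    for x in xs: s += x   # different comment, different line breaks
    return s
"""

changed = """
def total(xs):
    s = 1
    for x in xs:
        s += x
    return s
"""

print(structural_key(original) == structural_key(reformatted))  # True
print(structural_key(original) == structural_key(changed))      # False
```

A plain string comparison would report the first two fragments as different, while the tree comparison treats them as the same code and still separates the fragment whose logic actually changed.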

A System Undergoing Clone Detection and Removal


In this example, we have a large system (say, a million lines in hundreds or thousands of files) containing a block of code that implements bubble sort. That block has been cloned, modified, and reformatted several times, and the copies are spread across the huge code base. A bubble sort is a performance bug waiting to be encountered: it performs poorly when the number of items to sort is anything other than small.

Eventually some user of the system encounters slowness in sorting and a bug report is filed. Joe Programmer is assigned to fix it, finds one of the instances, determines that the sorting algorithm is a poor choice, fixes it and declares victory. The improved system goes out to the field and gets reinstalled. Up to this point, the significant expense has been finding and repairing the problem in the field. Fixing a bug in the field typically costs an organization close to a thousand dollars.

However, the user eventually discovers the performance problem again because the other clones were not fixed. Consequence: the users' time (perhaps that of many users) is wasted more than once, and the bug is repaired more than once, wasting significant resources. In addition, the users may conclude that the development organization is doing a poor job, because bugs don't get fixed.

The problem here is that Joe, while he may know that the software has clones, has no way of knowing where they are, or whether the code on which he is working is cloned. With systems being typically 20% cloned, Joe has a 20% chance of making this mistake for every fix he makes.

What is needed is a tool to either tell Joe about the clones, so that he can competently find and fix all of them, or to remove the clones, so that Joe can fix just the one instance.

In our example, the CloneDR will identify the 3 regions as being clones, and will produce a covering abstraction (detected code skeleton) that shows precisely what they have in common. Developers can also use the detected abstraction to remove the clones.
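
The idea of a covering abstraction can be sketched in a few lines of Python. This is a deliberately simplified illustration, not CloneDR's algorithm: it only handles clones that differ by consistent identifier renaming, and the names `_Abstract`, `skeleton`, `clone_1` and `clone_2` are hypothetical. Each distinct identifier is replaced by a numbered placeholder; if two fragments yield the same skeleton, the placeholders that were bound to different names are the parameters of the shared abstraction.

```python
import ast

class _Abstract(ast.NodeTransformer):
    """Replace each distinct identifier with a numbered placeholder."""

    def __init__(self):
        self.bindings = {}  # original name -> placeholder

    def _placeholder(self, name):
        # The first occurrence of a name allocates the next placeholder.
        return self.bindings.setdefault(name, f"P{len(self.bindings)}")

    def visit_Name(self, node):
        node.id = self._placeholder(node.id)
        return node

def skeleton(source):
    """Return (abstracted tree dump, placeholder -> original name)."""
    tree = ast.parse(source)
    abstractor = _Abstract()
    abstractor.visit(tree)
    names = {p: n for n, p in abstractor.bindings.items()}
    return ast.dump(tree), names

clone_1 = """
for i in range(n):
    for j in range(n - i - 1):
        if items[j] > items[j + 1]:
            items[j], items[j + 1] = items[j + 1], items[j]
"""

clone_2 = """
for k in range(count):
    for m in range(count - k - 1):
        if data[m] > data[m + 1]:
            data[m], data[m + 1] = data[m + 1], data[m]
"""

shape_1, names_1 = skeleton(clone_1)
shape_2, names_2 = skeleton(clone_2)

if shape_1 == shape_2:
    # Placeholders bound to different names are the abstraction's parameters.
    params = {p: (names_1[p], names_2[p])
              for p in names_1 if names_1[p] != names_2[p]}
    print("same skeleton; parameters:", params)
```

Here the two bubble-sort fragments share one skeleton, and the reported parameters are the loop counters, the length, and the array being sorted. A production detector also has to tolerate differing sub-expressions and inserted or deleted statements, which this sketch does not attempt.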

You can calculate your annual savings if you do careful clone management.
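
As a rough worked example of that calculation, the sketch below uses only the figures quoted on this page: a cost of at least $1.00 per source line per year and a duplication rate of 10-25%. The function name `annual_clone_cost` and the one-million-line system size are illustrative assumptions. The result is the yearly spend attributable to cloned lines, which is an upper bound on savings, since at least one copy of each clone must remain after removal.

```python
def annual_clone_cost(total_lines, clone_fraction, cost_per_line=1.00):
    """Yearly maintenance spend attributable to cloned source lines.

    cost_per_line defaults to the $1.00 per line per year figure quoted
    above; clone_fraction is the share of the code base that is cloned.
    """
    return total_lines * clone_fraction * cost_per_line

# Hypothetical one-million-line system at several duplication rates.
for fraction in (0.10, 0.20, 0.25):
    cost = annual_clone_cost(1_000_000, fraction)
    print(f"{fraction:.0%} cloned: ${cost:,.0f} per year")
```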

Remarkably, clone detection can also be used to improve test coverage.

Testimonial and Experience

Salion, Inc. has used the CloneDR on their 250,000-line Java application. By inspecting the CloneDR reports and doing only partial clone removal by hand, Salion eliminated 27,000 lines of code. This gave Salion's software an impossible growth curve: its size actually went down. (Does your code base ever get smaller?) "I really think this is the most exciting tool I've ever seen by far," said Dale Churchett, Salion's Software Architect.

A paper on Salion's experience using the CloneDR to help them find abstractions in code can be found here (PDF: 142Kb). The Software Engineering Institute discusses how Salion does product line development using CloneDR as a key driver.

How CloneDR Works

Available copy-and-paste detection tools work in a variety of different ways. You can learn more about how CloneDR works and how it compares to other detectors.

Available for the Following Languages

Evaluation Versions of CloneDR

Evaluation versions of the CloneDR clone analysis tool for Java, C#, C++, COBOL, JavaScript and PHP are available for download. (There are evaluation versions for a variety of other languages as above; check at the download link, or inquire.) The analysis tool will process large source systems to determine the percentage of clones, and will display a number of sample clones.

If you want an evaluation CloneDR for another language listed above, inquire at sales@semanticdesigns.com.

Other languages

Semantic Designs can build a CloneDR for almost any language. Contact us for your special case.

For more information: info@semanticdesigns.com    Follow us on Twitter: @SemanticDesigns
