Source Code Similarity Analysis across Multiple Software Variants
Software reuse approaches, such as software product lines, are known to enable considerable effort and cost savings when developing families of software systems with a significant overlap in functionality. In practice, however, the need for strategic reuse often becomes apparent only after a number of product variants have already been delivered. These variants are often created in an ad-hoc manner: a frequently observed practice is to clone the original system’s code and adapt it to the specific requirements of the customer. In such a situation, a reuse approach has to be introduced retroactively, based on the already existing product implementations.
The primary contribution of this thesis is a reverse engineering approach for obtaining information about the source code similarity of existing product variants. The approach is based on formalized criteria describing the variant similarity analysis problem. The variant systems are modeled as hierarchical sets of uniquely identifiable elements with known sizes, and the similarity of the variants is expressed using set algebra. The similarity information is available at any abstraction level, from a single line of code up to a whole system. A generic analysis framework is proposed that can be used with diverse system representations and diverse similarity detection algorithms, including clone detection. The approach supports simultaneous analysis of multiple source code variants and proposes visualization concepts that enable easy interpretation of the analysis results, even for large systems and a large number of variants. It is hypothesized that the proposed approach yields more detailed and more accurate variant similarity information with lower analysis effort than existing approaches. This hypothesis is currently being evaluated empirically.
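To make the set-based model more concrete, the following minimal Python sketch shows one way such an analysis could look. It is an illustration under stated assumptions, not the framework proposed in the thesis: the names ElementId, Variant, size and similarity, as well as the example data, are hypothetical. The sketch assumes that corresponding elements have already been matched across variants and therefore carry the same identifier and the same size.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical representation: an element is identified by its path from the
# system root (e.g. component, function); a variant maps leaf elements to sizes.
ElementId = Tuple[str, ...]
Variant = Dict[ElementId, int]  # size could be, for example, lines of code


def size(variant: Variant, ids: Iterable[ElementId], prefix: ElementId = ()) -> int:
    """Sum the sizes of the given elements that lie under `prefix`."""
    return sum(variant[i] for i in ids if i[:len(prefix)] == prefix)


def similarity(variants: Dict[str, Variant], prefix: ElementId = ()):
    """Per variant: total size, size shared by all variants, size unique to it."""
    id_sets = {name: set(v) for name, v in variants.items()}
    core = set.intersection(*id_sets.values())  # elements present in every variant
    result = {}
    for name, ids in id_sets.items():
        others = set().union(*(s for n, s in id_sets.items() if n != name))
        unique = ids - others  # elements present in this variant only
        result[name] = {
            "total": size(variants[name], ids, prefix),
            "core": size(variants[name], core, prefix),
            "unique": size(variants[name], unique, prefix),
        }
    return result


# Illustrative data (assumed): three variants of a small system.
variants = {
    "A": {("core", "parse"): 40, ("core", "emit"): 25, ("ui", "menu"): 10},
    "B": {("core", "parse"): 40, ("core", "emit"): 25, ("ui", "dialog"): 15},
    "C": {("core", "parse"): 40, ("ui", "menu"): 10},
}

whole_system = similarity(variants)               # system-level view
core_only = similarity(variants, ("core",))       # restricted to one subtree
```

Passing a longer prefix restricts the same set operations to a subtree of the hierarchy, which is how this sketch obtains similarity figures at different abstraction levels from a single model.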