Building a tool to diff the common functions in two different files in different git branches

Recently I came across a scenario that I suspect is quite common. There was a major feature launch, and the PR touched nearly 100 files that had to be merged into the master branch. Most of these were for a new dashboard view. But some functions used by an existing dashboard had also been refactored and moved to a common module so that the new dashboard could reuse them. And while being moved, a few of them had been changed as well.

Now the first thing I wanted to ensure was that the existing dashboard didn’t break. To test where it was liable to break, I needed to know exactly what had changed in the functions it depended on. I initially tried git’s own tooling to see if I could get a view that would let me focus on those changes alone while reviewing. While a couple of git tricks helped, I eventually had to build a small Python library to achieve the desired outcome. I am documenting the process here so that it might benefit the future me, or anyone else who stumbles into a similar situation.

What git offers

The GitHub PR view is not very helpful when the PR is very large and functions have been moved around. So if we stick to git’s own diffing utilities, the first thing to do is remove the noise. That is, we should be able to focus only on the files which were modified, ignoring new files and deletions. After some googling, I found that the following git diff command helps:

git diff --diff-filter=M master branch

This gives a significantly smaller view to explore. But if you already know the specific files you want to focus on, this is still too much noise. Then I found out that git diff can also compare two specific files in two different branches:

git diff master:./app/views/databuddy.py refactoring:./app/views/common.py

But this is a file-level diff and cannot be used to focus on function-level changes, particularly if the functions have been shuffled around within the file during the refactoring. I searched for a tool that would let me do this, couldn’t find one, and decided it would be quicker to build a library.

Requirements

  1. The tool should be able to fetch the content of files from different branches of a git repository
  2. It should be able to identify the function definitions in both the files, compare their names and identify the common functions
  3. Once the common functions are identified, it should be able to provide a diff view between the contents of the common functions

Fetching file contents from git branches

To interact with git from Python, there is a library called gitpython. We can get a file handle for a file in a particular branch of a git repo like this:

import io
from git import Repo  # provided by the gitpython package

def file_handle_from_git_ref(repo_path, relative_file_path, branch_name="master"):
    # Repo(path) opens an existing repository; Repo.init would run `git init` on the path
    repo = Repo(repo_path)
    tree = repo.heads[branch_name].commit.tree
    blob = tree[relative_file_path]
    return io.TextIOWrapper(io.BytesIO(blob.data_stream.read()))

And then in order to fetch the content of the file

def source_code_from_git_ref(repo_path, relative_file_path, branch_name="master"):
    return file_handle_from_git_ref(
        repo_path, relative_file_path, branch_name=branch_name).read()

Getting the functions defined in the files

When googling how to obtain the function definitions from a piece of source code, I learned that this is called static analysis: analyzing the code without actually executing it (analyzing it while it runs would be dynamic analysis). This is how IDEs can provide useful tools like autocompletion. There is a library called jedi which offers many such features. But I also figured out that the specific feature I needed - extracting and analyzing the function definitions - is best done with a built-in module, ast. The ast module lets us build Abstract Syntax Trees from Python source code. These abstract syntax trees have nodes corresponding to every syntactic element in the source.
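To make the idea of "nodes" a little more concrete, here is a minimal sketch that parses a made-up snippet and lists the type of each top-level node. The SAMPLE string and the names inside it are hypothetical and exist only for illustration.

```python
import ast

# Hypothetical sample source, used only for illustration.
SAMPLE = '''
import os

THRESHOLD = 10

def load(path):
    return open(path).read()

class Report:
    def render(self):
        return "report"
'''

tree = ast.parse(SAMPLE)

# Each top-level statement becomes one node in tree.body.
top_level = [(type(node).__name__, node.lineno) for node in tree.body]
for node_type, lineno in top_level:
    print(f"line {lineno}: {node_type}")
```

Function definitions show up as FunctionDef nodes, which is what the rest of the tool filters on.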

We can obtain an ast from a given python string by calling ast.parse. Combining this with the functions we already wrote to extract text from a file in a git branch, we can build the following function

import ast

def ast_from_git_ref(repo_path, relative_file_path, branch_name="master"):
    return ast.parse(source_code_from_git_ref(
        repo_path, relative_file_path, branch_name=branch_name
    ))

Now from this tree, we can filter out the function definitions and obtain a map from function names to the corresponding ast nodes like this

def function_names_to_ast_nodes_map(ast_tree):
    return {n.name: n for n in ast_tree.body if isinstance(n, ast.FunctionDef)}
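One thing worth knowing about this filter: it only looks at tree.body, so it picks up plain module-level functions and skips methods, nested functions and async functions. The sketch below shows the difference on a made-up snippet. The broader ast.walk variant is only an illustration of one possible alternative, not something the tool in this post uses.

```python
import ast

# Hypothetical sample source, used only for illustration.
SAMPLE = '''
def top_level():
    def nested():
        pass

async def fetch():
    pass

class View:
    def method(self):
        pass
'''

tree = ast.parse(SAMPLE)

# Same filter as function_names_to_ast_nodes_map: module-level `def` only.
top_level_only = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}

# A broader variant (an assumption, not what the tool does):
# walk the whole tree and also accept async functions.
all_defs = {
    n.name for n in ast.walk(tree)
    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
}

print(sorted(top_level_only))
print(sorted(all_defs))
```

For the refactoring described here, module-level functions were all that mattered, so the simpler filter was enough. Note that the broader variant would also make name clashes possible, for example two classes that each define a method with the same name.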

We could now compare two ast nodes to see whether they are the same or have been modified. But I quickly figured out that in practice it was better to get back the source code for these function definitions and compare the source strings themselves, since there are much better tools available for comparing blocks of text. The ast module has a get_source_segment function which lets us do this (it was added in Python 3.8, so it is not available in older versions).

So we can obtain a map from function names to source code blocks, given an ast tree and the source code it was parsed from, as follows

def function_names_to_source_segments_map(ast_tree, source_code):
    return {
        n.name: ast.get_source_segment(source_code, n) 
        for n in ast_tree.body if isinstance(n, ast.FunctionDef)
    }
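As a quick check of what get_source_segment actually returns, here is a small sketch on a made-up snippet (the SAMPLE string is hypothetical). Comments inside the function body are preserved because the segment is sliced out of the original text. Decorators are not part of the segment, since on Python 3.8+ a FunctionDef node starts at the def line.

```python
import ast

# Hypothetical sample source, used only for illustration.
SAMPLE = '''import functools

@functools.lru_cache()
def total(a, b):
    # add two numbers
    return a + b

x = 1
'''

tree = ast.parse(SAMPLE)
segments = {
    n.name: ast.get_source_segment(SAMPLE, n)
    for n in tree.body if isinstance(n, ast.FunctionDef)
}

print(segments["total"])
```

So a change that only touches a decorator would not show up as a difference with this approach.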

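Combining this with the previously defined utilities, we can obtain this map when given a reference to a file in a git repo as follows

def function_names_to_source_segments_map_from_git_ref(
        repo_path, relative_file_path, branch_name="master"):
    # Read the source once: it is needed both to build the ast
    # and to slice the function segments out of it.
    source_code = source_code_from_git_ref(
        repo_path, relative_file_path, branch_name=branch_name)
    return function_names_to_source_segments_map(
        ast.parse(source_code), source_code)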

The next step is to write a function which can accept references to two different files and then convert them into a function_name:source_code_block dictionary as above, and then extract only the common functions and their source blocks.

The code will look like this

def common_but_differing_functions_mapped_to_source_codes(repo_path, fileref1, fileref2):
    branch1, filepath1 = fileref1.split(":")
    branch2, filepath2 = fileref2.split(":")
    funcs_to_source_codes_map_1 = function_names_to_source_segments_map_from_git_ref(
        repo_path, filepath1, branch_name=branch1
    )
    funcs_to_source_codes_map_2 = function_names_to_source_segments_map_from_git_ref(
        repo_path, filepath2, branch_name=branch2
    )
    # dict key views support set operations, so `&` gives the names present in both files
    common_funcs = funcs_to_source_codes_map_1.keys() & funcs_to_source_codes_map_2.keys()
    return {
        fn: {
            fileref1: funcs_to_source_codes_map_1[fn],
            fileref2: funcs_to_source_codes_map_2[fn]
        }
        for fn in common_funcs
        if funcs_to_source_codes_map_1[fn] != funcs_to_source_codes_map_2[fn]
    }

The above function assumes that the files will be uniquely specified using a reference like this branch_name:./relative/filepath/from/repo/root
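The selection logic can be tried out without a git repo at all. The following is a minimal sketch that works on two in-memory source strings. The helper names function_sources and changed_common_functions, and the OLD and NEW snippets, are hypothetical and only mirror the idea of the function above.

```python
import ast

def function_sources(source_code):
    """Map each module-level function name to its source text."""
    tree = ast.parse(source_code)
    return {
        n.name: ast.get_source_segment(source_code, n)
        for n in tree.body if isinstance(n, ast.FunctionDef)
    }

def changed_common_functions(old_source, new_source):
    """Return {name: (old_text, new_text)} for functions in both inputs that differ."""
    old_funcs = function_sources(old_source)
    new_funcs = function_sources(new_source)
    # dict key views support set operations: `&` keeps names found in both.
    common = old_funcs.keys() & new_funcs.keys()
    return {
        name: (old_funcs[name], new_funcs[name])
        for name in sorted(common)
        if old_funcs[name] != new_funcs[name]
    }

# Hypothetical "before" and "after" versions of a module.
OLD = '''
def unchanged():
    return 1

def tweaked(x):
    return x + 1

def removed():
    pass
'''

NEW = '''
def added():
    pass

def tweaked(x):
    return x + 2

def unchanged():
    return 1
'''

changed = changed_common_functions(OLD, NEW)
print(list(changed))
```

Only tweaked is reported: unchanged is identical on both sides even though it moved within the file, and added and removed exist on one side only.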

Now we have everything needed to build the final consolidated function to compare function definitions from two different files in different branches. The only thing remaining is the tool to be used for the comparison. Again, some quick googling revealed that Python’s difflib module has more than adequate features for this. It offers different utility functions which provide the diffs in various formats. I found the unified_diff format to be the most suitable (plus it looks like the format used in the gitk GUI tool).
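To get a feel for the output, here is a minimal sketch of difflib.unified_diff on two versions of a made-up function. The file labels are the example paths used earlier in this post. The strings have no trailing newline, just like the text returned by ast.get_source_segment, so the sketch passes lineterm="" and joins the result with newlines when printing.

```python
import difflib

# Two versions of a hypothetical function, without a trailing newline.
old = "def tweaked(x):\n    return x + 1"
new = "def tweaked(x):\n    return x + 2"

diff_lines = list(difflib.unified_diff(
    old.splitlines(),
    new.splitlines(),
    fromfile="master:./app/views/databuddy.py",
    tofile="refactoring:./app/views/common.py",
    lineterm="",  # inputs are newline-free, so keep the output newline-free too
))

print("\n".join(diff_lines))
```

The first two lines of the result are the "---" and "+++" headers, followed by an "@@" hunk marker and the changed lines prefixed with "-" and "+".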

The consolidated function looks like this

import difflib

def print_diffs_map(diffs_map):
    for k, v in diffs_map.items():
        print(k)
        print("\n".join(v))
        print()

def diffs_of_common_functions(repo_path, fileref1, fileref2, print_diffs=False):
    funcs_sources_map = common_but_differing_functions_mapped_to_source_codes(
        repo_path, fileref1, fileref2)
    # Source segments have no trailing newline, so diff newline-free lines
    # (lineterm="") and add the line breaks back only when printing.
    result = {
        fn: list(difflib.unified_diff(
            sources[fileref1].splitlines(),
            sources[fileref2].splitlines(),
            fromfile=fileref1,
            tofile=fileref2,
            lineterm=""
        ))
        for fn, sources in funcs_sources_map.items()
    }
    if print_diffs:
        print_diffs_map(result)
    return result