Class Hierarchy Analysis: a quick way to generate call graph

For an OOP programming language, the key challenge in call graph construction is handling the virtual call, as it may involve multiple target methods, as shown in the following table1.

Static Call Special Call Virtual Call
Instruction invokestatic invokespecial invokeinterface, invokevirtual
Receiver Objects
Target Methods Static Method Constructor, Private Instance Method, Superclass Instance Method Other Instance Method
Count of Possible Target Methods 1 1 $\ge 1$ (polymorphism)
Determinancy Compile-time Compile-time Run-time

Let’s take Java as an example; a method call may have this form.

java

A a = ...
a.foo()

The a.foo() here is a method call and the receiver of the method call is a which has the declaring type A.

The method signature is our main concern, as it serves as the unique identifier. It can be broken into multiple parts2

  • Class Type: the declaring type of receiver
  • Subsignature
    • Method Name
    • Return Type
    • Parameter Types

Note that I define method name, return type, and parameter types as Subsignature and I will explain why later.

For simplicity, let’s introduce a notation for method call2

java

<Class Type>: <Return Type> <Method Name>(<Parameter Types>)

Then the method call a.foo() can be represented as A: foo().

The core idea of the Class Hierarchy Analysis (CHA) algorithm is to deduce the possible target methods based on the declaring type of the receiver.

As its name suggests, CHA relies on the class hierarchy information and assumes that the receiver may point to its declaring type as well as all the subclasses.

Tip

Sometimes, the receiver’s declaring type is an interface, and CHA assumes that it may point to all classes that implement the interface.

You only need to under two functions ($\texttt{Dispatch}$ and $\texttt{Resolve}$) in order to understand the CHA Algorithm

Input:

  • Current class type (denoted as $c$)
  • The method signature we are considering (denoted as $m$)

Output: the target method being

Algorithm: Let’s consider the function call $\texttt{Dispatch}(c,m)$

  • If there exists a non-abstract method with a method signature equal to $m$ (excluding class type, i.e., the subsignature mentioned above), then we have found the target method.
  • Otherwise, call $\texttt{Dispatch}(c’,m)$, where $c’$ is the superclass of $c$

Wrap-up: Understanding $\texttt{Dispatch}$ is easy in my opinion, as it is the way we infer the target method manually.

Tip

You may simply regard the call site as an invoke statement (method call).

Input: call site

Output: all possible target methods for this call site

Algorithm

  • Initialize an empty set $T$.
  • Define the method signature of the call site as $m$.
  • There are different cases:
    • If $m$ is a static call, return ${ m }$.
    • If $m$ is a special call, retrieve its declaring type $c^m$ and return $T= {\texttt{Dispatch}(c^m, m)}$.
    • If $m$ is a virtual call, retrieve its declaring type $c$. For $c$ and all its subclass (both direct and indirect) $c’$, call $\texttt{Dispatch}(c’, m)$ and add all results to $T$.

Wrap-up: From the algorithm, we see that the CHA algorithm enumerates all possible cases when resolving a virtual call.

Input: the entry method, such as main in the Java.

Output: Call graph

Algorithm:

  • Initialization
    • $\texttt{WorkList} = {Entry\ Method}$
    • $\texttt{CG} = \emptyset$, which is a edge set.
    • $\texttt{RM} = \emptyset$, that is reached methods. It also serves as a hash set to check whether a method has been visited.
  • While $\texttt{WorkList}$ is not empty:
    • Get a method (denoted as $m$) from $\texttt{workList}$.
    • If $m$ is not in $\texttt{RM}$:
      • Add $m$ to $\texttt{RM}$.
      • For each call site $cs$ in method $m$:
        • $T=\texttt{Resolve}(cs)$
        • For each method $m’$ in $T$
          • Add the call edge $cs\rightarrow m’$ to $\texttt{CG}$.
          • Add method $m’$ to $\texttt{WorkList}$ 里面
  • Return $\texttt{CG}$

Recently, I was reading the source code of ArkAnalyzer3, and its call graph construction is based on the CHA algorithm. The following code provides an example of what the implementations look like. The full code is hosted here.

Tip

Part of the source code is omitted for brevity.

The method signature corresponds to the class MethodSignature

ts

export class MethodSignature {
    private declaringClassSignature: ClassSignature;
    private methodSubSignature: MethodSubSignature;
    ...
}

It has two fields: ClassSignature and MethodSubSignature.

ts

export class ClassSignature {
    private declaringFileSignature: FileSignature;
    private className: string;
    ...
}

export class MethodSubSignature {
    private methodName: string;
    private parameters: MethodParameter[];
    private returnType: Type;
    private staticFlag: boolean;
}

If you combine ClassSignature with MethodSubSignature then you would get the full method signature we mentioned above.

The entry method of Call Graph Construction

The entry method of call graph generation is makeCallGraphCHA, whose parameters are a method signature array.

ts

public makeCallGraphCHA(entryPoints: MethodSignature[]): CallGraph {
    let callGraph = new CallGraph(this);
    let callGraphBuilder = new CallGraphBuilder(callGraph, this);
    callGraphBuilder.buildClassHierarchyCallGraph(entryPoints);
    return callGraph;
}

public buildClassHierarchyCallGraph(entries: Method[], displayGeneratedMethod: boolean = false): void {
    ...
    let classHierarchyAnalysis: ClassHierarchyAnalysis = 
        new ClassHierarchyAnalysis(this.scene, this.cg);
    classHierarchyAnalysis.start(displayGeneratedMethod);
}

Let’s take a closer look at the start method.

ts

protected init(): void {
    this.processedMethod = new Set();
    this.cg.getEntries().forEach((entryFunc) => {
        this.workList.push(entryFunc);
    })
}

public start(displayGeneratedMethod: boolean): void {
    this.init();
    while (this.workList.length !== 0) {
        const method = this.workList.shift() as FuncID;
        const cgNode = this.cg.getNode(method) as CallGraphNode;

        if (this.processedMethod.has(method) || cgNode.isSdkMethod()) {
            continue;
        }

        ...
        
        this.processMethod(method).forEach((cs: CallSite) => {
            let me = this.cg.getArkMethodByFuncID(cs.calleeFuncID);

            this.addCallGraphEdge(method, me, cs, displayGeneratedMethod);

            if (!this.processedMethod.has(cs.calleeFuncID)) {
                this.workList.push(cs.calleeFuncID);
                this.processedMethod.add(cs.callerFuncID);
            }
        })
    }
}

It can be seen from the code above that the code is indeed the CHA algorithm’s implementation.

In the ArkAnalyzer, the logics of $\texttt{Dispatch}$ and $\texttt{Resolve}$ are put in the resolvecall method, which is called by the processMethod method.

ts

protected processMethod(methodID: FuncID): CallSite[] {
    ...
    cfg.getStmts().forEach((stmt) => {
        if (stmt.containsInvokeExpr()) {
            this.resolveCall(cgNode.getID(), stmt).forEach(stmt => calleeMethods.push(stmt));
        }
    })

    return calleeMethods;
}

The source code of resolveCall is

ts

public resolveCall(callerMethod: NodeID, invokeStmt: Stmt): CallSite[] {
    let invokeExpr = invokeStmt.getInvokeExpr();
    let resolveResult: CallSite[] = [];

    if (!invokeExpr) {
        return [];
    }

    // process anonymous method call
    this.getParamAnonymousMethod(invokeExpr).forEach(method => {
        resolveResult.push(
            new CallSite(
                invokeStmt,
                undefined,
                this.cg.getCallGraphNodeByMethod(method).getID(),
                callerMethod,
            )
        );
    });

    // 返回方法
    let calleeMethod = this.resolveInvokeExpr(invokeExpr);
    if (!calleeMethod) {
        return resolveResult;
    }

    if (invokeExpr instanceof ArkStaticInvokeExpr) {
        // NOTE: Case 1. Static Method Call
        resolveResult.push(
            new CallSite(
                invokeStmt,
                undefined,
                this.cg.getCallGraphNodeByMethod(calleeMethod!.getSignature()).getID(),
                callerMethod!
            )
        );
    } else {
        let declareClass = calleeMethod.getDeclaringArkClass();
        // NOTE: Case 2. Virtual Call
        this.getClassHierarchy(declareClass).forEach((arkClass: ArkClass) => {
            if (arkClass.isAbstract()) {
                return;
            }

            let possibleCalleeMethod = arkClass.getMethodWithName(calleeMethod!.getName());

            if (possibleCalleeMethod && possibleCalleeMethod.isGenerated() &&
                arkClass.getSignature().toString() !== declareClass.getSignature().toString()) {
                // remove the generated method in extended classes
                return;
            }

            if (possibleCalleeMethod && !possibleCalleeMethod.isAbstract()) {
                resolveResult.push(
                    new CallSite(
                        invokeStmt,
                        undefined,
                        this.cg.getCallGraphNodeByMethod(possibleCalleeMethod.getSignature()).getID(),
                        callerMethod
                    )
                );
            }
        });
    }

    return resolveResult;
}

Compared to the standard CHA algorithm, it seems that ArkAnalyzer’s implementation takes all non-static method calls as virtual calls.

That’s all for the CHA algorithm. It’s simple and fast to compute. However, it is not very precise and usually has false positives. For example,

java

A a = new B();
a.foo();

The CHA algorithm would report both A and B are possible target methods for foo, as it doesn’t take new B() into account. This drawback can be remedied by pointer analysis, which I will discuss in the future :0