Class Hierarchy Analysis: a quick way to generate call graph
The key to call graph construction
For an OOP programming language, the key challenge in call graph construction is handling the virtual call, as it may involve multiple target methods, as shown in the following table1.
Static Call | Special Call | Virtual Call | |
---|---|---|---|
Instruction | invokestatic |
invokespecial |
invokeinterface, invokevirtual |
Receiver Objects | ❌ | ✅ | ✅ |
Target Methods | Static Method | Constructor, Private Instance Method, Superclass Instance Method | Other Instance Method |
Count of Possible Target Methods | 1 | 1 | $\ge 1$ (polymorphism) |
Determinancy | Compile-time | Compile-time | Run-time |
The method call and method signature
Let’s take Java as an example; a method call may have this form.
A a = ...
a.foo()
The a.foo()
here is a method call and the receiver of the method call is a
which has the declaring type A
.
The method signature is our main concern, as it serves as the unique identifier. It can be broken into multiple parts2
- Class Type: the declaring type of receiver
- Subsignature
- Method Name
- Return Type
- Parameter Types
Note that I define method name, return type, and parameter types as Subsignature and I will explain why later.
For simplicity, let’s introduce a notation for method call2
<Class Type>: <Return Type> <Method Name>(<Parameter Types>)
Then the method call a.foo()
can be represented as A: foo()
.
The Motivations behind the Class Hierarchy Analysis
The core idea of the Class Hierarchy Analysis (CHA) algorithm is to deduce the possible target methods based on the declaring type of the receiver.
As its name suggests, CHA relies on the class hierarchy information and assumes that the receiver may point to its declaring type as well as all the subclasses.
Sometimes, the receiver’s declaring type is an interface, and CHA assumes that it may point to all classes that implement the interface.
CHA Algorithm
You only need to under two functions ($\texttt{Dispatch}$ and $\texttt{Resolve}$) in order to understand the CHA Algorithm
Dispatch function
Input:
- Current class type (denoted as $c$)
- The method signature we are considering (denoted as $m$)
Output: the target method being
Algorithm: Let’s consider the function call $\texttt{Dispatch}(c,m)$
- If there exists a non-abstract method with a method signature equal to $m$ (excluding class type, i.e., the subsignature mentioned above), then we have found the target method.
- Otherwise, call $\texttt{Dispatch}(c’,m)$, where $c’$ is the superclass of $c$
Wrap-up: Understanding $\texttt{Dispatch}$ is easy in my opinion, as it is the way we infer the target method manually.
Resolve function
You may simply regard the call site as an invoke statement (method call).
Input: call site
Output: all possible target methods for this call site
Algorithm
- Initialize an empty set $T$.
- Define the method signature of the call site as $m$.
- There are different cases:
- If $m$ is a static call, return ${ m }$.
- If $m$ is a special call, retrieve its declaring type $c^m$ and return $T= {\texttt{Dispatch}(c^m, m)}$.
- If $m$ is a virtual call, retrieve its declaring type $c$. For $c$ and all its subclass (both direct and indirect) $c’$, call $\texttt{Dispatch}(c’, m)$ and add all results to $T$.
Wrap-up: From the algorithm, we see that the CHA algorithm enumerates all possible cases when resolving a virtual call.
Call Graph Construction with CHA
Input: the entry method, such as main
in the Java.
Output: Call graph
Algorithm:
- Initialization
- $\texttt{WorkList} = {Entry\ Method}$
- $\texttt{CG} = \emptyset$, which is a edge set.
- $\texttt{RM} = \emptyset$, that is reached methods. It also serves as a hash set to check whether a method has been visited.
- While $\texttt{WorkList}$ is not empty:
- Get a method (denoted as $m$) from $\texttt{workList}$.
- If $m$ is not in $\texttt{RM}$:
- Add $m$ to $\texttt{RM}$.
- For each call site $cs$ in method $m$:
- $T=\texttt{Resolve}(cs)$
- For each method $m’$ in $T$
- Add the call edge $cs\rightarrow m’$ to $\texttt{CG}$.
- Add method $m’$ to $\texttt{WorkList}$ 里面
- Return $\texttt{CG}$
CHA Example. ArkAnalyzer
Recently, I was reading the source code of ArkAnalyzer3, and its call graph construction is based on the CHA algorithm. The following code provides an example of what the implementations look like. The full code is hosted here.
Part of the source code is omitted for brevity.
The method signature
The method signature corresponds to the class MethodSignature
export class MethodSignature {
private declaringClassSignature: ClassSignature;
private methodSubSignature: MethodSubSignature;
...
}
It has two fields: ClassSignature
and MethodSubSignature
.
export class ClassSignature {
private declaringFileSignature: FileSignature;
private className: string;
...
}
export class MethodSubSignature {
private methodName: string;
private parameters: MethodParameter[];
private returnType: Type;
private staticFlag: boolean;
}
If you combine ClassSignature
with MethodSubSignature
then you would get the full method signature we mentioned above.
The entry method of Call Graph Construction
The entry method of call graph generation is makeCallGraphCHA
, whose parameters are a method signature array.
public makeCallGraphCHA(entryPoints: MethodSignature[]): CallGraph {
let callGraph = new CallGraph(this);
let callGraphBuilder = new CallGraphBuilder(callGraph, this);
callGraphBuilder.buildClassHierarchyCallGraph(entryPoints);
return callGraph;
}
public buildClassHierarchyCallGraph(entries: Method[], displayGeneratedMethod: boolean = false): void {
...
let classHierarchyAnalysis: ClassHierarchyAnalysis =
new ClassHierarchyAnalysis(this.scene, this.cg);
classHierarchyAnalysis.start(displayGeneratedMethod);
}
CHA implementation
Let’s take a closer look at the start
method.
protected init(): void {
this.processedMethod = new Set();
this.cg.getEntries().forEach((entryFunc) => {
this.workList.push(entryFunc);
})
}
public start(displayGeneratedMethod: boolean): void {
this.init();
while (this.workList.length !== 0) {
const method = this.workList.shift() as FuncID;
const cgNode = this.cg.getNode(method) as CallGraphNode;
if (this.processedMethod.has(method) || cgNode.isSdkMethod()) {
continue;
}
...
this.processMethod(method).forEach((cs: CallSite) => {
let me = this.cg.getArkMethodByFuncID(cs.calleeFuncID);
this.addCallGraphEdge(method, me, cs, displayGeneratedMethod);
if (!this.processedMethod.has(cs.calleeFuncID)) {
this.workList.push(cs.calleeFuncID);
this.processedMethod.add(cs.callerFuncID);
}
})
}
}
It can be seen from the code above that the code is indeed the CHA algorithm’s implementation.
Dispatch + Resolve
In the ArkAnalyzer, the logics of $\texttt{Dispatch}$ and $\texttt{Resolve}$ are put in the resolvecall
method, which is called by the processMethod
method.
protected processMethod(methodID: FuncID): CallSite[] {
...
cfg.getStmts().forEach((stmt) => {
if (stmt.containsInvokeExpr()) {
this.resolveCall(cgNode.getID(), stmt).forEach(stmt => calleeMethods.push(stmt));
}
})
return calleeMethods;
}
The source code of resolveCall
is
public resolveCall(callerMethod: NodeID, invokeStmt: Stmt): CallSite[] {
let invokeExpr = invokeStmt.getInvokeExpr();
let resolveResult: CallSite[] = [];
if (!invokeExpr) {
return [];
}
// process anonymous method call
this.getParamAnonymousMethod(invokeExpr).forEach(method => {
resolveResult.push(
new CallSite(
invokeStmt,
undefined,
this.cg.getCallGraphNodeByMethod(method).getID(),
callerMethod,
)
);
});
// 返回方法
let calleeMethod = this.resolveInvokeExpr(invokeExpr);
if (!calleeMethod) {
return resolveResult;
}
if (invokeExpr instanceof ArkStaticInvokeExpr) {
// NOTE: Case 1. Static Method Call
resolveResult.push(
new CallSite(
invokeStmt,
undefined,
this.cg.getCallGraphNodeByMethod(calleeMethod!.getSignature()).getID(),
callerMethod!
)
);
} else {
let declareClass = calleeMethod.getDeclaringArkClass();
// NOTE: Case 2. Virtual Call
this.getClassHierarchy(declareClass).forEach((arkClass: ArkClass) => {
if (arkClass.isAbstract()) {
return;
}
let possibleCalleeMethod = arkClass.getMethodWithName(calleeMethod!.getName());
if (possibleCalleeMethod && possibleCalleeMethod.isGenerated() &&
arkClass.getSignature().toString() !== declareClass.getSignature().toString()) {
// remove the generated method in extended classes
return;
}
if (possibleCalleeMethod && !possibleCalleeMethod.isAbstract()) {
resolveResult.push(
new CallSite(
invokeStmt,
undefined,
this.cg.getCallGraphNodeByMethod(possibleCalleeMethod.getSignature()).getID(),
callerMethod
)
);
}
});
}
return resolveResult;
}
Compared to the standard CHA algorithm, it seems that ArkAnalyzer’s implementation takes all non-static method calls as virtual calls.
Wrap-up
That’s all for the CHA algorithm. It’s simple and fast to compute. However, it is not very precise and usually has false positives. For example,
A a = new B();
a.foo();
The CHA algorithm would report both A
and B
are possible target methods for foo
, as it doesn’t take new B()
into account. This drawback can be remedied by pointer analysis, which I will discuss in the future :0