Association Rule Mining: the Apriori Algorithm
Introduction
In my recent work, I’ve been analyzing the correlation between various features in Android APKs. These features include IP characteristics, URL attributes, permission settings, and more. Feature correlation refers to identifying relationships between different features from the data, such as certain feature combinations that frequently appear together.
Relying solely on manual analysis would be impractical given the massive dataset—that’s when I remembered an algorithm from my data mining course: the Apriori algorithm! :)
Apriori algorithm
Motivation
The Apriori algorithm is an association rule mining algorithm used to identify the frequent itemsets in data. Frequent itemsets are groups of items that often appear together. From these frequent itemsets, we can then derive association rules.
Let’s use a concrete example to illustrate: Suppose you currently have many users’ shopping cart lists, each containing multiple items. The Apriori algorithm can help you discover whether there are associations between the items in different users’ shopping carts, such as which items frequently appear together in the same shopping cart.
Related metrics
Before exploring how the Apriori algorithm works, let’s understand these key metrics.
Support
$$ Support(A) = \frac{\texttt{Count(records contains item A) }}{\texttt{Count(records)}} $$
In probability terms, the support of $A$, $Support(A)$, is simply the marginal probability $P(A)$.
Confidence
$$ Confidence(A\rightarrow B)= \frac{\texttt{Count(records contains item A and item B)}}{\texttt{Count(records contains item A)}} $$
In probability terms, the confidence of $A\rightarrow B$ is simply the conditional probability $P(B|A)$. Thus, the association rule can be written as $A\rightarrow B$, which represents the probability of $B$ occurring given that $A$ has occurred.
Lift
$$ \begin{split} Lift(A\rightarrow B)&= \frac{\texttt{Count(records contains item A and item B)}}{\texttt{Count(records contains item A)}\times\texttt{Count(records contains item B)}}\times{\texttt{Count(records)}}\\\ &=\frac{Confidence(A\rightarrow B)}{Support(B)} \end{split} $$
The lift metric measures the strength of the association rule $A\rightarrow B$:
- If $A$ and $B$ are independent, then $Lift(A\rightarrow B)=1$
- If they are positively correlated, then $Lift(A\rightarrow B)>1$
- If they are negatively correlated, then $Lift(A\rightarrow B)<1$
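To make these definitions concrete, here is a minimal plain-Python sketch that computes all three metrics on the toy transactions used in the worked example later in this post (the helper names support, confidence, and lift are mine, not from any library):

# Toy transactions (the same ones used in the worked example later in this post).
transactions = [
    {"eggs", "bacon", "soup"},
    {"eggs", "bacon", "apple"},
    {"soup", "bacon", "banana"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B):
    """P(B | A): support of A and B together, divided by support of A."""
    return support(A | B) / support(A)

def lift(A, B):
    """Confidence of A -> B divided by support of B; 1 means independence."""
    return confidence(A, B) / support(B)

print(support({"bacon"}))               # 1.0: bacon appears in every transaction
print(confidence({"eggs"}, {"bacon"}))  # 1.0: every transaction with eggs also contains bacon
print(lift({"eggs"}, {"soup"}))         # 0.75: eggs and soup are negatively correlated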
The details
The Apriori algorithm consists of two phases.
Frequent itemsets mining
For simplicity, let’s use $k$-itemset to represent an itemset of size $k$.
Frequent itemsets mining presents a computational challenge due to the exponential number of potential itemsets. From set theory, we know that a set with size $n$ has $2^n-1$ non-empty subsets. This makes a brute-force approach computationally infeasible for large $n$. Can we do better than this?
The Apriori algorithm achieves efficiency by leveraging a key insight: all subsets of a frequent itemset must themselves be frequent. This property can be used to prune the search space: if an itemset is not frequent, then all of its supersets are also infrequent. Let’s call it the Apriori property :)
Let me show the details of the Apriori algorithm.
Input: the original data with $n$ distinct items and the minimal support $\epsilon_{support}$
Output: the frequent itemsets
Algorithm:
- Generate the $n$ candidate $1$-itemsets and calculate the support metric for each itemset.
- Check each candidate $1$-itemset and remove those with support below $\epsilon_{support}$, leaving only the frequent $1$-itemsets.
- Generate candidate $2$-itemsets by combining frequent $1$-itemsets (Note: any resulting itemset whose size is not $2$ should be ignored).
- (Apriori property) Check all $1$-itemset subsets of each candidate $2$-itemset, and remove any candidate that has a subset with support below $\epsilon_{support}$.
- Check each remaining candidate $2$-itemset, calculate its support, and remove those with support below $\epsilon_{support}$, leaving only the frequent $2$-itemsets.
- Generate candidate $3$-itemsets by combining frequent $2$-itemsets (Note: any resulting itemset whose size is not $3$ should be ignored).
- (Apriori property) Check all $2$-itemset subsets of each candidate $3$-itemset, and remove any candidate that has a subset with support below $\epsilon_{support}$.
- Check each remaining candidate $3$-itemset, calculate its support, and remove those with support below $\epsilon_{support}$, leaving only the frequent $3$-itemsets.
- …
You should see the pattern now: we always use the frequent $(k-1)$-itemsets to generate the candidate $k$-itemsets. Thus, if there are no frequent $(k-1)$-itemsets, the Apriori algorithm simply stops.
Let me draw a flowchart using Mermaid.
flowchart TD
    itemsets --> gen(Generate k-itemset candidates)
    gen --> prune1("Prune k-itemsets if ANY k-1-itemset < min_support")
    prune1 --> prune2(Prune k-itemsets if < min_support)
    prune2 --> freq(Get freq k-itemsets)
    freq --> |increase k| gen
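To make the loop concrete, here is a minimal, unoptimized Python sketch of this phase. It follows the steps above literally (the function name apriori_frequent_itemsets is my own, and this is not how a production library such as efficient-apriori implements it internally):

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return a dict mapping every frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    frequent = {
        frozenset([item]): c
        for item in items
        if (c := support_count(frozenset([item]))) >= min_support_count
    }
    result = dict(frequent)

    k = 2
    while frequent:
        # Generate candidate k-itemsets by combining frequent (k-1)-itemsets;
        # unions whose size is not k are ignored.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori property: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Keep only candidates whose own support clears the threshold.
        frequent = {c: cnt for c in candidates
                    if (cnt := support_count(c)) >= min_support_count}
        result.update(frequent)
        k += 1

    return result

Running it on the toy transactions from the example below with min_support_count=2 should reproduce the frequent itemsets derived there by hand.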
Association rule mining
Input: all the frequent itemsets we found in the last phase and the minimal confidence $\epsilon_{conf}$.
Algorithm:
- For each frequent itemset $I$
- Enumerate all non-empty proper subsets $S$ of $I$
- Generate the association rule $S\rightarrow I-S$ and check whether its confidence ($Support(I)/Support(S)$) is greater than or equal to $\epsilon_{conf}$ (see the sketch below)
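Here is a matching sketch of this phase in Python, reusing the itemset-to-support-count dictionary produced by the frequent-itemset function above (again, the function name is mine and this is only an illustration):

from itertools import combinations

def generate_rules(frequent_itemsets, min_confidence):
    """Return (lhs, rhs, confidence) triples that clear the confidence threshold.

    frequent_itemsets maps each frequent itemset (a frozenset) to its support count,
    e.g. the dictionary returned by apriori_frequent_itemsets above.
    """
    rules = []
    for itemset, count in frequent_itemsets.items():
        if len(itemset) < 2:
            continue
        # Enumerate every non-empty proper subset S of the itemset I.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                rhs = itemset - lhs
                # Confidence(S -> I - S) = Support(I) / Support(S).
                conf = count / frequent_itemsets[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

On the toy data below with a minimal confidence of 0.8, this should produce exactly the two rules derived by hand in the next section.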
Apriori Example
Let’s get our hands dirty by running the Apriori algorithm on a small example, which I borrowed from the official efficient-apriori documentation1.
Settings are shown below
- Minimal Support: 2
- Minimal Confidence: 0.8
The original data are
transactions = [('eggs', 'bacon', 'soup'),
('eggs', 'bacon', 'apple'),
('soup', 'bacon', 'banana')]
Let’s generate the $1$-itemsets first and calculate the support count for each itemset.
[
("apple", 1),
("bacon", 3),
("banana", 1),
("eggs", 2),
("soup", 2),
]
Since both apple and banana have support counts below 2, we remove them, leaving the final frequent $1$-itemsets.
[
("bacon", 3),
("eggs", 2),
("soup", 2),
]
Let’s combine the frequent $1$-itemsets and generate the candidate $2$-itemsets.
[
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
(("eggs", "soup"), 1),
]
Every $1$-itemset subset of each candidate $2$-itemset is frequent, so we only need to check the support of each candidate. Clearly, the support count of ("eggs", "soup") is below 2. After removing it, we get the final frequent $2$-itemsets.
[
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
]
Let’s generate the candidate $3$-itemsets in a similar way.
[
(("bacon", "eggs", "soup"), 1),
]
Among all its $2$-itemset subsets (("bacon", "eggs"), ("bacon", "soup"), and ("eggs", "soup")), ("eggs", "soup") has a support count below 2. Thus, by the Apriori property, ("bacon", "eggs", "soup") can’t be a frequent itemset.
Finally, we have the following frequent itemsets.
[
("bacon", 3),
("eggs", 2),
("soup", 2),
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
]
Let’s create some association rules now and check whether they meet our requirements (note that the minimal confidence is 0.8).
- For frequent $2$-itemset ("bacon", "eggs"), it may have these rules:
  - "bacon" -> "eggs" with confidence 2/3 = 0.67 ❌
  - "eggs" -> "bacon" with confidence 2/2 = 1 ✅
- For frequent $2$-itemset ("bacon", "soup"), it may have these rules:
  - "bacon" -> "soup" with confidence 2/3 = 0.67 ❌
  - "soup" -> "bacon" with confidence 2/2 = 1 ✅
Apriori algorithm Python library
I recommend using efficient-apriori over mlxtend, as it’s more performant.
The usage of this Python library can be demonstrated with the following snippet (note that the input should have type list[tuple]).
from efficient_apriori import apriori

transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
# min_support is a fraction: 0.5 of 3 transactions matches the support count of 2 used above
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)
print(itemsets)
print(rules)
Each rule has these properties:
- confidence
- lift
- lhs and rhs: the itemsets on the left-hand and right-hand side of the association rule
- count_lhs and count_rhs: the support counts of lhs and rhs
If you run this code snippet, you will get these frequent itemsets.
{1: {('eggs',): 2, ('bacon',): 3, ('soup',): 2},
2: {('bacon', 'eggs'): 2, ('bacon', 'soup'): 2}}
And the following association rules.
[{eggs} -> {bacon}, {soup} -> {bacon}]
The answer is consistent with our manual derivation.
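As a final illustration, here is one way to inspect the rule properties listed above and filter the rules (the filtering condition is just an example):

from efficient_apriori import apriori

transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)

for rule in rules:
    # lhs and rhs are tuples of items; confidence and lift are floats.
    print(rule.lhs, "->", rule.rhs,
          "| confidence:", rule.confidence,
          "| lift:", rule.lift)

# For example, keep only rules whose right-hand side is a single item.
single_rhs_rules = [rule for rule in rules if len(rule.rhs) == 1]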