Association Rule Mining: the Apriori Algorithm
Introduction
In my recent work, I’ve been analyzing the correlation between various features in Android APKs. These features include IP characteristics, URL attributes, permission settings, and more. Feature correlation refers to identifying relationships between different features from the data, such as certain feature combinations that frequently appear together.
Relying solely on manual analysis would be impractical given the massive dataset—that’s when I remembered an algorithm from my data mining course: the Apriori algorithm! :)
Apriori algorithm
Motivation
The Apriori algorithm is an association rule mining algorithm used to identify the frequent itemsets in data. Frequent itemsets are groups of items that often appear together. From these frequent itemsets, we can then derive association rules.
Let’s use a concrete example to illustrate: Suppose you currently have many users’ shopping cart lists, each containing multiple items. The Apriori algorithm can help you discover whether there are associations between the items in different users’ shopping carts, such as which items frequently appear together in the same shopping cart.
Related metrics
Before exploring how the Apriori algorithm works, let’s understand these key metrics.
Support
$$ Support(A) = \frac{\texttt{Count(records contains item A) }}{\texttt{Count(records)}} $$
In probability terms, the support of $A$, $Support(A)$, is simply the marginal probability $P(A)$.
Confidence
$$ Confidence(A\rightarrow B)= \frac{\texttt{Count(records contains item A and item B)}}{\texttt{Count(records contains item A)}} $$
In probability terms, the confidence of $A\rightarrow B$ is simply the conditional probability $P(B|A)$. Thus, the association rule can be written as $A\rightarrow B$, which represents the probability of $B$ occurring given that $A$ has occurred.
Lift
$$ \begin{split} Lift(A\rightarrow B)&= \frac{\texttt{Count(records contains item A and item B)}}{\texttt{Count(records contains item A)}\times\texttt{Count(records contains item B)}}\times{\texttt{Count(records)}}\\\ &=\frac{Confidence(A\rightarrow B)}{Support(B)} \end{split} $$
The lift metric measures the strength of the association rule $A\rightarrow B$:
- If $A$ and $B$ are independent, then $Lift(A\rightarrow B)=1$
- If they are positively correlated, then $Lift(A\rightarrow B)>1$
- If they are negatively correlated, then $Lift(A\rightarrow B)<1$
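To make these definitions concrete, here is a minimal plain-Python sketch that computes all three metrics on the toy transactions used in the worked example later in this post (the helper names support, confidence, and lift are mine, not from any library):

# Toy transactions (the same ones used in the worked example later in this post).
transactions = [
    {"eggs", "bacon", "soup"},
    {"eggs", "bacon", "apple"},
    {"soup", "bacon", "banana"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B):
    """P(B | A): support of A and B together, divided by support of A."""
    return support(A | B) / support(A)

def lift(A, B):
    """Confidence of A -> B divided by support of B; 1 means independence."""
    return confidence(A, B) / support(B)

print(support({"bacon"}))               # 1.0: bacon appears in every transaction
print(confidence({"eggs"}, {"bacon"}))  # 1.0: every transaction with eggs also contains bacon
print(lift({"eggs"}, {"soup"}))         # 0.75: eggs and soup are negatively correlated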
The details
The Apriori algorithm consists of two phases.
Frequent itemsets mining
For simplicity, let’s use $k$-itemset to represent an itemset of size $k$.
Frequent itemsets mining presents a computational challenge due to the exponential number of potential itemsets. From set theory, we know that a set with size $n$ has $2^n-1$ non-empty subsets. This makes a brute-force approach computationally infeasible for large $n$. Can we do better than this?
The Apriori algorithm achieves efficiency by leveraging a key insight: all subsets of a frequent itemset must themselves be frequent. This property can be used to prune the search space: if an itemset is not frequent, then all of its supersets are also infrequent. Let’s call it the Apriori property :)
Let me show the details of the Apriori algorithm.
Input: the original data with $n$ distinct items and the minimal support $\epsilon_{support}$
Output: the frequent itemsets
Algorithm:
- Generate the $n$ candidate $1$-itemsets and calculate the support metric for each itemset.
- Check each candidate $1$-itemset and remove those with support below $\epsilon_{support}$, leaving only the frequent $1$-itemsets.
- Generate candidate $2$-itemsets by combining frequent $1$-itemsets (Note: any resulting itemset whose size is not $2$ should be ignored).
- (Apriori property) Check all $1$-itemset subsets of each candidate $2$-itemset, and remove any candidate that has a subset with support below $\epsilon_{support}$.
- Check each remaining candidate $2$-itemset, calculate its support, and remove those with support below $\epsilon_{support}$, leaving only the frequent $2$-itemsets.
- Generate candidate $3$-itemsets by combining frequent $2$-itemsets (Note: any resulting itemset whose size is not $3$ should be ignored).
- (Apriori property) Check all $2$-itemset subsets of each candidate $3$-itemset, and remove any candidate that has a subset with support below $\epsilon_{support}$.
- Check each remaining candidate $3$-itemset, calculate its support, and remove those with support below $\epsilon_{support}$, leaving only the frequent $3$-itemsets.
- …
You should see the pattern now: we always use the frequent $(k-1)$-itemsets to generate the candidate $k$-itemsets. Thus, if there are no frequent $(k-1)$-itemsets, the Apriori algorithm simply stops.
Let me draw a flowchart using Mermaid.
flowchart TD
    itemsets --> gen(Generate k-itemset candidates)
    gen --> prune1("Prune k-itemsets if ANY k-1-itemset < min_support")
    prune1 --> prune2(Prune k-itemsets if < min_support)
    prune2 --> freq(Get freq k-itemsets)
    freq --> |increase k| gen
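To make the loop concrete, here is a minimal, unoptimized Python sketch of this phase. It follows the steps above literally (the function name apriori_frequent_itemsets is my own, and this is not how a production library such as efficient-apriori implements it internally):

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support_count):
    """Return a dict mapping every frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    frequent = {
        frozenset([item]): c
        for item in items
        if (c := support_count(frozenset([item]))) >= min_support_count
    }
    result = dict(frequent)

    k = 2
    while frequent:
        # Generate candidate k-itemsets by combining frequent (k-1)-itemsets;
        # unions whose size is not k are ignored.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori property: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Keep only candidates whose own support clears the threshold.
        frequent = {c: cnt for c in candidates
                    if (cnt := support_count(c)) >= min_support_count}
        result.update(frequent)
        k += 1

    return result

Running it on the toy transactions from the example below with min_support_count=2 should reproduce the frequent itemsets derived there by hand.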
Association rule mining
Input: all the frequent itemsets we found in the last phase and the minimal confidence $\epsilon_{conf}$.
Algorithm:
- For each frequent itemset $I$
- Enumerate all non-empty proper subsets $S$ of $I$
- Generate the association rule $S\rightarrow I-S$ and check whether its confidence ($Support(I)/Support(S)$) is greater than or equal to $\epsilon_{conf}$ (see the sketch below)
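Here is a matching sketch of this phase in Python, reusing the itemset-to-support-count dictionary produced by the frequent-itemset function above (again, the function name is mine and this is only an illustration):

from itertools import combinations

def generate_rules(frequent_itemsets, min_confidence):
    """Return (lhs, rhs, confidence) triples that clear the confidence threshold.

    frequent_itemsets maps each frequent itemset (a frozenset) to its support count,
    e.g. the dictionary returned by apriori_frequent_itemsets above.
    """
    rules = []
    for itemset, count in frequent_itemsets.items():
        if len(itemset) < 2:
            continue
        # Enumerate every non-empty proper subset S of the itemset I.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                rhs = itemset - lhs
                # Confidence(S -> I - S) = Support(I) / Support(S).
                conf = count / frequent_itemsets[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(rhs), conf))
    return rules

On the toy data below with a minimal confidence of 0.8, this should produce exactly the two rules derived by hand in the next section.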
Apriori Example
Let’s get our hands dirty by running the Apriori algorithm on a small example, which I borrowed from the official efficient-apriori documentation1.
Settings are shown below
- Minimal Support: 2
- Minimal Confidence: 0.8
The original data are
transactions = [('eggs', 'bacon', 'soup'),
('eggs', 'bacon', 'apple'),
('soup', 'bacon', 'banana')]
Let’s generate the $1$-itemsets first and calculate the support count for each itemset.
[
("apple", 1),
("bacon", 3),
("banana", 1),
("eggs", 2),
("soup", 2),
]
Since both apple and banana have support counts below 2, we remove them, leaving the final frequent $1$-itemsets.
[
("bacon", 3),
("eggs", 2),
("soup", 2),
]
Let’s combine the frequent $1$-itemsets and generate the candidate $2$-itemsets.
[
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
(("eggs", "soup"), 1),
]
Every $1$-itemset subset of each candidate $2$-itemset is frequent, so we only need to check the support of each candidate. Clearly, the support count of ("eggs", "soup") is below 2. After removing it, we get the final frequent $2$-itemsets.
[
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
]
Let’s generate the candidate $3$-itemsets in a similar way.
[
(("bacon", "eggs", "soup"), 1),
]
Among all its $2$-itemset subsets (("bacon", "eggs"), ("bacon", "soup"), and ("eggs", "soup")), ("eggs", "soup") has a support count below 2. Thus, by the Apriori property, ("bacon", "eggs", "soup") can’t be a frequent itemset.
Finally, we have the following frequent itemsets.
[
("bacon", 3),
("eggs", 2),
("soup", 2),
(("bacon", "eggs"), 2),
(("bacon", "soup"), 2),
]
Let’s create some association rules now and check whether they meet our requirements (note that the minimal confidence is 0.8).
- For frequent $2$-itemset ("bacon", "eggs"), it may have these rules:
  - "bacon" -> "eggs" with confidence 2/3 = 0.67 ❌
  - "eggs" -> "bacon" with confidence 2/2 = 1 ✅
- For frequent $2$-itemset ("bacon", "soup"), it may have these rules:
  - "bacon" -> "soup" with confidence 2/3 = 0.67 ❌
  - "soup" -> "bacon" with confidence 2/2 = 1 ✅
Apriori algorithm Python library
I recommend using efficient-apriori over mlxtend, as it’s more performant.
The usage of this Python library can be demonstrated with the following snippet (note that the input should have type list[tuple]).
from efficient_apriori import apriori

transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
# min_support is a fraction: 0.5 of 3 transactions matches the support count of 2 used above
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)
print(itemsets)
print(rules)
Each rule has these properties:
- confidence
- lift
- lhs and rhs: the itemsets on the left-hand and right-hand side of the association rule
- count_lhs and count_rhs: the support counts of lhs and rhs
If you run this code snippet, you will get these frequent itemsets.
{1: {('eggs',): 2, ('bacon',): 3, ('soup',): 2},
2: {('bacon', 'eggs'): 2, ('bacon', 'soup'): 2}}
And the following association rules.
[{eggs} -> {bacon}, {soup} -> {bacon}]
The answer is consistent with our manual derivation.
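As a final illustration, here is one way to inspect the rule properties listed above and filter the rules (the filtering condition is just an example):

from efficient_apriori import apriori

transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)

for rule in rules:
    # lhs and rhs are tuples of items; confidence and lift are floats.
    print(rule.lhs, "->", rule.rhs,
          "| confidence:", rule.confidence,
          "| lift:", rule.lift)

# For example, keep only rules whose right-hand side is a single item.
single_rhs_rules = [rule for rule in rules if len(rule.rhs) == 1]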