We live in a world where businesses collect vast amounts of data daily. With so much data available, there comes a need to analyze data and get insights to be able to make sound decisions. Data Mining provides us with tools to unwrap useful knowledge from this data. It is the process of extracting information from large data sets and transforming it into an understandable format for further use.
Sequence Pattern Mining, or Sequential Pattern Mining, a subset of Data Mining, is the process of identifying frequently occurring ordered events or subsequences as patterns. It is highly useful for retail, telecommunications, and other businesses since it helps them detect sequential patterns for targeted marketing, customer retention, and many other tasks.
This article will cover several aspects of Sequence Mining, including the applications, types, algorithms, and challenges. Let’s get started.
What is Sequence Pattern Mining?
Sequence Pattern Mining is defined as follows:
Sequential Pattern Mining is a topic of Data Mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.
Thus, if you come across ordered data, and you extract patterns from the sequence, you are essentially doing Sequence Pattern Mining. The sequence need not have a notion of time, and therefore Sequence Pattern Mining is slightly different from Time-Series Mining.
When you are performing Pattern Mining, you are essentially:
- Finding frequently occurring patterns
- Comparing sequences
- Finding missing sequence items
- Building efficient indexes for sequence information
Sequence Pattern Mining helps companies to discover sequential patterns, and hence it finds several applications across many fields. These are discussed in the following section.
Applications of Sequence Pattern Mining
Before we discuss the technical side of things, let’s examine why it is worth studying Sequence Pattern Mining. As mentioned before, Sequence Pattern Mining finds applications in multiple fields ranging from science, business, and finance to meteorology and geology. A few Sequence Pattern Mining examples are listed below:
- Determination of buying patterns (“If a person bought product A, he is likely to purchase product B”)
- Stock trading (where else do people make huge bets on patterns than in the stock market?)
- Analyzing DNA and protein sequences in computational biology
- Studying website logs to identify a user’s online behavior
- Predicting natural disasters based on past indicative patterns.
- Studying telephone calling patterns
Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 150+ Data Sources straight into Data Warehouses, or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!
GET STARTED WITH HEVO FOR FREE
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
Types of Sequence Pattern Mining Problems
Sequence Pattern Mining can be broadly categorized into two types:
- String Mining: This is the subset of Sequence Pattern Mining that deals with text data in a sequence. The data can contain only a limited number of characters. For example, a DNA sequence contains only the letters ‘A’, ’T’, ’C’, and ’G’, and therefore analysis of the same falls within String Mining. Similarly, finding patterns in ASCII character sequences falls under String Mining.
- Itemset Mining: This is the broader subset of Sequence Pattern Mining that aims to find patterns in ordered datasets. Itemset Mining generally finds use in Marketing and Sales Applications (increasing co-purchases of items that are frequently brought together, cross-promoting products, managing inventory, setting price levels, and so on). A more detailed introduction to Itemset Mining, a subset of Sequence Pattern Mining, can be found here – An Introduction to Big Data Itemset Sequence Pattern Mining.
Sequence Pattern Mining Glossary
In this section, we’ll cover some important terms related to Sequence Pattern Mining that will help you understand Sequence Pattern Mining algorithms in the next section.
Subsequences and Supersequences
A sequence is an ordered list of items, like <a1, a2, …, an>. If there are two sequences A = <a1,a2,a3…,an> and B = <b1,b2,b3,…bm>, then A is a subsequence of B if there exist integers 1<= j1 < j2 <….< jn < m such that a1 ⊆ bj1, a2 ⊆ bj2,…, an ⊆ bjn. Thus, every element of A needs to be a subset of a unique element of B, and the order should be maintained. If A is a subsequence of B, then B is a supersequence of A.
Take the following example: A = <(abcd),(gh),(yz)> and B = <(abcd),(efgh),(lmn),(xyz)>. Here, as you can see, every element of A is a subset of at least one unique element of B (as indicated by the bold letters) and is in the correct order. Thus, A is the subsequence of B and B is the supersequence of A.
Sequence Database
A Sequence Pattern Mining Database is an ordered collection of elements or events. It is represented as a set of tuples <SID, S> where SID is the Sequence ID and S is the Sequence.
Minimum Support
The minimum number of times a sequence should appear in the dataset for it to be considered frequent.
Prefix and Suffix
If all elements of a sequence are in the alphabetical order, then a sequence E’ =<e1’,e2’,..,em’> is called a prefix of sequence E = <e1,e2,…en> if ei’ = ei for i <= m-1 and em’ is a subset of em.
Thus, given a sequence <a(abc)(ad)d(ef)>, its prefixes are <a>,<aa>, <a(ab)>, <a,(abc)> and so on. The part of the sequence after the prefix is called the suffix or postfix to the prefix.
Projection
If A and B are two sequences such that B is a subsequence of A, then a subsequence A’ of A is called a projection of A w.r.t. B if:
- A’ has the prefix B.
- There exists no proper super-sequence A’’ of A’ (i.e. A’’ not equal to A’ and A’ is a subsequence of A’’) such that A’’ also has prefix B.
Thus, the projection of <a(abc)(ac)d(ef)> w.r.t. <ab> is <(_c)(ac)d(ef)>. The projection of <a(abc)(ac)d(ef)> w.r.t. <ad> is <(ef)>.
Prefix Projected Database
If D is a database containing sequences S1, S2, etc., then a prefix projected database w.r.t. sequence A is a database containing the projections of A in each sequence. Thus, in the above image of the Sequence Pattern Mining Database, the prefix-projected database w.r.t prefix <a> is a database containing the following sequences:
<(abc)(ac)d(cf)>,<(_d)c(bc)(ae)>,<(_b)(df)(cb)>,<(_f)cbc>
Apriori Property of Sequential Patterns
This states that if a sequence S is not frequent, then any supersequence of S is also not frequent. Thus, if <(ab),c,e> is not frequent, then <(ab),(cd),e> is also not frequent, and <(ab),(cd),(ef)> is also not frequent.
Algorithms for Sequence Pattern Mining
In this section of Sequence Pattern Mining, we’ll take a broad look at some algorithms that are used in Sequence Pattern Mining. The purpose is to make you aware of and familiar with these algorithms.
GSP (Generalized Sequential Pattern Mining)
This Sequence Pattern Mining algorithm takes a bottom-up approach to find frequent patterns. Initially, every element is considered as a candidate of length 1. Based on the minimum support, frequent sequences of length 1 are identified.
Now, using Apriori Pruning (discarding supersequences of infrequent sequences of length 1), supersequences of length 2 are constructed as candidates. This process repeats till no more candidates or no frequent sequence can be found. Thus, this process outputs all the frequent sequences from the dataset, starting from length 1.
While this algorithm reduces the search space by Apriori Pruning, it still scans the database multiple times and can generate a large number of candidates if the minimum support is less.
SPADE (Sequential Pattern Discovery Using Equivalence Class)
This Sequence Pattern Mining algorithm identifies each element in each sequence in a dataset with a Sequence ID (SID) and the Element ID (EID). Candidates of length 1 are constructed, and the SIDs and EIDs of all elements where they occur are noted.
Candidates of higher length are constructed with the property that the Element IDs of all the elements in the candidate should be in increasing order. This algorithm also facilitates joins (for example, if a set of SIDs and EIDs are identified for candidates ab and ba, then SIDs and EIDs can be obtained for candidate aba through joins). This algorithm is perhaps best explained in this video.
Non-Apriori Based Algorithms
With the above Apriori-based algorithms, the database is scanned multiple times, and it becomes inefficient to use these for large datasets. In order to deal with these limitations, other algorithms like PrefixSpan, FreeSpan, CloSpan, and others were developed.
Going into the depth of each of these algorithms is beyond the scope of this article, and, frankly, these require semester courses at the university level for in-depth understanding. Let’s just spend some time on our last Sequence Pattern Mining algorithm, PrefixSpan.
PrefixSpan (Prefix-projected Sequential Pattern Mining)
In this algorithm, a divide-and-conquer framework gets used:
- First, the framework finds all length 1 sequential patterns (that qualify for minimum support).
- It then determines the prefix projected database for each of these sequential patterns.
- From the prefix-projected database, the framework evaluates all the length-2 sequential patterns having the same initial prefix.
- It then determines the prefix projected database for these length-2 sequential patterns.
- The framework repeats the same steps recursively till no more sequential patterns can be found.
In this algorithm, no candidate sequence needs to be generated. Another advantage is that the projected database keeps shrinking with each step. Constructing projected databases is the only major cost associated with this algorithm.
An example of PrefixSpan, along with a comparison with GSP and FreeSpan can be found here.
For more details on these non-Apriori algorithms, you can refer to the following resources:
- https://faculty.cc.gatech.edu/~hic/CS7616/pdf/lecture13.pdf
- http://hanj.cs.illinois.edu/pdf/span01.pdf
- https://www.cs.sfu.ca/~jpei/publications/freespan.pdf
- https://www.philippe-fournier-viger.com/spmf/CloSpan.php
Conclusion
In this piece, we obtained a general introduction to Sequence Pattern Mining and its applications. We understood the many forms of Sequence Pattern Mining, the terminologies related to Sequence Pattern Mining, and the popular Apriori-based algorithms. In the end, we provided references for non-Apriori-based algorithms, so you can feed your curiosity and explore more.
Hevo also allows the integration of data from non-native sources using Hevo’s in-built REST API & Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools.
Give Hevo a try and Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the unbeatable pricing, which will assist you in selecting the best plan for your requirements.
We hope you found this article helpful. Thanks for reading.
Yash is a trusted expert in the data industry, recognized for his role in driving online success for companies through data integration and analysis. With a Dual Degree (B. Tech + M. Tech) in Mechanical Engineering from IIT Bombay, Yash brings strategic vision and analytical rigor to every project. He is proficient in Python, R, TensorFlow, and Hadoop, using these tools to develop predictive models and optimize data workflows.