We live in a world where businesses collect vast amounts of data daily. With so much data available, there comes a need to analyze data and get insights to be able to make sound decisions. Data Mining provides us with tools to unwrap useful knowledge from this data. It is the process of extracting information from large data sets and transforming it into an understandable format for further use.
Sequence Pattern Mining, a subset of Data Mining, is the process of identifying frequently occurring ordered events or subsequences as patterns. It is highly useful for retail, telecommunications, and other businesses since it helps them detect sequential patterns for targeted marketing, customer retention, and many other tasks.
This article will cover several aspects of Sequence Pattern Mining, including the applications, types, algorithms, and challenges. Let’s get started.
Table of Contents
- What is Sequence Pattern Mining?
- Applications of Sequence Pattern Mining
- Types of Sequence Pattern Mining Problems
- Sequence Pattern Mining Glossary
- Algorithms for Sequence Pattern Mining
What is Sequence Pattern Mining?
Sequence Pattern Mining is defined as follows:
Sequential Pattern Mining is a topic of Data Mining concerned with finding statistically relevant patterns between data examples where the values are delivered in a sequence.
Thus, if you come across ordered data, and you extract patterns from the sequence, you are essentially doing Sequence Pattern Mining. The sequence need not have a notion of time, and therefore Sequence Pattern Mining is slightly different from Time-Series Mining.
When you are performing Sequence Pattern Mining, you are essentially:
- Finding frequently occurring patterns
- Comparing sequences
- Finding missing sequence items
- Building efficient indexes for sequence information
Sequence Pattern Mining helps companies to discover sequential patterns, and hence it finds several applications across many fields. These are discussed in the following section.
Applications of Sequence Pattern Mining
Before we discuss the technical side of things, let’s examine why it is worth studying Sequence Pattern Mining. As mentioned before, Sequence Pattern Mining finds applications in multiple fields ranging from science, business, and finance to meteorology and geology. Some of them are listed below:
- Determination of buying patterns (“If a person bought product A, he is likely to purchase product B”)
- Stock trading (where else do people make huge bets on patterns than in the stock market?)
- Analyzing DNA and protein sequences in computational biology
- Studying website logs to identify a user’s online behavior
- Predicting natural disasters based on past indicative patterns.
- Studying telephone calling patterns
Simplify ETL Using Hevo’s No-Code Data Pipeline
Hevo Data, a Fully-managed Data Pipeline platform, can help you automate, simplify & enrich your data replication process in a few clicks. With Hevo’s wide variety of connectors and blazing-fast Data Pipelines, you can extract & load data from 100+ Data Sources straight into Data Warehouses, or any Databases. To further streamline and prepare your data for analysis, you can process and enrich raw granular data using Hevo’s robust & built-in Transformation Layer without writing a single line of code!GET STARTED WITH HEVO FOR FREE
Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!
Types of Sequence Pattern Mining Problems
Sequence Pattern Mining can be broadly categorized into two types:
- String Mining: This is the subset of Sequence Pattern Mining that deals with text data in a sequence. The data can contain only a limited number of characters. For example, a DNA sequence contains only the letters ‘A’, ’T’, ’C’, and ’G’, and therefore analysis of the same falls within String Mining. Similarly, finding patterns in ASCII character sequences falls under String Mining.
- Itemset Mining: This is the broader subset of Sequence Pattern Mining that aims to find patterns in ordered datasets. Itemset Mining generally finds use in Marketing and Sales Applications (increasing co-purchases of items that are frequently brought together, cross-promoting products, managing inventory, setting price levels, and so on). A more detailed introduction to Itemset Mining, a subset of Sequence Pattern Mining, can be found here – An Introduction to Big Data Itemset Sequence Pattern Mining.
Sequence Pattern Mining Glossary
- Subsequences and Supersequences
- Sequence Database
- Minimum Support
- Prefix and Suffix
- Prefix Projected Database
- Apriori Property of Sequential Patterns
In this section, we’ll cover some important terms related to Sequence Pattern Mining that will help you understand Sequence Pattern Mining algorithms in the next section.
Subsequences and Supersequences
A sequence is an ordered list of items, like <a1, a2, …, an>. If there are two sequences A = <a1,a2,a3…,an> and B = <b1,b2,b3,…bm>, then A is a subsequence of B if there exist integers 1<= j1 < j2 <….< jn < m such that a1 ⊆ bj1, a2 ⊆ bj2,…, an ⊆ bjn. Thus, every element of A needs to be a subset of a unique element of B, and the order should be maintained. If A is a subsequence of B, then B is a supersequence of A.
Take the following example: A = <(abcd),(gh),(yz)> and B = <(abcd),(efgh),(lmn),(xyz)>. Here, as you can see, every element of A is a subset of at least one unique element of B (as indicated by the bold letters) and is in the correct order. Thus, A is the subsequence of B and B is the supersequence of A.
A Sequence Pattern Mining Database is an ordered collection of elements or events. It is represented as a set of tuples <SID, S> where SID is the Sequence ID and S is the Sequence.
The minimum number of times a sequence should appear in the dataset for it to be considered frequent.
Prefix and Suffix
If all elements of a sequence are in the alphabetical order, then a sequence E’ =<e1’,e2’,..,em’> is called a prefix of sequence E = <e1,e2,…en> if ei’ = ei for i <= m-1 and em’ is a subset of em.
Thus, given a sequence <a(abc)(ad)d(ef)>, its prefixes are <a>,<aa>, <a(ab)>, <a,(abc)> and so on. The part of the sequence after the prefix is called the suffix or postfix to the prefix.
If A and B are two sequences such that B is a subsequence of A, then a subsequence A’ of A is called a projection of A w.r.t. B if:
- A’ has the prefix B.
- There exists no proper super-sequence A’’ of A’ (i.e. A’’ not equal to A’ and A’ is a subsequence of A’’) such that A’’ also has prefix B.
Thus, the projection of <a(abc)(ac)d(ef)> w.r.t. <ab> is <(_c)(ac)d(ef)>. The projection of <a(abc)(ac)d(ef)> w.r.t. <ad> is <(ef)>.
Prefix Projected Database
If D is a database containing sequences S1, S2, etc., then a prefix projected database w.r.t. sequence A is a database containing the projections of A in each sequence. Thus, in the above image of the Sequence Pattern Mining Database, the prefix-projected database w.r.t prefix <a> is a database containing the following sequences:
Apriori Property of Sequential Patterns
This states that if a sequence S is not frequent, then any supersequence of S is also not frequent. Thus, if <(ab),c,e> is not frequent, then <(ab),(cd),e> is also not frequent, and <(ab),(cd),(ef)> is also not frequent.
What makes Hevo’s ETL Process Best-In-Class
Providing a high-quality ETL solution can be a difficult task if you have a large volume of data. Hevo’s automated, No-code platform empowers you with everything you need to have for a smooth data replication experience.
Check out what makes Hevo amazing:
- Fully Managed: Hevo requires no management and maintenance as it is a fully automated platform.
- Data Transformation: Hevo provides a simple interface to perfect, modify, and enrich the data you want to transfer.
- Faster Insight Generation: Hevo offers near real-time data replication so you have access to real-time insight generation and faster decision making.
- Schema Management: Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
- Scalable Infrastructure: Hevo has in-built integrations for 100+ sources (with 40+ free sources) that can help you scale your data infrastructure as required.
- Live Support: Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.
Algorithms for Sequence Pattern Mining
- GSP (Generalized Sequential Pattern Mining)
- SPADE (Sequential Pattern Discovery using Equivalence Class)
- Non-Apriori Based Algorithms
- PrefixSpan (Prefix-projected Sequential Pattern Mining)
In this section of Sequence Pattern Mining, we’ll take a broad look at some algorithms that are used in Sequence Pattern Mining. The purpose is to make you aware of and familiar with these algorithms.
GSP (Generalized Sequential Pattern Mining)
This Sequence Pattern Mining algorithm takes a bottom-up approach to find frequent patterns. Initially, every element is considered as a candidate of length 1. Based on the minimum support, frequent sequences of length 1 are identified.
Now, using Apriori Pruning (discarding supersequences of infrequent sequences of length 1), supersequences of length 2 are constructed as candidates. This process repeats till no more candidates or no frequent sequence can be found. Thus, this process outputs all the frequent sequences from the dataset, starting from length 1. A very good example can be found here.
While this algorithm reduces the search space by Apriori Pruning, it still scans the database multiple times and can generate a large number of candidates if the minimum support is less.
SPADE (Sequential Pattern Discovery Using Equivalence Class)
This Sequence Pattern Mining algorithm identifies each element in each sequence in a dataset with a Sequence ID (SID) and the Element ID (EID). Candidates of length 1 are constructed, and the SIDs and EIDs of all elements where they occur are noted.
Candidates of higher length are constructed with the property that the Element IDs of all the elements in the candidate should be in increasing order. This algorithm also facilitates joins (for example, if a set of SIDs and EIDs are identified for candidates ab and ba, then SIDs and EIDs can be obtained for candidate aba through joins). This algorithm is perhaps best explained in this video.
Non-Apriori Based Algorithms
With the above Apriori-based algorithms, the database is scanned multiple times, and it becomes inefficient to use these for large datasets. In order to deal with these limitations, other algorithms like PrefixSpan, FreeSpan, CloSpan, and others were developed.
Going into the depth of each of these algorithms is beyond the scope of this article, and, frankly, these require semester courses at the university level for in-depth understanding. Let’s just spend some time on our last Sequence Pattern Mining algorithm, PrefixSpan.
PrefixSpan (Prefix-projected Sequential Pattern Mining)
In this algorithm, a divide-and-conquer framework gets used:
- First, the framework finds all length 1 sequential patterns (that qualify for minimum support).
- It then determines the prefix projected database for each of these sequential patterns.
- From the prefix-projected database, the framework evaluates all the length-2 sequential patterns having the same initial prefix.
- It then determines the prefix projected database for these length-2 sequential patterns.
- The framework repeats the same steps recursively till no more sequential patterns can be found.
In this algorithm, no candidate sequence needs to be generated. Another advantage is that the projected database keeps shrinking with each step. Constructing projected databases is the only major cost associated with this algorithm.
An example of PrefixSpan, along with a comparison with GSP and FreeSpan can be found here.
For more details on these non-Apriori algorithms, you can refer to the following resources:
In this piece, we obtained a general introduction to Sequence Pattern Mining and its applications. We understood the many forms of Sequence Pattern Mining, the terminologies related to Sequence Pattern Mining, and the popular Apriori-based algorithms. In the end, we provided references for non-Apriori-based algorithms, so you can feed your curiosity and explore more.
Hevo Data is a prizewinning ETL solution to help businesses export data from their sources into their desired Database/destination. Its No-code Data Pipeline provides you with a consistent and reliable solution to manage data transfer from 100+ Data Sources (including 40+ Free Sources) to a wide variety of desired destinations with a few simple clicks.
Hevo also allows the integration of data from non-native sources using Hevo’s in-built REST API & Webhooks Connector. You can then focus on your key business needs and perform insightful analysis using BI tools.
Give Hevo a try and Sign Up for a 14-day free trial and experience the feature-rich Hevo suite first hand. You may also have a look at the unbeatable pricing, which will assist you in selecting the best plan for your requirements.
We hope you found this article helpful. Thanks for reading.