Press "Enter" to skip to content

PostgreSQL- Setting up Full Text Search

PostgreSQL full text search

Introduction:

Setting up Full-Text Search (FTS) in PostgreSQL helps users bring up a fully fledged search engine without a steep learning curve and new softwares to install. It can be used to set up anything from a small catalog website to a huge news website.

You might wonder what’s the big deal with Full-Text Search, you can use LIKE or ILIKE clause any day in a query to perform a search on any given string. The problem with traditional LIKE clause is that it has a very high-performance penalty. That means, it takes a long time to return results as the words inside the text are not indexed. Also, the results won’t be as expected, for e.g. you might search rectify in a document which contains rectifies – while ideally the document should show up in the search result but it doesn’t.
Prerequisites:
Before getting started you need to have a working copy of PostgreSQL installed on your favorite OS.

Step 1: Preparing the Database

We will first create a table

CREATE TABLE tv_series (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT NOT NULL,
    creator TEXT NOT NULL
);

Next, we will insert some data to start querying

INSERT INTO tv_series 
    (id, title, description, creator) 
VALUES 
    (1, 'Stranger Things', 'Set in the fictional town of Hawkins, Indiana in the 1980s, the first season focuses on the investigation into the disappearance of a young boy...', 'Duffer Brothers'),
    (2, 'Breaking Bad', 'BB is an American neo-western crime drama television series created and produced by Vince Gilligan. It tells the story of Walter White, a struggling high school chemistry teacher diagnosed with inoperable lung cancer...', 'Vince Gilligan'),
    (3, 'Band of Brothers', 'Band of Brothers is a 2001 American war drama miniseries based on historian Stephen E. Ambrose''s 1992 nonfiction book of the same name. The executive producers were Steven Spielberg and Tom Hanks...', 'Steven Spielberg, Tom Hanks'),
    (4, 'Rick and Morty', 'Rick and Morty is an American animated science-fiction sitcom created by Justin Roiland and Dan Harmon for Cartoon Network''s late-night programming block...', 'Justin Roiland, Dan Harmon');

Step 2: Exploring Data

We will now have to create a document, which is simply just concatenating all the fields we want to perform a search on.

SELECT 
    (title || ' ' || description || ' ' || creator) AS document 
FROM tv_series;

At this stage we just have a string, to make it more actionable we need to transform it using to_tsvector().

SELECT 
    (to_tsvector(title) || to_tsvector(description) || to_tsvector(creator)) AS document 
FROM tv_series;

The result we have got in return is in tsvector form, this is more suitable to perform a full-text search. But if you notice, the result looks weirder than anything we originally had. If you’d notice, you will find that first record in the result has, in fact, reduced the word investigation to investig, also there are some random numbers in front of each word in the result. What really happened is that Postgres just sorted a list of distinct lexemes, which are words that have been normalized to merge different variants of the same word.

Let’s start querying the data. We will be using @@ operator to perform a query and check if a tsvector matches tsquery

SELECT to_tsvector(description) @@ 'rick' FROM tv_series WHERE id = 4;
?column?
----------
t
(1 row)

Yes, it did match. Now let’s try another query.

SELECT to_tsvector(description) @@ 'morty' FROM tv_series WHERE id = 4;
?column?
----------
f
(1 row)

Sadly, the result is negative. To solve this we need to build a tsquery which creates the same lexemes and using the operator @@, casts the string into a tsquery. The following shows the difference between casting and using the function to_tsquery().

SELECT 'morty'::tsquery, to_tsquery('morty');
 tsquery | to_tsquery
---------+------------
 'morty' | 'morti'
(1 row)

But in the case of ‘rick’ the stem is equal to the word.

SELECT 'rick'::tsquery, to_tsquery('rick');
 tsquery | to_tsquery
---------+------------
  'rick' | 'rick'
(1 row)

So we will have to use to_tsquery() for querying documents.

SELECT to_tsvector(description) @@ to_tsquery('morty') FROM tv_series WHERE id = 4;
?column?
----------
t
(1 row)

Step 3: Improving Performance of Full-Text Search

In a production environment we can’t create tsvector at runtime and expect it to run at lightning speed, we will have to pre create it whenever a new entry is inserted in our table. For this, we will create a new column in our table named document_vectors of data type tsvector. Also, we would require indexes to perform queries at speeds higher than that of light. When it comes to tsvector, Postgres provides us with GiST (Generalized Search Tree) and GIN (Generalized Inverted Index) indexes. The main difference between them is, GiST is faster to build but requires additional reads, while on the other hand GIN is slower to build but faster to query when compared to GiST. So if your database is read intensive you must go with GIN.

ALTER TABLE tv_series ADD "document_vectors" tsvector;
CREATE INDEX idx_fts_doc_vec ON tv_series USING gin(document_vectors);

Next update the tsvector of existing rows

UPDATE 
    tv_series 
SET 
    document_vectors = (to_tsvector(title) || to_tsvector(description) || to_tsvector(creator));

For more fun we will insert a new row in the column this time with tsvector of document.

INSERT INTO tv_series
    (id, title, description, creator, document_vectors)
VALUES
    (5, 'Better Call Saul', 'Better Call Saul is an American television crime drama series created by Vince Gilligan and Peter Gould. It is a spin-off prequel of Gilligan''s prior series Breaking Bad.', 'Vince Gilligan, Peter Gould', (to_tsvector('Better Call Saul') || to_tsvector('Better Call Saul is an American television crime drama series created by Vince Gilligan and Peter Gould. It is a spin-off prequel of Gilligan''s prior series Breaking Bad.') || to_tsvector('Vince Gilligan, Peter Gould')));

Finally, let’s search for my favorite series Breaking Bad.

SELECT id, title, creator FROM tv_series WHERE document_vectors @@ to_tsquery('break');
 id | title            | creator
----+------------------+-----------------------------
  2 | Breaking Bad     | Vince Gilligan
  5 | Better Call Saul | Vince Gilligan, Peter Gould
(2 rows)

We were able to successfully query all the documents which had the word Breaking Bad in it with just the word ‘break’.

For more such posts, visit our blog page.

ETL Data to Redshift, Bigquery, Snowflake

Move Data from any Source to Warehouse in Real-time

Sign up today to get $500 Free Credits to try Hevo!
Start Free Trial

Related Posts