Press "Enter" to skip to content

Oracle to Redshift ETL – Steps to Move Data


Is your Oracle server getting too slow for analytical queries? Are you paying too much to increase the storage capacity or compute power of your Oracle instance? Or are you looking to join and combine data from multiple databases seamlessly? Whatever the case may be, Amazon Redshift offers compelling solutions to these problems, so there is little reason to hesitate before moving your data from Oracle to an Amazon Redshift cluster. This post covers the basic idea behind the two architectures and the detailed steps you need to follow to migrate data from Oracle to Redshift. It also explains why you should consider an ETL solution such as Hevo Data to make the migration smooth and efficient.

Overview of Oracle and Amazon Redshift

Oracle is fundamentally a proprietary, multi-model, relational database system used for data warehousing and Online Transaction Processing (OLTP). However, recent versions have added features typically associated with cloud-based solutions such as Amazon Redshift, including columnar storage and on-cloud deployment.

Amazon Redshift is a fully managed, easily scalable, PostgreSQL-based cloud database optimized for Online Analytical Processing (OLAP) and data warehousing. You can get things started very quickly in just two steps:

  1. Launch a Redshift cluster via simple API calls or through the AWS Management Console.
  2. Connect the local SQL client to this Redshift instance.
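
For example, once the cluster is available, connecting from a local machine with psql looks roughly like this (the endpoint, database name, and user below are placeholders, not values from this post; Redshift listens on port 5439 by default):

psql -h examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com -p 5439 -d dev -U awsuser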

Redshift's unique architecture brings many advantages, and deciding to move data from Oracle to Redshift is a sound step toward upgrading your data analytics infrastructure.

Performing ETL from Oracle to Redshift

There are two main ways of moving data from Oracle to Redshift:

Method 1: Using a no-code, fully managed Data Pipeline Platform such as Hevo (comes with a 14-day free trial)

Method 2: Hand-coding ETL scripts and configuring jobs to move Oracle data to Redshift

In this post, we will cover Method 2 in detail, list its limitations, and discuss workarounds to address them.

Moving Data from Oracle to Redshift Using Custom Code

A straightforward way to move data from Oracle to Amazon Redshift is to:

  1. Export the Oracle source data to flat files on the local system.
  2. Move these files to an AWS S3 bucket.
  3. Copy the S3 data into the respective Amazon Redshift tables.

These steps are illustrated in technical detail via an example in the following section.

Step 1: Exporting data from an Oracle table via Spool

One of the most common ways to export Oracle data to a flat file is the spool command. Here's an example of how to do it:

SPOOL c:\oracle\org\emp.csv

SELECT employeeno || ',' ||
       employeename || ',' ||
       job || ',' ||
       manager || ',' ||
       TO_CHAR(hiredate,'YYYY-MM-DD') || ',' ||
       salary
FROM   employee
ORDER BY employeeno;

SPOOL OFF

The above code exports all records in the employee table into the emp.csv file under the org folder. The CSV file can then be compressed (using '$ gzip emp.csv', which produces emp.csv.gz) before moving it to the S3 bucket.
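
Note that, depending on your SQL*Plus defaults, the spooled file may also contain column headers, feedback messages, and padding. A minimal preamble to keep the output clean, assuming you are spooling from SQL*Plus, could look like this (adjust to your environment):

-- SQL*Plus settings for a cleaner spool file (a sketch, not an exhaustive list)
SET HEADING OFF      -- do not print column headers
SET FEEDBACK OFF     -- suppress "n rows selected" messages
SET PAGESIZE 0       -- no page breaks in the output
SET LINESIZE 32767   -- avoid wrapping long concatenated rows
SET TRIMSPOOL ON     -- trim trailing spaces in the spool file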

Step 2: Copying a flat file onto an AWS S3 bucket

AWS provides S3 buckets to store files that can then be loaded into an Amazon Redshift instance using the COPY command. To upload a local file to an S3 bucket, you can run the cp command in the AWS Command Line Interface (CLI). Here's how you would do it:

aws s3 cp c:\oracle\org\emp.csv.gz s3://org/empl/emp.csv.gz

However, if you'd prefer the Graphical User Interface (GUI), you can go to the AWS S3 console at https://console.aws.amazon.com/s3/home and upload your emp.csv.gz file into the desired S3 bucket.
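
Either way, you can confirm that the file landed where expected by listing the bucket contents from the CLI; a quick check against the example bucket and prefix above would be:

aws s3 ls s3://org/empl/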

Step 3: Creating an empty table and loading data from the S3 bucket

Before running the COPY command, an empty table must be created in the database to receive the data from the emp.csv file now available in the S3 bucket.

The employee table on Redshift can be created using the following code –

SET SEARCH_PATH TO PUBLIC;   -- selecting the schema

CREATE TABLE EMPLOYEE (
  employeeno INTEGER NOT NULL,
  employeename VARCHAR,
  job VARCHAR,
  manager VARCHAR,
  hiredate DATE,
  salary INTEGER
)
DISTKEY(hiredate)
SORTKEY(employeeno);

The flat file copied over to S3 can be loaded into the above table using the following –

SET SEARCH_PATH TO PUBLIC;

COPY EMPLOYEE
FROM 's3://org/empl/emp.csv.gz'
CREDENTIALS 'aws_access_key_id=MY_ACCESS_KEY;aws_secret_access_key=MY_SECRET_KEY'
DELIMITER ','
GZIP;
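
If the COPY command fails or loads fewer rows than expected, Redshift records the details in the STL_LOAD_ERRORS system table. A quick way to inspect recent load errors (the column list below is trimmed for brevity) is:

-- Inspect recent COPY failures
SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;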

Incremental Load from Oracle to Redshift

The above is a one-off example to demonstrate the process of moving data from Oracle to Redshift. In reality, this would typically be performed every day, on an entire database consisting of tens or hundreds of tables, in a scheduled and automated fashion. Here is how this is done.

Step 1: Iterative exporting of tables

The following PL/SQL block goes through the tables in the schema one by one and prints, for each of them, the spool commands needed to export its data into a separate CSV file named *name of the table*_s3.csv. The printed commands can then be captured into a script and executed (see the sketch after the block).

begin
  for item in (select table_name from user_tables)
  loop
    dbms_output.put_line('spool '||item.table_name||'_s3.csv');
    dbms_output.put_line('select * from '||item.table_name||';');
    dbms_output.put_line('spool off');
  end loop;
end;
/
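
To run the generated commands, you can capture the DBMS_OUTPUT into a script file and then execute that script. A minimal SQL*Plus sketch, in which generate_exports.sql (the block above saved as a file) and export_all_tables.sql are placeholder file names, might look like this:

SET SERVEROUTPUT ON
SET FEEDBACK OFF
SPOOL export_all_tables.sql
@generate_exports.sql    -- prints the spool/select commands, which are captured by the spool
SPOOL OFF
@export_all_tables.sql   -- runs the generated commands and writes one CSV per table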

Step 2: Copying a bunch of .csv files to AWS S3

The exported .csv files can be uploaded to an S3 bucket using the following command –

aws s3 cp <your directory path> s3://<your bucket name> --grants read=uri=http://acs.amazonaws.com/groups/global/AllUsers --recursive

Step 3: Importing S3 data into Redshift

As mentioned before, this process is typically done every 24 hours on a whole lot of data. Hence, you must ensure that there is no data loss as well as no duplicate data. The second part (duplicate data) is particularly relevant when copying data over to Redshift because Redshift doesn’t enforce primary key constraints.
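
Since duplicates will not be rejected by the database itself, it is worth checking for them after each load. A simple check on the example table, assuming employeeno is the intended primary key, could be:

-- Find employeeno values that appear more than once
SELECT employeeno, COUNT(*) AS row_count
FROM employee
GROUP BY employeeno
HAVING COUNT(*) > 1;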

Now, you could drop all the data in your Redshift instance and reload the entire Oracle database every time you perform the data load. However, this is (a) quite risky with regard to data loss and (b) very inefficient and computationally intensive. Hence, a good way to perform the data loads efficiently while ensuring data consistency is to:

  1. Copy the S3 flat file data into a temp table: This is achieved by running the 'COPY' command the same way as explained in Step 3 above.
  2. Compare the temp table data (the incoming .csv files) with the existing table data: See the section Data Loads – SCD Type 1 and Type 2.
  3. Resolve any data inconsistency issues: See the section Data Loads – SCD Type 1 and Type 2.
  4. Remove the affected data from the parent table and copy the new, cleaned-up data from the temp table: Run the following commands:
begin;

delete from employee where *condition*;   -- the condition depends on what data is available in the temp table

insert into employee select * from emp_temp_table;

end;  

Data Loads – SCD Type 1 and Type 2

Generally, while comparing the existing (old) table data with the new stream of data (the S3 bucket data, in this case), one or both of the following methods are used to complete the data load:

Type 1 or Upsert – A new record is either inserted or updated. The update happens only when the primary key of the incoming record matches with the primary key of an existing record. Here is an example –

Existing Record –

Employeename | Employeenum | Job
John         | 123         | Sales

Incoming Record –

Employeename | Employeenum | Job
John         | 123         | Operations

Final Table (After Upsert) –

Employeename | Employeenum | Job
John         | 123         | Operations
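
On Redshift, a Type 1 load is commonly implemented as a delete-and-insert from a staging table rather than a row-by-row update. A sketch of this approach, assuming the emp_temp_table staging table used earlier and employeeno as the matching key, is:

-- Type 1 (upsert) via staging table: replace matching rows, then insert the incoming data
begin;
delete from employee
using emp_temp_table
where employee.employeeno = emp_temp_table.employeeno;
insert into employee
select * from emp_temp_table;
end;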

Type 2 or Maintain history – In this scenario, if the primary key of the incoming record matches the primary key of an existing record, the existing record is end-dated or flagged to reflect that it is a past record. Here is the Type 2 result for the above example –

Existing Record –

Employeename | Employeenum | Job
John         | 123         | Sales

Incoming Record –

Employeename | Employeenum | Job
John         | 123         | Operations

Final Table (After Type 2) –

Employeename | Employeenum | Job        | Activeflag
John         | 123         | Operations | 1
John         | 123         | Sales      | 0
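
A Type 2 load keeps the old row and adds the new one. A sketch using an activeflag column (as in the table above, and assumed to exist on the target table) and the same staging table might look like this, with illustrative column lists:

-- Type 2 (maintain history): flag existing matching rows as inactive, then insert the new rows as active
begin;
update employee
set activeflag = 0
from emp_temp_table
where employee.employeeno = emp_temp_table.employeeno
  and employee.activeflag = 1;
insert into employee (employeename, employeeno, job, activeflag)
select employeename, employeeno, job, 1
from emp_temp_table;
end;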

Limitations of writing Custom ETL Scripts to move data from Oracle to Redshift

Although a script (or, more likely, a combination of scripts) written to execute the above steps will work, keeping such a system running smoothly is tedious for the following reasons:

  1. Many different kinds of steps need to be executed in a dependent fashion, without failure.
  2. The incremental load is especially difficult to code and execute in a way that guarantees no data loss or data inconsistencies, while doing a full load every time puts a lot of load on the Oracle database.
  3. As mentioned, this is typically done once every day. Lately, however, people want to look at more real-time data, so the process will have to be executed far more frequently than once in 24 hours. That is going to test the robustness and thoroughness of your solution a lot more.

An Easier Way to ETL Data from Oracle to Redshift

A Data Pipeline Platform like Hevo Data that works out of the box is a much better option to ensure you are never at a disadvantage due to the above limitations. With Hevo, you can get your data from Oracle to Redshift in just a few minutes, with minimal setup or implementation time. Here are the steps to replicate an Oracle DB to Amazon Redshift using Hevo:

  1. Connect and configure your Oracle database
  2. Select the replication mode:
    • Define data load through a custom SQL query
    • Select the Oracle DB tables to be loaded via the interface
    • Load data through Redo logs
  3. Configure the Redshift Data Warehouse for data load

Furthermore, Hevo has an intuitive user interface that lets even not-so-technical users easily tweak the parameters of your data load settings. This comes in super handy once you have everything up and running.

With Hevo, you can achieve seamless and accurate data replication from Oracle to Redshift. With its fault-tolerant architecture, Hevo ensures that no data is lost while loading. This empowers you to focus on the right projects instead of worrying about data availability.

Beyond Oracle, Hevo supports 100+ additional data sources that work out of the box. This makes Hevo the right companion to walk by your side as your organization grows.

Sign up for a 14-Day Free Trial with Hevo and experience code-free, hassle-free data loading from Oracle to Redshift.
