Apache Sqoop is a tool in the Hadoop ecosystem designed to transfer data between HDFS (Hadoop storage) and relational database (RDBMS) servers such as MySQL, Oracle, SQLite, Netezza, Teradata, and Postgres. Apache Sqoop imports data from relational databases to HDFS and exports data from HDFS to relational databases. It efficiently transfers bulk data between Hadoop and external data stores such as enterprise data warehouses and relational databases.
This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”.
Additionally, Sqoop is used to import data from external datastores into Hadoop ecosystem tools like Hive and HBase.
Why Sqoop?
When data residing in relational database management systems needs to be transferred to HDFS, writing MapReduce code to import and export it is tedious and repetitive. This is where Apache Sqoop comes to the rescue: it automates the process of importing and exporting the data.
Sqoop makes developers' lives easier by providing a CLI for importing and exporting data. They just have to provide basic information such as database authentication, source, destination, and the operation to perform; Sqoop takes care of the rest.
Sqoop internally converts the command into MapReduce tasks, which are then executed over HDFS. It uses the YARN framework to import and export the data, which provides fault tolerance on top of parallelism.
Programmers mostly use two Sqoop commands, sqoop import and sqoop export. Let's discuss them one by one.
SQOOP IMPORT Command
The sqoop import command imports a table from an RDBMS into HDFS. In our case, we are going to import tables from a MySQL database into HDFS.
As you can see in the image below, we have a student table in the mydb database, which we will import into HDFS.
The command for importing the table is:
sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table student -m 1 --target-dir /sqoop_import_01
As you can see in the images below, after executing this command the SQL statements and map tasks are executed at the back end.
After the command is executed, you can check the HDFS target folder (in our case, /sqoop_import_01) where the data is imported.
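To verify the import, you can list and inspect the files Sqoop wrote to the target directory. The part-m-* file name below is the standard map-task output naming and is shown only as an illustration:
hdfs dfs -ls /sqoop_import_01
hdfs dfs -cat /sqoop_import_01/part-m-00000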
Import control arguments:
--append: Append data to an existing dataset in HDFS
--as-avrodatafile: Imports data to Avro Data Files
--as-sequencefile: Imports data to SequenceFiles
--as-textfile: Imports data as plain text (default)
--as-parquetfile: Imports data to Parquet Files
--boundary-query <statement>: Boundary query to use for creating splits
--columns <col,col,col…>: Columns to import from the table
--delete-target-dir: Delete the import target directory if it exists
--direct: Use the direct connector if one exists for the database
--fetch-size <n>: Number of entries to read from the database at once
--inline-lob-limit <n>: Set the maximum size for an inline LOB
-m,--num-mappers <n>: Use n map tasks to import in parallel
-e,--query <statement>: Import the results of statement
--split-by <column-name>: Column of the table used to split work units; cannot be used with --autoreset-to-one-mapper
--split-limit <n>: Upper limit for each split size; applies only to Integer and Date columns
--autoreset-to-one-mapper: Use one mapper if the table has no primary key and no split-by column is provided; cannot be used with --split-by <col>
--table <table-name>: Table to read
--target-dir <dir>: HDFS destination directory
--temporary-rootdir <dir>: HDFS directory for temporary files created during the import
--warehouse-dir <dir>: HDFS parent directory for the table destination
--where <where clause>: WHERE clause to use during import
-z,--compress: Enable compression
--compression-codec <c>: Hadoop compression codec to use (default gzip)
--null-string <null-string>: The string to be written for a null value in string columns
--null-non-string <null-string>: The string to be written for a null value in non-string columns
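As an illustration of a few of these arguments, a command along the following lines would import only selected columns and rows as compressed text (the column names and the filter are assumptions, not taken from the original post):
sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table student --columns "id,name" --where "age > 20" --compress --delete-target-dir --target-dir /sqoop_import_filtered -m 1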
Now, try to increase the number of map tasks from 1 to 2 (-m 2) in the command so the import runs as two parallel tasks, and let's see the result.
So, as you can see, if more than one map task (-m 2) is used in the import command, the table should have a primary key, or you must specify --split-by in your command.
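For example, assuming the student table has a numeric id column (an assumption for illustration), a parallel import could be run like this:
sqoop import --connect jdbc:mysql://localhost/mydb --username root --password root --table student --split-by id -m 2 --target-dir /sqoop_import_02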
There are many other import arguments available that you can use as per your requirements.
SQOOP Export Command
The sqoop export command exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.
So, first we create an empty table, into which we will export our data.
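A minimal sketch of such a table, assuming the student data has id, name, and age columns (the schema is an assumption, not taken from the original post), can be created from the shell like this:
mysql -u root -p mydb -e "CREATE TABLE student_export (id INT, name VARCHAR(50), age INT);"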
The command to export data from HDFS to the relational database is:
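A representative version, assuming the student_export table created above and the imported files under /sqoop_import_01, would be:
sqoop export --connect jdbc:mysql://localhost/mydb --username root --password root --table student_export --export-dir /sqoop_import_01 -m 1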
Export control arguments:
--columns <col,col,col…>: Columns to export to the table
--direct: Use the direct export fast path
--export-dir <dir>: HDFS source path for the export
-m,--num-mappers <n>: Use n map tasks to export in parallel
--table <table-name>: Table to populate
--call <stored-proc-name>: Stored procedure to call
--update-key <col-name>: Anchor column to use for updates; use a comma-separated list if there is more than one column
--update-mode <mode>: Specify how updates are performed when new rows are found with non-matching keys in the database; legal values for mode are updateonly (default) and allowinsert
--input-null-string <null-string>: The string to be interpreted as null for string columns
--input-null-non-string <null-string>: The string to be interpreted as null for non-string columns
--staging-table <staging-table-name>: The table in which data will be staged before being inserted into the destination table
--clear-staging-table: Indicates that any data present in the staging table can be deleted
--batch: Use batch mode for underlying statement execution
Note that, to import or export, the order of columns in both MySQL and Hive should be the same.
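To illustrate update mode, a minimal sketch of an export that updates existing rows and inserts new ones, assuming id is the key column of the student_export table used above, would be:
sqoop export --connect jdbc:mysql://localhost/mydb --username root --password root --table student_export --export-dir /sqoop_import_01 --update-key id --update-mode allowinsert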
I hope you have enjoyed this post and that it helped you understand the Sqoop import and export commands. Please like and share, and feel free to comment if you have any suggestions or feedback.