Sqoop import : composite primary key and textual primary key

Specify split column manually. Split column is not necessarily equal to PK. You can have complex PK and some int Split column. You can specify any integer column or even simple function (some simple function like substring or cast, not aggregation or analytic). Split column preferably should be evenly distributed integer.

For example if your split column contains few rows with value -1 and 10M rows with values 10000 – 10000000 and num-mappers=8 then sqoop will split the dataset between mappers not evenly:

  • 1st mapper will get few rows with -1,
  • 2nd-7th mappers will get 0 rows,
  • 8th mapper will get almost 10M rows,

that will result in data skewing and 8th mapper will run for ever or
even fail. And I have also got duplicates when used non-integer
split-column with MS-SQL
. So, use integer split-column. In your case
with table with only two varchar columns you can either

(1) add surrogate int PK and use it also as a split or

(2) split your data manually using custom query with WHERE clause and run sqoop few times with num-mappers=1, or

(3) apply some deterministic Integer non-aggregation function to you varchar column, for example cast(substr(…) as int) or second(timestamp_col) or datepart(second, date), etc. as split-column.
For Teradata you can use AMP number: HASHAMP (HASHBUCKET (HASHROW (string_column_list))) to get integer AMP number from list of not integer keys and rely on TD distribution between AMPs. I used simple functions directly as split-by without adding it to the query as a derived column

Leave a Comment