java - Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
I have a CSV which is... 34 million lines long. Yes, no joking.

This CSV file is produced by a parser tracer, and is then imported into the corresponding debugging program.

And the problem is in the latter.

Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which turns a String (a line of the CSV) into a NodesRecord for insertion.
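A minimal sketch of what such a mapper could look like (the actual NodesRecord columns are not shown here, so ID and LABEL below are placeholders):

// Sketch only: the real NodesRecord columns are not shown in this question,
// so setId/setLabel are hypothetical. Requires java.util.function.Function.
private final Function<String, NodesRecord> csvToNode = line -> {
    // Naive split for illustration; a real implementation would use a proper
    // CSV parser to handle quoting and embedded separators.
    final String[] columns = line.split(",", -1);
    final NodesRecord record = new NodesRecord();
    record.setId(Long.valueOf(columns[0]));  // hypothetical column
    record.setLabel(columns[1]);             // hypothetical column
    return record;
};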
Now, this line:

.peek(ignored -> status.incrementProcessedNodes())

well... the method name tells pretty much everything: it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
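For reference, a minimal sketch of such a status object, assuming a single counter incremented by the loading thread and polled by a reporting thread (the real class is not shown here):

import java.util.concurrent.atomic.AtomicLong;

// Sketch: an AtomicLong makes the counter safe to increment from the loading
// thread and to read from a reporting thread polling it every second.
public final class LoadStatus
{
    private final AtomicLong processedNodes = new AtomicLong();

    public void incrementProcessedNodes()
    {
        processedNodes.incrementAndGet();
    }

    public long getProcessedNodes()
    {
        return processedNodes.get();
    }
}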
But JooQ has this (taken from their documentation), which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use that .loadCSV() overload, since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that, for this or that DB engine, the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query with a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this select may take a long time.

So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and still query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.

Custom load partitioning

One way to optimise query execution at your side is to collect sets of values into:
- bulk statements (as in insert into t values(1), (2), (3), (4))
- batch statements (as in JDBC batch)
- commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API does as well (see below). All of these measures can heavily increase load speed.

This is also the only way you can currently "listen" to the loading progress; a sketch follows below.
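A sketch of such manual partitioning against the code from your question (BATCH_SIZE and the flush helper are illustrative; DSLContext.batchInsert() executes one JDBC batch per chunk):

private static final int BATCH_SIZE = 1000;  // illustrative rate

private void insertNodesBatched(final DSLContext jooq) throws IOException
{
    try (final Stream<String> lines = Files.lines(nodesPath, UTF8)) {
        final List<NodesRecord> chunk = new ArrayList<>(BATCH_SIZE);

        lines.map(csvToNode).forEach(r -> {
            chunk.add(r);
            if (chunk.size() == BATCH_SIZE) {
                flush(jooq, chunk);
            }
        });

        if (!chunk.isEmpty()) {
            flush(jooq, chunk);  // trailing, partially filled chunk
        }
    }
}

private void flush(final DSLContext jooq, final List<NodesRecord> chunk)
{
    // One JDBC batch per chunk instead of one statement per row...
    jooq.batchInsert(chunk).execute();

    // ... and the counter still advances, so the per-second status survives,
    // merely at chunk granularity instead of row granularity.
    for (int i = 0; i < chunk.size(); i++) {
        status.incrementProcessedNodes();
    }
    chunk.clear();
}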
Load partitioning using jOOQ 3.6+

(this hasn't been released yet, but it will be, soon)

jOOQ natively implements the above three partitioning measures in jOOQ 3.6 (sketches of the relevant methods follow under "General remarks" below).
Using vendor-specific CSV loading mechanisms

jOOQ will always need to pass through JDBC, and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
- h2: http://www.h2database.com/html/tutorial.html#csv
- postgresql: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
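For illustration, a sketch of both over a plain JDBC java.sql.Connection (table and file names are illustrative):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class VendorCsvLoad
{
    // H2: CSVREAD evaluates the file inside the engine, so there is no
    // per-row JDBC traffic. 'charset=UTF-8' is H2's CSV option syntax.
    static void loadWithH2(final Connection connection) throws SQLException
    {
        try (final Statement stmt = connection.createStatement()) {
            stmt.execute("INSERT INTO nodes "
                + "SELECT * FROM CSVREAD('/path/to/nodes.csv', NULL, 'charset=UTF-8')");
        }
    }

    // PostgreSQL: COPY is the server-side equivalent; the file must be
    // readable by the database server process (psql's \copy is the
    // client-side variant).
    static void loadWithPostgres(final Connection connection) throws SQLException
    {
        try (final Statement stmt = connection.createStatement()) {
            stmt.execute("COPY nodes FROM '/path/to/nodes.csv' WITH (FORMAT csv)");
        }
    }
}

Note that neither mechanism reports per-row progress, so the "by second" status information from your question is lost entirely here.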
General remarks

"What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load)."
That's an interesting idea. I will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
"Though personally I'd never use that .loadCSV() overload, since it doesn't take the CSV encoding into account."
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jooq/jooq/issues/4141
"And of course JooQ will manage to turn that into a suitable construct so that, for this or that DB engine, the throughput is maximized."
No, jOOQ doesn't make any assumptions about maximising throughput. That is extremely difficult and depends on many factors other than your DB vendor, e.g.:

- constraints on the table
- indexes on the table
- logging turned on/off
- etc.
What jOOQ does offer is help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:

- set the commit rate (e.g. commit every 1000 rows) to avoid long undo / redo logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods, as sketched below.
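For instance (a sketch against the NODES table from your question; the rate of 1000 is illustrative):

create.loadInto(NODES)
      // jOOQ 3.5+: commit every 1000 rows instead of one huge transaction
      .commitAfter(1000)
      .loadCSV(Files.newBufferedReader(nodesPath, UTF8))
      .fields(NODES.fields())
      .execute();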
In jOOQ 3.6+, you can also:

- set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution, via the bulkXXX() methods
- set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details), via the batchXXX() methods

A combined sketch follows this list.
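Combined, a jOOQ 3.6+ load could then look like this (a sketch; the rates are illustrative, and the exact semantics of combining the three rates is documented in the manual):

create.loadInto(NODES)
      .bulkAfter(10)     // combine 10 rows in a single INSERT .. VALUES (..), (..) statement
      .batchAfter(10)    // combine 10 such statements in a single JDBC batch
      .commitAfter(10)   // commit after 10 batches
      .loadCSV(Files.newBufferedReader(nodesPath, UTF8))
      .fields(NODES.fields())
      .execute();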