java - Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
I have a CSV which is... 34 million lines long. Yes, no joking.

This CSV file is produced by a parser tracer, and is then imported into the corresponding debugging program.

And the problem is in the latter.

Right now I import all rows one by one:
private void insertNodes(final DSLContext jooq)
    throws IOException
{
    try (
        final Stream<String> lines = Files.lines(nodesPath, UTF8);
    ) {
        lines.map(csvToNode)
            .peek(ignored -> status.incrementProcessedNodes())
            .forEach(r -> jooq.insertInto(NODES).set(r).execute());
    }
}
csvToNode is simply a mapper which turns a String (a line of the CSV) into a NodesRecord for insertion.
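A minimal sketch of what such a mapper could look like (the actual NodesRecord columns are not shown here, so ID and LABEL below are placeholders):

// Sketch only: the real NodesRecord columns are not shown in this question,
// so setId/setLabel are hypothetical. Requires java.util.function.Function.
private final Function<String, NodesRecord> csvToNode = line -> {
    // Naive split for illustration; a real implementation would use a proper
    // CSV parser to handle quoting and embedded separators.
    final String[] columns = line.split(",", -1);
    final NodesRecord record = new NodesRecord();
    record.setId(Long.valueOf(columns[0]));  // hypothetical column
    record.setLabel(columns[1]);             // hypothetical column
    return record;
};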
Now, this line:

.peek(ignored -> status.incrementProcessedNodes())

well... the method name tells pretty much everything: it increments a counter in status which reflects the number of rows processed so far.
What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load).
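For reference, a minimal sketch of such a status object, assuming a single counter incremented by the loading thread and polled by a reporting thread (the real class is not shown here):

import java.util.concurrent.atomic.AtomicLong;

// Sketch: an AtomicLong makes the counter safe to increment from the loading
// thread and to read from a reporting thread polling it every second.
public final class LoadStatus
{
    private final AtomicLong processedNodes = new AtomicLong();

    public void incrementProcessedNodes()
    {
        processedNodes.incrementAndGet();
    }

    public long getProcessedNodes()
    {
        return processedNodes.get();
    }
}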
But JooQ has this (taken from their documentation), which can load directly from a CSV:
create.loadInto(AUTHOR)
      .loadCSV(inputstream)
      .fields(ID, AUTHOR_ID, TITLE)
      .execute();
(though personally I'd never use that .loadCSV() overload, since it doesn't take the CSV encoding into account).
And of course JooQ will manage to turn that into a suitable construct so that, for this or that DB engine, the throughput is maximized.
The problem however is that I lose the "by second" information I get from the current code... And if I replace the query with a select count(*) from the_victim_table, that kind of defeats the point, not to mention that this select may take a long time.

So, how do I get "the best of both worlds"? That is, is there a way to use an "optimized CSV load" and still query, quickly enough and at any time, how many rows have been inserted so far?
(note: should that matter, I currently use H2; a PostgreSQL version is also planned)
There are a number of ways to optimise this.

Custom load partitioning

One way to optimise query execution at your side is to collect sets of values into:
- bulk statements (as in insert into t values(1), (2), (3), (4))
- batch statements (as in JDBC batch)
- commit segments (commit after N statements)
... instead of executing them one by one. This is what the Loader API does as well (see below). All of these measures can heavily increase load speed.

This is also the only way you can currently "listen" to the loading progress; a sketch follows below.
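A sketch of such manual partitioning against the code from your question (BATCH_SIZE and the flush helper are illustrative; DSLContext.batchInsert() executes one JDBC batch per chunk):

private static final int BATCH_SIZE = 1000;  // illustrative rate

private void insertNodesBatched(final DSLContext jooq) throws IOException
{
    try (final Stream<String> lines = Files.lines(nodesPath, UTF8)) {
        final List<NodesRecord> chunk = new ArrayList<>(BATCH_SIZE);

        lines.map(csvToNode).forEach(r -> {
            chunk.add(r);
            if (chunk.size() == BATCH_SIZE) {
                flush(jooq, chunk);
            }
        });

        if (!chunk.isEmpty()) {
            flush(jooq, chunk);  // trailing, partially filled chunk
        }
    }
}

private void flush(final DSLContext jooq, final List<NodesRecord> chunk)
{
    // One JDBC batch per chunk instead of one statement per row...
    jooq.batchInsert(chunk).execute();

    // ... and the counter still advances, so the per-second status survives,
    // merely at chunk granularity instead of row granularity.
    for (int i = 0; i < chunk.size(); i++) {
        status.incrementProcessedNodes();
    }
    chunk.clear();
}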
Load partitioning using jOOQ 3.6+

(this hasn't been released yet, but it will be, soon)

jOOQ natively implements the above three partitioning measures in jOOQ 3.6 (sketches of the relevant methods follow under "General remarks" below).
Using vendor-specific CSV loading mechanisms

jOOQ will always need to pass through JDBC, and might thus not present you with the fastest option. Most databases have their own loading APIs, e.g. the ones you've mentioned:
- h2: http://www.h2database.com/html/tutorial.html#csv
- postgresql: http://www.postgresql.org/docs/current/static/sql-copy.html
This will be more low-level, but certainly faster than anything else.
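For illustration, a sketch of both over a plain JDBC java.sql.Connection (table and file names are illustrative):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public final class VendorCsvLoad
{
    // H2: CSVREAD evaluates the file inside the engine, so there is no
    // per-row JDBC traffic. 'charset=UTF-8' is H2's CSV option syntax.
    static void loadWithH2(final Connection connection) throws SQLException
    {
        try (final Statement stmt = connection.createStatement()) {
            stmt.execute("INSERT INTO nodes "
                + "SELECT * FROM CSVREAD('/path/to/nodes.csv', NULL, 'charset=UTF-8')");
        }
    }

    // PostgreSQL: COPY is the server-side equivalent; the file must be
    // readable by the database server process (psql's \copy is the
    // client-side variant).
    static void loadWithPostgres(final Connection connection) throws SQLException
    {
        try (final Statement stmt = connection.createStatement()) {
            stmt.execute("COPY nodes FROM '/path/to/nodes.csv' WITH (FORMAT csv)");
        }
    }
}

Note that neither mechanism reports per-row progress, so the "by second" status information from your question is lost entirely here.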
General remarks

"What happens is that this status object is queried every second to get information about the status of the loading process (we are talking about 34 million rows here; they take about 15 minutes to load)."
That's an interesting idea. I will register this as a feature request for the Loader API: Using JooQ to "batch insert" from a CSV _and_ keep track of inserted records at the same time?
"Though personally I'd never use that .loadCSV() overload, since it doesn't take the CSV encoding into account."
We've fixed that for jOOQ 3.6, thanks to your remarks: https://github.com/jooq/jooq/issues/4141
"And of course JooQ will manage to turn that into a suitable construct so that, for this or that DB engine, the throughput is maximized."
No, jOOQ doesn't make any assumptions about maximising throughput. That is extremely difficult and depends on many factors other than your DB vendor, e.g.:

- constraints on the table
- indexes on the table
- logging turned on/off
- etc.
What jOOQ does offer is help in maximising throughput yourself. For instance, in jOOQ 3.5+, you can:

- set the commit rate (e.g. commit every 1000 rows) to avoid long undo / redo logs in case you're inserting with logging turned on. This can be done via the commitXXX() methods, as sketched below.
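For instance (a sketch against the NODES table from your question; the rate of 1000 is illustrative):

create.loadInto(NODES)
      // jOOQ 3.5+: commit every 1000 rows instead of one huge transaction
      .commitAfter(1000)
      .loadCSV(Files.newBufferedReader(nodesPath, UTF8))
      .fields(NODES.fields())
      .execute();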
In jOOQ 3.6+, you can also:

- set the bulk statement rate (e.g. combine 10 rows in a single statement) to drastically speed up execution, via the bulkXXX() methods
- set the batch statement rate (e.g. combine 10 statements in a single JDBC batch) to drastically speed up execution (see this blog post for details), via the batchXXX() methods

A combined sketch follows this list.
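Combined, a jOOQ 3.6+ load could then look like this (a sketch; the rates are illustrative, and the exact semantics of combining the three rates is documented in the manual):

create.loadInto(NODES)
      .bulkAfter(10)     // combine 10 rows in a single INSERT .. VALUES (..), (..) statement
      .batchAfter(10)    // combine 10 such statements in a single JDBC batch
      .commitAfter(10)   // commit after 10 batches
      .loadCSV(Files.newBufferedReader(nodesPath, UTF8))
      .fields(NODES.fields())
      .execute();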