Label
PostgreSQL, imgsmlr, GiST, image eigenvalues, wavelet transform
background
Continued
Parallel invocation using partition table + dblink asynchronous interface. (It's better to support imgsmlr gist index scan directly at the kernel level)
Partition Table + dblink Asynchronous Call Parallel
1. Create partition tables
create table t_img (id int primary key, sig signature) partition by hash (id);
2. Create 64 partitions
do language plpgsql $$ declare i int; begin for i in 0..63 loop execute format('create table t_img%s partition of t_img for values WITH (MODULUS 64, REMAINDER %s)', i, i); end loop; end; $$;
3. Creating the Index of Image Eigenvalue Fields
create index idx_t_img_1 on t_img using gist(sig);
4. Write 400 million random image eigenvalues
vi test.sql \set id random(1,2000000000) insert into t_img values (:id, gen_rand_img_sig(10)) on conflict(id) do nothing;
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -t 10000000
dblink asynchronous call encapsulation
1. Create dblink plug-ins
create extension if not exists dblink;
2. Create a connection function that does not report errors.
create or replace function conn( name, -- dblink Name text -- Connection string,URL ) returns void as $$ declare begin perform dblink_connect($1, $2); return; exception when others then return; end; $$ language plpgsql strict;
3. Write a function with input parameters of partition number and image eigenvalues. Open 64 parallel searches for each partition at the same time and return a most similar image record.
create or replace function parallel_img_search( v_mod int, -- Partition number v_sig signature, -- Image eigenvalue conn text default format('hostaddr=%s port=%s user=%s dbname=%s application_name=', '127.0.0.1', current_setting('port'), current_user, current_database()) -- dblink Connect ) returns setof record as $$ declare app_prefix text := 'abc'; sql text; ts1 timestamp; begin for i in 0..v_mod loop perform conn(app_prefix||i, conn||app_prefix||i); perform id,sig from dblink_get_result(app_prefix||i, false) as t(id int, sig signature); sql := format('select * from t_img%s order by sig <-> %L limit 1', i, v_sig); perform dblink_send_query(app_prefix||i, sql); end loop; ts1 := clock_timestamp(); for i in 0..v_mod loop return query select id,sig from dblink_get_result(app_prefix||i, false) as t(id int, sig signature); end loop; raise notice '%', clock_timestamp()-ts1; return; end; $$ language plpgsql strict;
4. Create a stable function to generate random image eigenvalues.
create or replace function get_rand_img_sig(int) returns signature as $$ select ('('||rtrim(ltrim(array(select (random()*$1)::float4 from generate_series(1,16))::text,'{'),'}')||')')::signature; $$ language sql strict stable;
Example
postgres=# select get_rand_img_sig(10); get_rand_img_sig ------------------------------------------------------------------------------------------------------------------------------------------------------------------ (3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810) (1 row) Time: 0.345 ms
5. Write about 298 million image eigenvalues.
postgres=# select count(*) from t_img; count ----------- 297915819 (1 row)
Asynchronous invocation of parallel queries for 64 partitions using dblink
Using dblink asynchronous call interface, query all partitions, time-consuming: 394 milliseconds
postgres=# select * from parallel_img_search(63, '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature) as t (id int, sig signature) order by sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature limit 1; NOTICE: 00:00:00.394257 id | sig ------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------ 1918283556 | (3.122560, 2.748080, 1.133250, 5.426950, 6.626340, 6.876810, 7.959190, 0.798523, 8.638600, 5.075110, 1.366100, 0.899454, 2.980070, 4.580630, 0.986704, 1.582110) (1 row) Time: 741.161 ms
Direct query of a single partition takes 238 milliseconds
postgres=# explain (analyze,verbose,timing,costs,buffers) select sig from t_img48 order by sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)' limit 1; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=0.36..0.37 rows=1 width=72) (actual time=231.287..231.288 rows=1 loops=1) Output: id, sig, ((sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature)) Buffers: shared hit=11881 -> Index Scan using t_img48_sig_idx on public.t_img48 (cost=0.36..41619.32 rows=4466603 width=72) (actual time=231.285..231.285 rows=1 loops=1) Output: id, sig, (sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature) Order By: (t_img48.sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature) Buffers: shared hit=11881 Planning Time: 0.060 ms Execution Time: 237.818 ms (9 rows) Time: 238.242 ms
Compared with the first document: 439 million images in a single table, it takes 4.2 seconds to search for images. Using dblink asynchronous interface (64 parallel, 298 million) to search for images takes 394 milliseconds, which improves the performance greatly.
Summary
Using dblink asynchronous invocation, it did not reach 238 milliseconds, but 394 milliseconds.
After using dblink asynchronous invocation, the indexed data is about 15 GB per second.
postgres=# select pg_size_pretty(11881*64*8192::numeric/0.394); pg_size_pretty ---------------- 15 GB (1 row) Time: 0.258 ms
The bottleneck may be on memory COPY.
Next, let's take a look at the use of citus multicomputers.