Performance testing and optimization of PostgreSQL 11 similar image search plug-in imgsmlr 2-single machine partition table (dblink asynchronous call parallel) (400 million images)

Posted by synapp2 on Wed, 15 May 2019 10:27:10 +0200

Label

PostgreSQL, imgsmlr, GiST, image eigenvalues, wavelet transform

background

Continued

"PostgreSQL 11 Similar Image Search Plug-in imgsmlr Performance Test 1 - Single Table (400 million images)"

Parallel invocation using partition table + dblink asynchronous interface. (It's better to support imgsmlr gist index scan directly at the kernel level)

Partition Table + dblink Asynchronous Call Parallel

1. Create partition tables

create table t_img (id int primary key, sig signature) partition by hash (id);    

2. Create 64 partitions

do language plpgsql $$  
declare  
  i int;  
begin  
  for i in 0..63   
  loop  
    execute format('create table t_img%s partition of t_img for values WITH (MODULUS 64, REMAINDER %s)', i, i);   
  end loop;  
end;  
$$;  

3. Creating the Index of Image Eigenvalue Fields

create index idx_t_img_1 on t_img using gist(sig);  

4. Write 400 million random image eigenvalues

vi test.sql  
\set id random(1,2000000000)  
insert into t_img values (:id, gen_rand_img_sig(10)) on conflict(id) do nothing;  
pgbench -M prepared -n -r -P 1 -f ./test.sql -c 64 -j 64 -t 10000000  

dblink asynchronous call encapsulation

1. Create dblink plug-ins

create extension if not exists dblink;      

2. Create a connection function that does not report errors.

create or replace function conn(        
  name,   -- dblink Name        
  text    -- Connection string,URL        
) returns void as $$          
declare          
begin          
  perform dblink_connect($1, $2);         
  return;          
exception when others then          
  return;          
end;          
$$ language plpgsql strict;      

3. Write a function with input parameters of partition number and image eigenvalues. Open 64 parallel searches for each partition at the same time and return a most similar image record.

create or replace function parallel_img_search(  
  v_mod int,   -- Partition number  
  v_sig signature,  -- Image eigenvalue  
  conn text default format('hostaddr=%s port=%s user=%s dbname=%s application_name=', '127.0.0.1', current_setting('port'), current_user, current_database())  -- dblink Connect  
)  
returns setof record as  
$$  
declare  
  app_prefix text := 'abc';     
  sql text;  
  ts1 timestamp;  
begin  
  for i in 0..v_mod loop  
    perform conn(app_prefix||i,  conn||app_prefix||i);   
    perform id,sig from dblink_get_result(app_prefix||i, false) as t(id int, sig signature);   
    sql := format('select * from t_img%s order by sig <-> %L limit 1', i, v_sig);  
    perform dblink_send_query(app_prefix||i, sql);     
  end loop;  
  
  ts1 := clock_timestamp();  
  for i in 0..v_mod loop  
    return query select id,sig from dblink_get_result(app_prefix||i, false) as t(id int, sig signature);    
  end loop;  
  raise notice '%', clock_timestamp()-ts1;  
    
  return;  
end;  
$$ language plpgsql strict;  

4. Create a stable function to generate random image eigenvalues.

create or replace function get_rand_img_sig(int) returns signature as $$  
  select ('('||rtrim(ltrim(array(select (random()*$1)::float4 from generate_series(1,16))::text,'{'),'}')||')')::signature;  
$$ language sql strict stable;  

Example

postgres=# select get_rand_img_sig(10);  
                                                                         get_rand_img_sig                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 (3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)  
(1 row)  
  
Time: 0.345 ms  

5. Write about 298 million image eigenvalues.

postgres=# select count(*) from t_img;  
   count     
-----------  
 297915819  
(1 row)  

Asynchronous invocation of parallel queries for 64 partitions using dblink

Using dblink asynchronous call interface, query all partitions, time-consuming: 394 milliseconds

postgres=# select * from  parallel_img_search(63, '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature) as t (id int, sig signature) order by sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature limit 1;  
  
  
NOTICE:  00:00:00.394257  
     id     |                                                                               sig                                                                                  
------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 1918283556 | (3.122560, 2.748080, 1.133250, 5.426950, 6.626340, 6.876810, 7.959190, 0.798523, 8.638600, 5.075110, 1.366100, 0.899454, 2.980070, 4.580630, 0.986704, 1.582110)  
(1 row)  
  
Time: 741.161 ms  

Direct query of a single partition takes 238 milliseconds

postgres=# explain (analyze,verbose,timing,costs,buffers) select sig from t_img48 order by sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)' limit 1;  
                                                                                                    QUERY PLAN                                                                                                       
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.36..0.37 rows=1 width=72) (actual time=231.287..231.288 rows=1 loops=1)  
   Output: id, sig, ((sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature))  
   Buffers: shared hit=11881  
   ->  Index Scan using t_img48_sig_idx on public.t_img48  (cost=0.36..41619.32 rows=4466603 width=72) (actual time=231.285..231.285 rows=1 loops=1)  
         Output: id, sig, (sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature)  
         Order By: (t_img48.sig <-> '(3.970030, 2.340900, 0.946223, 5.951010, 6.560340, 7.922950, 6.646290, 0.430310, 7.690120, 5.799870, 1.337850, 1.319830, 3.178170, 6.439380, 0.925341, 2.215810)'::signature)  
         Buffers: shared hit=11881  
 Planning Time: 0.060 ms  
 Execution Time: 237.818 ms  
(9 rows)  
  
Time: 238.242 ms  

Compared with the first document: 439 million images in a single table, it takes 4.2 seconds to search for images. Using dblink asynchronous interface (64 parallel, 298 million) to search for images takes 394 milliseconds, which improves the performance greatly.

Summary

Using dblink asynchronous invocation, it did not reach 238 milliseconds, but 394 milliseconds.

After using dblink asynchronous invocation, the indexed data is about 15 GB per second.

postgres=# select pg_size_pretty(11881*64*8192::numeric/0.394);  
 pg_size_pretty   
----------------  
 15 GB  
(1 row)  
  
Time: 0.258 ms  

The bottleneck may be on memory COPY.

Next, let's take a look at the use of citus multicomputers.

Reference resources

https://github.com/postgrespro/imgsmlr

PostgreSQL Similar Search Plug-ins Introduction Summary (rum,pg_trgm,smlar,imgsmlr,pg_similarity) (rum,gin,gist))

PostgreSQL dblink Asynchronous Call Practice, Running Parallel Multitasks - For example, Opening N Parallel Background Tasks to Create Index, Opening N Background Tasks to Run Several SQL

Topics: Database SQL PostgreSQL github