Optimisation requête avec Group BY sur 600 000 lignes

**kimaidou** · 03/03/2011, 11h13

Bonjour à tous,

J'ai une table tbl_clients qui contient environ 600 000 lignes (nombre qui va probablement doubler d'ici 1 an)

Voici les infos complètes :

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
 
-- Table: tbl_clients
 
-- DROP TABLE tbl_clients;
 
CREATE TABLE tbl_clients
(
  id_tbl_clients integer NOT NULL DEFAULT nextval(('seq_clients'::text)::regclass),
  ref_client character varying(8),
  title character varying(8),
  f_name character varying(30),
  l_name character varying(30),
  company character varying(50),
  address_1 character varying(50),
  address_2 character varying(50),
  address_3 character varying(50),
  postcode character varying(10),
  town character varying(30),
  country character varying(30),
  client_geolocal point NOT NULL DEFAULT '(0,0)'::point,
  siret character varying(15),
  cr_rating character varying(5),
  restriction character varying(15),
  access_id integer,
  contact_details character varying(110),
  date_created timestamp without time zone NOT NULL DEFAULT (('now'::text)::timestamp(6) with time zone)::timestamp(0) without time zone,
  created_by name NOT NULL DEFAULT "current_user"(),
  credit_terms smallint DEFAULT 0,
  the_geom geometry,
  usr_login character varying,
  usr_password character varying,
  usr_auth character varying,
  geo_adresse character varying,
  geo_codepostal integer,
  geo_ville character varying,
  geo_note character varying,
  geo_lat double precision,
  geo_lng double precision,
  parent_id integer,
  poi boolean,
  geo_status character varying,
  addr_numrue character varying,
  poi_name character varying(100),
  id_tbl_cat_poi integer,
  osm_id integer,
  usr_webcode character varying(20),
  statusnet_id bigint,
  addr_bister text,
  draft text,
  geo_numrue integer,
  geo_bister text,
  ad_properties bit(8) NOT NULL DEFAULT B'11111111'::"bit",
  message_properties bit(4) NOT NULL DEFAULT B'1111'::"bit",
  poi_properties bit(4) NOT NULL DEFAULT B'0000'::"bit",
  last_activity timestamp without time zone DEFAULT now(),
  CONSTRAINT enforce_dims_the_geom CHECK (ndims(the_geom) = 2),
  CONSTRAINT enforce_geotype_the_geom CHECK (geometrytype(the_geom) = 'POINT'::text OR the_geom IS NULL),
  CONSTRAINT enforce_srid_the_geom CHECK (srid(the_geom) = 4326)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE tbl_clients OWNER TO operators;
COMMENT ON TABLE tbl_clients IS 'sdfsdfsdf';
 
-- Index: inx_clients_company
 
-- DROP INDEX inx_clients_company;
 
CREATE INDEX inx_clients_company
  ON tbl_clients
  USING btree
  (l_name);
 
-- Index: inx_clients_idtblcatpoi
 
-- DROP INDEX inx_clients_idtblcatpoi;
 
CREATE INDEX inx_clients_idtblcatpoi
  ON tbl_clients
  USING btree
  (id_tbl_cat_poi);
 
-- Index: inx_clients_l_name
 
-- DROP INDEX inx_clients_l_name;
 
CREATE INDEX inx_clients_l_name
  ON tbl_clients
  USING btree
  (l_name);
 
-- Index: inx_clients_osm_id
 
-- DROP INDEX inx_clients_osm_id;
 
CREATE INDEX inx_clients_osm_id
  ON tbl_clients
  USING btree
  (osm_id);
 
-- Index: inx_clients_ref_client
 
-- DROP INDEX inx_clients_ref_client;
 
CREATE UNIQUE INDEX inx_clients_ref_client
  ON tbl_clients
  USING btree
  (ref_client);
 
-- Index: inx_clients_restriction
 
-- DROP INDEX inx_clients_restriction;
 
CREATE INDEX inx_clients_restriction
  ON tbl_clients
  USING btree
  (restriction);
 
-- Index: inx_clients_the_geom
 
-- DROP INDEX inx_clients_the_geom;
 
CREATE INDEX inx_clients_the_geom
  ON tbl_clients
  USING gist
  (the_geom);
 
-- Index: tbl_clients_statusnet_id_idx
 
-- DROP INDEX tbl_clients_statusnet_id_idx;
 
CREATE INDEX tbl_clients_statusnet_id_idx
  ON tbl_clients
  USING btree
  (statusnet_id);
 
 
-- Trigger: update_insert_clients on tbl_clients
 
-- DROP TRIGGER update_insert_clients ON tbl_clients;
 
CREATE TRIGGER update_insert_clients
  BEFORE INSERT OR UPDATE
  ON tbl_clients
  FOR EACH ROW
  EXECUTE PROCEDURE trigger_set_client_properties();

La colonne id_tbl_cat_poi est un entier qui est lié à la table tbl_cat_poi suivante (environ 80 lignes, bougera peu)

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
CREATE TABLE tbl_cat_poi
(
  id_tbl_cat_poi serial NOT NULL,
  parent_id smallint NOT NULL DEFAULT 0,
  item_level smallint NOT NULL DEFAULT 0,
  item_text character varying(50) NOT NULL DEFAULT ''::character varying,
  ad_properties bit(8) NOT NULL DEFAULT B'00011111'::"bit",
  message_properties bit(4) NOT NULL DEFAULT B'0001'::"bit",
  poi_properties bit(4) NOT NULL DEFAULT B'1111'::"bit",
  sort_order smallint NOT NULL DEFAULT 0,
  nb_to_display integer,
  min_distance integer,
  max_distance integer,
  CONSTRAINT tbl_cat_poi_pkey PRIMARY KEY (id_tbl_cat_poi)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE tbl_cat_poi OWNER TO operators;

La requête suivante , qui récupère les informations de la catégorie et le nombre de client par catégorie, est gourmande en ressource

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
 
SELECT cat.item_text, cat.id_tbl_cat_poi, cat.parent_id, count(id_tbl_clients) as nombre 
FROM tbl_clients tc INNER JOIN tbl_cat_poi cat ON tc.id_tbl_cat_poi = cat.id_tbl_cat_poi 
WHERE 2>1 
AND tc.id_tbl_clients != 699317 
AND DISTANCE( TRANSFORM(tc.the_geom, 27582) , TRANSFORM(ST_SetSRID(ST_MakePoint(3.8853803239425,43.598034187644),4326), 27582) ) <= 5000 
AND ST_DWithin(tc.the_geom, ST_SetSRID(ST_MakePoint(3.8853803239425,43.598034187644),4326) ,0.4) 
AND tc.poi_properties & b'0000' = b'0000' 
GROUP BY cat.item_text, cat.id_tbl_cat_poi, cat.parent_id 
ORDER BY cat.item_text ;

Voici ce que donne l'explain :

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
GroupAggregate  (cost=243345.49..243345.52 rows=1 width=27)
  ->  Sort  (cost=243345.49..243345.49 rows=1 width=27)
        Sort Key: cat.item_text, cat.id_tbl_cat_poi, cat.parent_id
        ->  Nested Loop  (cost=12608.71..243345.48 rows=1 width=27)
              Join Filter: (tc.id_tbl_cat_poi = cat.id_tbl_cat_poi)
              ->  Bitmap Heap Scan on tbl_clients tc  (cost=12608.71..243342.81 rows=1 width=8)
                    Recheck Cond: (the_geom && '0103000020E61000000100000005000000AFA34B140FE20B40E0B52B2F59994540AFA34B140FE20B40461C9295BFFF45400B0559BD3A241140461C9295BFFF45400B0559BD3A241140E0B52B2F59994540AFA34B140FE20B40E0B52B2F59994540'::geometry)
                    Filter: ((id_tbl_clients <> 699317) AND ('0101000020E6100000E2D67E4742150F4013E95E628CCC4540'::geometry && st_expand(the_geom, 0.4::double precision)) AND ((poi_properties & B'0000'::"bit") = B'0000'::"bit") AND _st_dwithin(the_geom, '0101000020E6100000E2D67E4742150F4013E95E628CCC4540'::geometry, 0.4::double precision) AND (distance(transform(the_geom, 27582), '0101000020BE6B00003558DB6BD8212641639D4E1DFD273C41'::geometry) <= 5000::double precision))
                    ->  Bitmap Index Scan on inx_clients_the_geom  (cost=0.00..12608.71 rows=338170 width=0)
                          Index Cond: (the_geom && '0103000020E61000000100000005000000AFA34B140FE20B40E0B52B2F59994540AFA34B140FE20B40461C9295BFFF45400B0559BD3A241140461C9295BFFF45400B0559BD3A241140E0B52B2F59994540AFA34B140FE20B40E0B52B2F59994540'::geometry)
              ->  Seq Scan on tbl_cat_poi cat  (cost=0.00..1.74 rows=74 width=23)

Lorsque je lance cette requête, elle prend environ 2 secondes avec un work_mem "optimisé" pour cette requête.
J'ai essayé d'optimiser via le work_meme : j'ai fait un test en ajoutant un
SET work_mem TO '10MB';
avant de lancer la requête. Voici les résultats en secondes : colonne 1 = valeur du work_mem, colonne 2 = millisecondes)

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

J'ai donc choisi la valeur 10MB.

shared_buffer est à 24MB

Je ne sais pas bien comment optimiser cette requête. Elle est importante car elle me sert à créer un arbre des catégories en PHP pour un site internet. Elle est donc utilisée souvent.

J'ai "pensé" au work_mem et au shared_buffer à force de parcourir les messages liés à l'optimisation sur le forum, mais je ne suis pas expert.

Sinon j'ai aussi fait un VACUUM, un ANALYSE, et même un REINDEX TABLE tbl_clients.

En espérant vous avoir donné les informations nécessaires pour que vous puissiez m'aider.

Kimaidou

**kimaidou** · 03/03/2011, 13h40

Je poste ici le résultat de l'explain analyse (qui prends 13 secondes, alors que la requête simple prend 2s)

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
 
GroupAggregate  (cost=243345.49..243345.52 rows=1 width=27) (actual time=13797.253..13955.479 rows=13 loops=1)
  ->  Sort  (cost=243345.49..243345.49 rows=1 width=27) (actual time=13797.233..13875.691 rows=68664 loops=1)
        Sort Key: cat.item_text, cat.id_tbl_cat_poi, cat.parent_id
        Sort Method:  quicksort  Memory: 8460kB
        ->  Nested Loop  (cost=12608.71..243345.48 rows=1 width=27) (actual time=68.167..13606.351 rows=68664 loops=1)
              Join Filter: (tc.id_tbl_cat_poi = cat.id_tbl_cat_poi)
              ->  Bitmap Heap Scan on tbl_clients tc  (cost=12608.71..243342.81 rows=1 width=8) (actual time=68.153..1549.290 rows=68664 loops=1)
                    Recheck Cond: (the_geom && '0103000020E61000000100000005000000AFA34B140FE20B40E0B52B2F59994540AFA34B140FE20B40461C9295BFFF45400B0559BD3A241140461C9295BFFF45400B0559BD3A241140E0B52B2F59994540AFA34B140FE20B40E0B52B2F59994540'::geometry)
                    Filter: ((id_tbl_clients <> 699317) AND ('0101000020E6100000E2D67E4742150F4013E95E628CCC4540'::geometry && st_expand(the_geom, 0.4::double precision)) AND ((poi_properties & B'0000'::"bit") = B'0000'::"bit") AND _st_dwithin(the_geom, '0101000020E6100000E2D67E4742150F4013E95E628CCC4540'::geometry, 0.4::double precision) AND (distance(transform(the_geom, 27582), '0101000020BE6B00003558DB6BD8212641639D4E1DFD273C41'::geometry) <= 5000::double precision))
                    ->  Bitmap Index Scan on inx_clients_the_geom  (cost=0.00..12608.71 rows=338170 width=0) (actual time=60.821..60.821 rows=212108 loops=1)
                          Index Cond: (the_geom && '0103000020E61000000100000005000000AFA34B140FE20B40E0B52B2F59994540AFA34B140FE20B40461C9295BFFF45400B0559BD3A241140461C9295BFFF45400B0559BD3A241140E0B52B2F59994540AFA34B140FE20B40E0B52B2F59994540'::geometry)
              ->  Seq Scan on tbl_cat_poi cat  (cost=0.00..1.74 rows=74 width=23) (actual time=0.002..0.086 rows=74 loops=68664)
Total runtime: 13957.221 ms

Je dois passer à côté de quelque chose, non ? Car je ne suis sûrement pas le seul à faire ce genre de requête très classique. Je ne comprends pas ce qui plombe les performances.

**SQLpro** · 05/03/2011, 01h19

Dès lors que vous utilisez des données et fonctions non relationnelles comme celles du spatial et à moins que le prédicat dans lequel figure les expressions spatiales soit "sargeable", vous ne pourrez prétendre à des performances.
Or ce n'est pas le cas de deux de vos prédicats :

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

DISTANCE( TRANSFORM(tc.the_geom, 27582) , TRANSFORM(ST_SetSRID(ST_MakePoint(3.8853803239425,43.598034187644),4326), 27582) ) <= 5000

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

ST_DWithin(tc.the_geom, ST_SetSRID(ST_MakePoint(3.8853803239425,43.598034187644),4326) ,0.4)

retirez les et constatez !

A +

**estofilo** · 05/03/2011, 13h01

Quelques remarques:
- shared_buffers à 24Mb c'est vraiment peu. cet espace est alloué une seule fois au démarrage de postgres et on y met plutôt une valeur du quart à la moitié de la mémoire physique, suivant que le serveur est dédié au sgbd ou pas.

- cette ligne-là:

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

Nested Loop  (cost=12608.71..243345.48 rows=1 width=27) (actual time=68.167..13606.351 rows=68664 loops=1)

montre que l'optimiseur estime qu'il y a une seule ligne de résultat interrmédiaire alors qu'il y en a 68664. Il est possible que ça le mène à un mauvais choix de stratégie.

- si tu fais la même requête après avoir passé:

Code :

Sélectionner tout - Visualiser dans une fenêtre à part

set enable_bitmapscan=false;

est-ce que les résultats sont meilleurs? (à exécuter 2 fois pour bénéficier du cache disque).
Ca va probablement provoquer un scan séquentiel sur la table tbl_clients ce qui est peut-être plus adapté, à voir.

Optimisation requête avec Group BY sur 600 000 lignes

Requêtes PostgreSQL

Discussions similaires

Partager

Partager