Regex and XML

07/07/2011, 14h01
rambc

Hello,
I'd like to know how to retrieve what lies between <attribut valeur = "5"> and </attribut> in <attribut valeur = "5">Du texte ou autre chose, peu importe...</attribut>.

I imagine I need to match groups.

Of course, I'm not trying to write an XML parser, just to understand regexes better.
07/07/2011, 15h19
mont29

Heh heh… As it happens, a few years ago I implemented a mini XML parser with regexes (in PHP, by the way…). :oops:

So, in fact, there are several cases:

* If there can't be another <attribut> element inside one of them, the regex is (very) simple: just scan the content in non-greedy mode, something like:

Code:

"<attribut ?[^>]*>(.*?)</attribut>"

* If the nesting depth is bounded and stays reasonable, you just add optional nested levels to the pattern; it's tedious but not too complicated either…

* On the other hand, if you want to handle arbitrary nesting depth… That's not possible with a pure regex in Python's standard re module (at least, as far as I know)! In PHP it is possible, thanks to an advanced feature, recursive regexes… But the resulting code is frankly indigestible! I can dig up what I came up with back then, if you're interested… :aie:
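
For the first case (no <attribut> nested inside another), a minimal Python sketch could look like the following. The helper name `inner_text` and the sample string are illustrative assumptions, and `re.DOTALL` is added here so that `.` also spans line breaks, which the bare pattern does not do on its own.

```python
import re

# Non-greedy pattern from the post: opening tag with optional attributes,
# then the shortest possible content up to the first closing tag.
PATTERN = re.compile(r"<attribut ?[^>]*>(.*?)</attribut>", re.DOTALL)


def inner_text(source):
    """Return the content of every <attribut> element (hypothetical helper)."""
    # With a single capturing group, findall returns that group's text.
    return PATTERN.findall(source)


sample = '<attribut valeur = "5">Du texte ou autre chose, peu importe...</attribut>'
print(inner_text(sample))
```

With nested <attribut> elements the match stops at the first closing tag it meets, which is exactly the limitation the next two cases are about.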
07/07/2011, 16h59
rambc

Thanks for answering my imprecise question: I forgot to mention that I don't know "attribut" in advance...
07/07/2011, 19h05
mont29

Well, that doesn't change much, does it? You just use .format() (or %…) to insert your tag name into the regex source before compiling it?

Or do you mean something like "I want the content of the first tag encountered, whatever it is"? In that case you do need a backreference:

Code:

"<(?P<tag>\S+) ?[^>]*>(.*?)</(?P=tag)>"

Here I used a named group to capture the tag name, but an unnamed group would have done just as well (the tag name is taken to be anything except whitespace). :)
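
Both suggestions can be sketched in Python roughly as follows. The helper `element_pattern` and the constant `ANY_ELEMENT` are names made up for this illustration; `re.escape` and `re.DOTALL` are additions not mentioned in the thread, the first to protect against regex metacharacters in the tag name, the second to let the content span several lines.

```python
import re


def element_pattern(tag):
    """Compile the non-greedy pattern for one known tag name (hypothetical helper)."""
    # The tag name is inserted with .format(); re.escape keeps it literal.
    return re.compile(r"<{0} ?[^>]*>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)


# Pattern from the post: the tag name is captured, then required again
# in the closing tag through the (?P=tag) backreference.
ANY_ELEMENT = re.compile(r"<(?P<tag>\S+) ?[^>]*>(.*?)</(?P=tag)>", re.DOTALL)

sample = '<attribut valeur = "5">Du texte</attribut>'

# Known tag name: the content is group 1.
print(element_pattern("attribut").search(sample).group(1))

# Unknown tag name: the name is group "tag", the content is group 2.
match = ANY_ELEMENT.search(sample)
print(match.group("tag"), "->", match.group(2))
```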
07/07/2011, 20h56
rambc

Thanks.

If I understand correctly, (?P<tag>…) defines a name that is then reused in (?P=tag).
07/07/2011, 21h40
mont29

That's right, (?P<tag>…) defines a named group called tag, which you can then reuse within the regex or retrieve from the match object (via .groupdict()).
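
As a small illustration of the match-object side, here is a sketch of how a named group can be read back. The second named group, `content`, is an assumption added for the example; it is not in the pattern given earlier in the thread.

```python
import re

# Same idea as the pattern above, with the content group also named
# (the name "content" is an illustrative choice).
pattern = re.compile(r"<(?P<tag>\S+) ?[^>]*>(?P<content>.*?)</(?P=tag)>")
match = pattern.search('<attribut valeur = "5">Du texte</attribut>')

print(match.group("tag"))   # access by name
print(match.group(1))       # named groups keep their positional number
print(match.groupdict())    # every named group, as a dict
```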
08/07/2011, 01h49
rambc

Thanks.

PS: could you give a short example of a recursive regex in PHP?

Well, sorry, I'm not sure a simple example exists… Here is one that retrieves the whole content of the outermost block, in a string such as:

Code:

"bla {lorem {ipsum} et {cae{tera}}} blablabla"

Code:

<?php
$regex_simple = '/'
	  // The initial open {.
	. '\{'
		  // Condition: If it is not a closing }…
		. '(?P<content>(?(?!\})'
			  // Condition: If it is an opening {, try the (recursive) sub-pattern matching…
			. '(?(?=\{)'
				  // The start of the recursive sub-pattern.
				. '(?P<subpattern>'
					  // A nested element of same kind as first opening one,
					. '\{'
						  // Condition: If it is not a closing tag…
						. '(?(?!\})'
							  // Condition: If it is an opening one, try recursive pattern…
							. '(?(?=\{)(?P>subpattern)'
							  // Else, consume as much chars as possible, until the next
							  // opening or closing tag…
							. '|((?:.(?!\{)(?!\}))*.))'
						  // And retest the “extern” condition.
						. ')*'
					  // The matching closing tag!
					. '\}'
				  // End of subpattern.
				. ')'
			  // Else, consume as much chars as possible, until the next
			  // opening or closing tag.
			. '|((?:.(?!\{)(?!\}))*.))'
		  // And retest the whole “extern” condition.
		. ')*)'
	  // Out-most closing tag.
	. '\}'
	. '/s';

Here is the outline:

Code:

Test for/detect the opening {.
    While what follows is not the closing } (as in {}) (look-ahead):
        If what follows opens a new block ({):
            Recursive part of the regex, which, in a loop:
                * Consumes content.
                * “Calls itself” as soon as it detects a new { (look-ahead).
                * “Returns” as soon as it detects a } (look-ahead).
        Otherwise, consume the content up to the next { or } (look-ahead).
And consume the final }!
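
Since Python's built-in re module has no recursive patterns, the same outline can be approximated there by letting a regex find the braces and tracking the nesting depth in ordinary code. This is only a sketch of that alternative, not a translation of the PHP regex; `outermost_block` is a hypothetical helper name.

```python
import re

# The regex only locates the braces; the nesting depth is tracked in Python.
BRACE = re.compile(r"[{}]")


def outermost_block(text):
    """Return the content of the first balanced {...} block, or None."""
    depth = 0
    start = None
    for m in BRACE.finditer(text):
        if m.group() == "{":
            if depth == 0:
                start = m.end()      # content begins right after the outer {
            depth += 1
        elif depth > 0:              # a } that closes an open block
            depth -= 1
            if depth == 0:
                return text[start:m.start()]
        # a stray } met before any { is simply skipped
    return None                      # no balanced block found


print(outermost_block("bla {lorem {ipsum} et {cae{tera}}} blablabla"))
```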

For the record, here is what it looked like for HTML ($tagname and $attributes are two sub-regexes matching the tag and/or attributes being searched for…):

Code:

// This regex will “parse” the piece of html code to extract the elements
// we need. Note that it is not “fully armored” against strange or invalid html!
// This is a VERY complex recursive regex (took me hours to make it work!).
// So let's detail it!
// Note: About opening elements: when no tag name is given, all opening
//       elements are tested. There seems to be a bug with elements
//       like “<hr />” (returns a white page, without any exception…).
//       So I added a negative look-ahead test to exclude these things
//       explicitly (even though their matching test should fail nicely!).
// WARNING: This seems to be a VERY sensitive regex, prone to strange errors
//          very quickly…
$regex = '/'
	  // The initial open element.
	. '<(?P<tagname>'.$tagname.')(?! *\/>)'.$attributes.'\s*>'
		  // Condition: If it is not a closing tag…
		. '(?P<content>(?(?!<\/(?P=tagname)>)'
			  // Condition: If it is an opening tag, try the (recursive) sub-pattern matching…
			. '(?(?=<(?P=tagname)(?: [^>]*|)>)'
				  // The start of the recursive sub-pattern.
				. '(?P<subpattern>'
					  // A nested element of same kind as first opening one,
					. '<(?P=tagname)(?: [^>]*|)>'
						  // Condition: If it is not a closing tag…
						. '(?(?!<\/(?P=tagname)>)'
							  // Condition: If it is an opening one, try recursive pattern…
							. '(?(?=<(?P=tagname)(?: [^>]*|)>)(?P>subpattern)'
							  // Else, consume as much chars as possible, until the next
							  // opening or closing tag…
							. '|((?:.(?!<(?P=tagname)(?: [^>]*|)>)(?!<\/(?P=tagname)>))*.))'
						  // And retest the “extern” condition.
						. ')*'
					  // The matching closing tag!
					. '<\/(?P=tagname)>'//.*?'
				  // End of subpattern.
				. ')'
			  // Else, consume as much chars as possible, until the next
			  // opening or closing tag.
			. '|((?:.(?!<(?P=tagname)(?: [^>]*|)>)(?!<\/(?P=tagname)>))*.))'
		  // And retest the whole “extern” condition.
		. ')*)'
	  // Out-most closing tag.
	. '<\/(?P=tagname)>'
	. '/s';

:aie: :mouarf:
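
For comparison, a rough Python counterpart for nested elements of one known tag name might use a regex only to tokenize opening and closing tags and a counter for the depth. This is a sketch under simplifying assumptions: `outermost_element` is a hypothetical helper, it takes a literal tag name rather than a sub-regex like `$tagname`, and it ignores comments, CDATA sections and attribute filtering.

```python
import re


def outermost_element(text, tag):
    """Return the content of the first balanced <tag>...</tag> element, or None."""
    name = re.escape(tag)
    # Group 1 is "/" for a closing tag and empty for an opening one.
    # The (?<!/) look-behind skips self-closing forms such as <hr />.
    token = re.compile(r"<(/?){0}(?:\s[^>]*)?(?<!/)>".format(name))
    depth = 0
    start = None
    for m in token.finditer(text):
        if not m.group(1):               # opening tag
            if depth == 0:
                start = m.end()          # content starts after the outer tag
            depth += 1
        elif depth > 0:                  # closing tag of an open element
            depth -= 1
            if depth == 0:
                return text[start:m.start()]
    return None                          # no balanced element found


html = '<p>x <div class="a">one <div>two</div> three</div> y <div>z</div>'
print(outermost_element(html, "div"))
```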

09/07/2011, 19h20
rambc

Thanks.