Problem retrieving information from identical HTML tags

**stansoad0108** · 02/04/2008, 16h40

Hello everyone,

My goal is to retrieve the information found between tags using a parser. (The parser itself works; no problem on that front.)

However, I noticed that a lot of the text sits between `<td>...</td>` tags. Unfortunately, this information does not all belong to the same subject, and I would like to sort it!

Example (this is only part of my program; the rest works):

Code:

sub start_rtn {
	my ($tag, $attr) = @_;

	if ($tag =~ /^td$/){
		$flag = 6;
	}

	if ($tag =~ /^b$/){
		$flag = 7;
	}

	if ($tag =~ /^td$/){
		$flag = 8;
	}

	if ($tag =~ /^td$/){
		$flag = 10;
	}
}

sub text_rtn {
	my ($text) = @_;
	$text =~ s/\n/ /g;

	if($flag == 6 && (($text eq 'Intron') || ($text eq 'Exon') || ($text eq 'NA') || ($text =~ /^[0-9]' UTR$/))){
		print "Feature : $text\n";
	}

	if($flag == 7 && ($text =~ /^[A-Z]{1}[a-z]{2}\/[A-Z]{1}[a-z]{2}$/ || $text =~ /^[A-Z]{1}[a-z]{2}$/)){
		print "Amino Acid Translation : $text\n";
	}

	if($flag == 8 && ($text =~ /^[A-Z]{1}[a-z]{2}\/[A-Z]{1}[a-z]{2}$/ || $text =~ /^[A-Z]{1}[a-z]{2}$/)){
		print "Amino Acid Translation : $text\n";
	}

	if($flag == 10 && ($text =~ /^[0-9]/ && $text ne '3\' UTR' && $text ne '5\' UTR')){
		print "Number of Chromosomes : $text\n";
	}
}

sub end_rtn {
	my ($tag) = @_;

	if ($tag =~ /^\/td$/ && ($flag == 6 || $flag == 8 || $flag == 10 || $flag == 11)){
		$flag = 0;
	}
}

My problem is that I don't know how to juggle the different `td` tags and the `if` tests in my start_rtn so as to retrieve the information for each kind of subject. Thanks for your comments.

Let me know if you need more details, or if you want my whole code as well!
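
For anyone reading along, the symptom can probably be reproduced without the parser at all. Every `td` tag satisfies all three `td` tests in start_rtn, so the last assignment wins and the flag never stays at 6 or 8. A minimal sketch, where the helper name `flag_for_tag` is made up for the illustration and simply mirrors the tests above:

```perl
use strict;
use warnings;

# Hypothetical helper: same sequence of tests as start_rtn in the excerpt,
# but returning the flag instead of storing it in a global.
sub flag_for_tag {
    my ($tag) = @_;
    my $flag = 0;
    $flag = 6  if $tag =~ /^td$/;   # meant for "Feature"
    $flag = 7  if $tag =~ /^b$/;
    $flag = 8  if $tag =~ /^td$/;   # overwrites the 6 set above
    $flag = 10 if $tag =~ /^td$/;   # overwrites the 8 set above
    return $flag;
}

print flag_for_tag('td'), "\n";   # prints 10
print flag_for_tag('b'),  "\n";   # prints 7
```

So the tag name alone cannot distinguish the cells; something else (the cell's position in its row, or the shape of its text) has to do that job.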

**stansoad0108** · 02/04/2008, 16h45

I'm adding my full code, just in case...

Code:

#!/usr/bin/perl
use strict; 
use warnings;
 
use HTML::Parser;
use LWP::Simple;
 
#Variables
my $baseurl = 'http://www.pharmgkb.org/views/reports/loadVariantReports.action?geneId=';
my $flag = 0;
 
die "usage: $0 role code\n"
    if @ARGV != 1;
 
my $code = $ARGV[0];
 
print "$code\n";
 
# URL of the page to parse
my $url = $baseurl.$code;
my $page = get($url); 
print "$url\n";
 
# Array that will hold the menu entries
my @tab_menu;
 
#Parser
 
my $parser = HTML::Parser->new(start_h => [\&start_rtn,"tag, attr"],
                text_h => [\&text_rtn, "text"],
                end_h => [\&end_rtn, "tag"]
                );
 
sub start_rtn {
 
	my ($tag, $attr) = @_;
 
	if ($tag =~ /^title$/){		# Start tag for the page title
        	$flag = 1;
   	}
 
	if($tag =~ /^th$/){		# Start tag used to write out the menu
		$flag = 2;
	}
 
	if ($tag =~ /^a$/ 		
   	 	and defined $attr->{href}  
    		and $attr->{href} =~ /^\/redirect\.jsp\?p=http%3A%2F%2Fgenome\.ucsc\.edu%2Fcgi-bin%2FhgTracks%3Fposition%3D/ 
   		and defined $attr->{target} 
    		and $attr->{target} =~ /^offsite$/){    
  		$flag = 3;
	}
 
	if ($tag =~ /^a$/
		and defined $attr->{href}  
    		and $attr->{href} =~ /^\/redirect\.jsp\?p=http%3A%2F%2Fwww\.ncbi\.nlm\.nih\.gov%2Fentrez%2Fquery./ 
   		and defined $attr->{target} 
    		and $attr->{target} =~ /^offsite$/){    
  		$flag = 4;
	}
 
	if ($tag =~ /^a$/
		and defined $attr->{href}
		and $attr->{href} =~ /^\/views\/reports\/loadVariantReport/){
		$flag = 5;
	}   
 
	if ($tag =~ /^td$/){		# Tag meant to capture the text related to "Feature"
		$flag = 6;
	}
 
	if ($tag =~ /^b$/){
		$flag = 7;
	}
 
	if ($tag =~ /^td$/){		# Tag meant to capture the text related to "Amino Acid Translation"
		$flag = 8;
	}
 
	if ($tag =~ /^a$/ 
		and defined $attr->{href} 
		and $attr->{href} =~ /^\/views\/reports\/loadFrequencyInSampleSets/){
		$flag = 9;            
	}
 
	if ($tag =~ /^td$/){		# Tag meant to capture the text related to "Number of Chromosomes"
		$flag = 10;
	}
 
	if ($tag =~ /^td$/){		# Tag meant to capture the text related to "Assay types"
		$flag = 11;
	}
}	
 
sub text_rtn {
 
    	my ($text) = @_;
   	$text =~ s/\n/ /g;
 
	if($flag == 1){                
        	print "Le titre : $text \n\n";
    	}
 
	if($flag == 2){                
         	print "MENU : $text \n";
		push(@tab_menu, $text);	
    	}
 
	if($flag == 3){                
         	print "GP POSITION: $text \n";
    	}
 
	if($flag == 4){                
       		print "dbSNP Id: $text \n";
    	}
 
	if($flag == 5){
		print "Variant : $text\n";
    	}
 
	if($flag == 6 && (($text eq 'Intron') || ($text eq 'Exon') || ($text eq 'NA') || ($text =~ /^[0-9]' UTR$/))){
		print "Feature : $text\n" ;
	}
 
	if($flag == 7 && ($text =~ /^[A-Z]{1}[a-z]{2}\/[A-Z]{1}[a-z]{2}$/ || $text =~ /^[A-Z]{1}[a-z]{2}$/)){
		print "Amino Acid Translation : $text\n";
	}
 
	if($flag == 8 && ($text =~ /^[A-Z]{1}[a-z]{2}\/[A-Z]{1}[a-z]{2}$/ || $text =~ /^[A-Z]{1}[a-z]{2}$/)){
		print "Amino Acid Translation : $text\n";
    	}
 
	if($flag == 9){
		print "Frequency : $text\n";
    	}
 
	if($flag == 10 && ($text =~ /^[0-9]/ && $text ne '3\' UTR' && $text ne '5\' UTR')){
		print "Number of Chromosomes : $text\n";
	}
 
	if($flag==11 && ($text =~ /^[A-Z][^0-9]{3,}$/ && $text ne 'Exon' && $text ne 'Intron' && $text ne 'View')) {
		print "Assay types : $text\n";
    	}
}
 
sub end_rtn {
     my ($tag) = @_;
 
	if ($tag =~ /^\/title$/){
              $flag = 0;
	}
 
	if ($tag eq "/html") {
    		my $Text_menu = join(" \n ", @tab_menu);
    		print"===> MENU : \n $Text_menu\n";
	}
 
	if ($tag =~ /^\/th$/){
		$flag = 0;
	}
 
	if ($tag =~ /^\/a$/ && ($flag == 3 || $flag == 4 || $flag == 5 || $flag == 9)){
               $flag = 0;
	}
 
	if ($tag =~ /^\/td$/ && ($flag == 6 || $flag == 8 || $flag == 10|| $flag == 11)){
		$flag = 0;
	}
 
	if ($tag =~ /^\/b$/ && ($flag == 7)){
		$flag = 0;
	}
 
}
 
#start parsing
$parser->parse($page);
 
#end parser
$parser->eof;

**PerlPicker** · 02/04/2008, 17h33

Your code isn't very easy to follow with those numeric flags!
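
One way to make the states easier to read might be to name them. This is only a sketch: the constant names and the `state_for_tag` helper are invented for the illustration, and only the numeric values 1, 2 and 6 come from the script above.

```perl
use strict;
use warnings;

# Illustrative names; the original script uses bare numbers.
use constant {
    STATE_NONE  => 0,
    STATE_TITLE => 1,
    STATE_MENU  => 2,
    STATE_CELL  => 6,
};

# One entry per tag, so a tag can never be mapped to two states by accident.
my %STATE_FOR = (
    title => STATE_TITLE,
    th    => STATE_MENU,
    td    => STATE_CELL,
);

sub state_for_tag {
    my ($tag) = @_;
    return $STATE_FOR{$tag} // STATE_NONE;
}

print state_for_tag('td'), "\n";   # prints 6
```

With a lookup table, a test such as `$flag == STATE_CELL` says what it is checking, and the table makes it obvious that `td` can only ever carry one state.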

Do you really need an HTML parser? Fetching the whole page with LWP::Simple and analysing it with regular expressions is often simpler and more readable. (A very personal opinion!)

Good luck

**stansoad0108** · 03/04/2008, 11h39

Hello,

I found another way around my problem: I nested several `if` tests inside other `if` tests... and it WORKS.

I also tried your technique, PerlPicker, but it was really tough!
Thanks all the same for your help...
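
The idea of deciding by the shape of the text can also be written as one small function that returns on the first match, which avoids most of the `ne` exclusions. A rough sketch only: `classify_cell` is a made-up name, the patterns are adapted from the tests in this thread, and the sample values in the loop are invented rather than taken from the real page.

```perl
use strict;
use warnings;

# Navigation words that look like assay types but are not (list taken from
# the exclusions used in the script in this thread).
my %NOT_ASSAY = map { $_ => 1 } qw(View Queries Drugs Genes Diseases Pathways);

# Return a label for the text of a <td> cell, or undef if it is not wanted.
# The order of the tests matters: the first match wins.
sub classify_cell {
    my ($text) = @_;
    return 'Feature'
        if $text =~ /^(?:Exon|Intron|NA|[0-9]' UTR)$/;
    return 'Amino Acid Translation'
        if $text =~ /^[A-Z][a-z]{2}(?:\/[A-Z][a-z]{2})?$/;
    return 'Number of Chromosomes'
        if $text =~ /^[0-9]/;
    return if $NOT_ASSAY{$text};
    return 'Assay types'
        if $text =~ /^[A-Z][a-z]{3,}(?:, [A-Z][a-z]{3,})?$/;
    return;
}

# Invented sample values, just to exercise each branch.
for my $cell ('Exon', "3' UTR", 'Ala/Val', '120', 'Sequencing', 'View') {
    my $label = classify_cell($cell) // 'ignored';
    print "$label : $cell\n";
}
```

In text_rtn this would reduce the `$flag == 6` branches to a single call followed by one `print`.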

I'm posting my code, just in case:

Code:

#!/usr/bin/perl
use strict; 
use warnings;
 
use HTML::Parser;
use LWP::Simple;
 
#Variables
my $baseurl = 'http://www.pharmgkb.org/views/reports/loadVariantReports.action?geneId=';
my $flag = 0;
my $red = 0;
my $url_GP_Postion;
 
die "usage: $0 role code\n"
    if @ARGV != 1;
 
my $code = $ARGV[0];
 
print "$code\n";
 
# URL of the page to parse
my $url = $baseurl.$code;
my $page = get($url); 
print "$url\n";
 
# Array that will hold the menu entries
my @tab_menu;
 
#Parser
my $parser = HTML::Parser->new(start_h => [\&start_rtn,"tag, attr"],
                text_h => [\&text_rtn, "text"],
                end_h => [\&end_rtn, "tag"]
                );
 
#**************************************************************************************** 
sub start_rtn {
	my ($tag, $attr) = @_;
 
#$Flag = 1 (Title)
	if ($tag =~ /^title$/){	
        	$flag = 1;
   	}
 
#$Flag = 2 (Menu)
	if($tag =~ /^th$/){		
		$flag = 2;
	}
 
#$Flag = 3 (GP Position)
	if ($tag =~ /^a$/ 		
		and defined $attr->{href}  
    		and $attr->{href} =~ /^\/redirect\.jsp\?p=http%3A%2F%2Fgenome\.ucsc\.edu%2Fcgi-bin%2FhgTracks%3Fposition%3D/ 
   		and defined $attr->{target} 
    		and $attr->{target} =~ /^offsite$/){    
  		$flag = 3;
	}
 
#$Flag = 4 (dbSNP Id)
	if ($tag =~ /^a$/
		and defined $attr->{href}  
    		and $attr->{href} =~ /^\/redirect\.jsp\?p=http%3A%2F%2Fwww\.ncbi\.nlm\.nih\.gov%2Fentrez%2Fquery./ 
   		and defined $attr->{target} 
    		and $attr->{target} =~ /^offsite$/){    
  		$flag = 4;
	}
 
#$Flag = 5 (Variant)
	if ($tag =~ /^a$/
		and defined $attr->{href}
		and $attr->{href} =~ /^\/views\/reports\/loadVariantReport/){
		$flag = 5;
	} 
 
#$Flag = 6 (Feature / Amino Acid Translation / Number of Chromosomes / Assay Types)
	if ($tag =~ /^td$/){
		$flag = 6;
	}
 
 
#$Flag = 7 (Frequency)
	if ($tag =~ /^a$/ 
		and defined $attr->{href} 
		and $attr->{href} =~ /^\/views\/reports\/loadFrequencyInSampleSets/){
		$flag = 7;            
	} 
 
#$Flag = 9 (View == END)
	if ($tag =~ /^a$/ 
		and defined $attr->{href} 
		and $attr->{href} =~ /^\/views\/reports\/loadSampleAlleles/){
		$flag = 9;            
	} 
}
 
#*****************************************************************************************
sub text_rtn {
	my ($text) = @_;
   	$text =~ s/\n/ /g;
 
#$Flag = 1 (Title)
	if($flag == 1){                
        	print "Le titre : $text \n";
    	}
 
#$Flag = 2 (Menu)
	if($flag == 2){                
		push(@tab_menu, $text);	
    	}
 
#$Flag = 3 (GP Position)
	if($flag == 3){              
         	print "GP POSITION : $text \n";
    	}
 
#$Flag = 4 (dbSNP Id)
	if($flag == 4){ 
		my $base_URL_dbSNP_Id = 'http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp&cmd=search&term=';               
		my $url_dbSNP_Id = $base_URL_dbSNP_Id.$text;
		print "dbSNP Id : $text /// $url_dbSNP_Id\n";
    	}
 
#$Flag = 5 (Variant)
	if($flag == 5 && $text =~ /^[^0-9]\/.+$/){
		print "Variant : $text \n";
	}
 
#$Flag = 6 (Feature)
	if($flag == 6 && (($text eq 'Exon') || ($text eq 'Intron') || ($text eq 'NA') || ($text =~ /^[0-9]' UTR$/))){
		print "Feature : $text\n" ;
	}
 
#$Flag = 6 (Amino Acid Translation) 
	if($flag == 6 && ($text =~ /^[A-Z]{1}[a-z]{2}\/[A-Z]{1}[a-z]{2}$/ || $text =~ /^[A-Z]{1}[a-z]{2}$/)){
		print "Amino Acid Translation : $text\n";
	}
 
#$Flag = 7 (Frequency)
	if($flag == 7){
		print "Frequency : $text\n";
    	}
 
#$Flag = 6 (Number of Chromosomes)	
	if($flag == 6 && ($text =~ /^[0-9]/ && $text ne '3\' UTR' && $text ne '5\' UTR')){
		print "Number of Chromosomes : $text\n";
	}
 
#Flag = 6 (Assay Types)
	if($flag == 6 && (($text =~ /^[A-Z]{1}[a-z]{3,}$/ || $text =~ /^[A-Z]{1}[a-z]{3,}, [A-Z]{1}[a-z]{3,}$/) && $text ne 'Exon' && $text ne 'Intron'
	&& $text ne 'View' && $text ne 'Queries' && $text ne 'Drugs' && $text ne 'Genes' && $text ne 'Diseases' && $text ne 'Pathways')) {
		print"Assay types : $text\n";
	}
 
#$Flag = 9 (View == END)
	if($flag == 9){
		print"* * * * * * * * * * * * * *\n";
	}
}
 
#****************************************************************************************** 
sub end_rtn {
	my ($tag) = @_;
 
#$Flag = 1 (Title)
	if ($tag =~ /^\/title$/){
              	$flag = 0;
	}
 
#$Flag = 2 (Menu)
	if ($tag =~ /^\/thead$/){
    		my $Text_menu = join("\n ", @tab_menu);
    		print"===> MENU : \n $Text_menu\n";
		$flag = 0;
		print "****************\n";
	}
 
#$Flag = 3 (GP Position)
	if ($tag =~ /^\/a$/ && ($flag == 3)){
              	$flag = 0;
	}
 
#$Flag = 4 (dbSNP Id)
	if ($tag =~ /^\/a$/ && ($flag == 4)){
              	$flag = 0;
	}
 
#$Flag = 5 (Variant)
	if ($tag =~ /^\/a$/ && ($flag == 5)){
              	$flag = 0;
	}
 
#$Flag = 6 (Feature / Amino Acid Translation / Number of Chromosomes / Assay Types)
	if ($tag =~ /^\/td$/ && ($flag == 6)){
		$flag = 0;
	}
 
#$Flag = 7 (Frequency)
	if ($tag =~ /^\/a$/ && ($flag == 7)){
              	$flag = 0;
	}
 
#$Flag = 9 (View == END)
	if ($tag =~ /^\/a$/ && ($flag == 9)){
              	$flag = 0;
	}
}
 
#*******************************************************************************************
#start parsing
$parser->parse($page);
 
#end parser
$parser->eof;
 
# END OF PROGRAM

**Jedai** · 03/04/2008, 13h06

Originally posted by PerlPicker

Do you really need an HTML parser? Fetching the whole page with LWP::Simple and analysing it with regular expressions is often simpler and more readable. (A very personal opinion!)

And not one that many share! As soon as the task gets even a little complex, regexps for extracting HTML become horribly complicated and very easy to get wrong. Except for the most elementary extractions, using an HTML parser is both simpler and more robust.

--
Jedaï
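
To make that concrete for the `td` question, one possible approach with HTML::Parser is to count the cells within each row and label each one with the `th` header of the same column. This is a sketch under assumptions: the helper `table_rows` is invented, the sample HTML is made up (the real page layout is not shown in this thread), and it assumes a single header row.

```perl
use strict;
use warnings;
use HTML::Parser;

# Return one hash per table row, keyed by the header text of each column.
sub table_rows {
    my ($html) = @_;
    my (@headers, @rows, $row, $col);
    my $in = '';    # '' outside cells, otherwise 'th' or 'td'

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if    ($tag eq 'tr') { $row = {}; $col = -1 }
            elsif ($tag eq 'th') { $in = 'th'; push @headers, '' }
            elsif ($tag eq 'td') { $in = 'td'; $col++ }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return unless $in;                      # skip text outside cells
            if ($in eq 'th') { $headers[-1] .= $text }
            else {
                my $name = $headers[$col] // "col$col";
                $row->{$name} .= $text;             # also collects text inside <b>
            }
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'th' || $tag eq 'td') { $in = '' }
            elsif ($tag eq 'tr' && $row && %$row) { push @rows, $row; $row = undef }
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @rows;
}

# Made-up sample table using the column names mentioned in this thread.
my $html = <<'HTML';
<table>
<tr><th>Feature</th><th>Amino Acid Translation</th><th>Number of Chromosomes</th></tr>
<tr><td>Exon</td><td><b>Ala/Val</b></td><td>120</td></tr>
<tr><td>Intron</td><td></td><td>88</td></tr>
</table>
HTML

for my $r (table_rows($html)) {
    print join(', ', map { "$_ = $r->{$_}" } sort keys %$r), "\n";
}
```

Because the label comes from the column position rather than from the shape of the text, no regex is needed to tell a "Feature" cell from a "Number of Chromosomes" cell.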
