[XML] Improving the parsing
Hello everyone!
Here is my little weekend problem (have a good weekend, everyone :D). This is the XML file I want to parse:
Code:
<protein_group group_number="2" probability="1.00">
<protein protein_name="UniRef100_Q8IUK7" n_indistinguishable_proteins="3" probability="1.00" percent_coverage="3.8" unique_stripped_peptides="KVPQVSTPTLVEVSR" group_sibling_id="a" total_number_peptides="2">
<annotation protein_description="Similar to serum albumin precursor [Homo sapiens]"/>
<indistinguishable_protein protein_name="UniRef100_P02768">
<annotation protein_description="Serum albumin precursor [Homo sapiens]"/> </indistinguishable_protein>
<indistinguishable_protein protein_name="UniRef100_Q86YG0">
<annotation protein_description="Similar to alpha-fetoprotein [Homo sapiens]"/> </indistinguishable_protein>
<peptide peptide_sequence="KVPQVSTPTLVEVSR" charge="2" initial_probability="1.00" nsp_adjusted_probability="1.00" peptide_group_designator="a" weight="1.00" is_nondegenerate_evidence="Y" n_tryptic_termini="2" n_sibling_peptides="0.99" n_sibling_peptides_bin="3" n_instances="1" is_contributing_evidence="Y">
</peptide>
<peptide peptide_sequence="KVPQVSTPTLVEVSR" charge="3" initial_probability="0.99" nsp_adjusted_probability="1.00" peptide_group_designator="a" weight="1.00" is_nondegenerate_evidence="Y" n_tryptic_termini="2" n_sibling_peptides="1.00" n_sibling_peptides_bin="3" n_instances="1" is_contributing_evidence="Y">
</peptide>
</protein>
</protein_group>
And here is my code that parses this XML file:
Code:
use XML::Parser;

# initialize the parser
my $parser = XML::Parser->new( Handlers =>
    {
        Start => \&handle_start
    });
$url = "./interact-prot.xml";
$parser->parsefile( $url );
open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
print FILE "\nURL:$url";
my @stack;

# called for every opening tag, whatever the element is
sub handle_start {
    my(@stack);
    my($name, $desc, $val);
    my( $expat, $protein, %attrs ) = @_;
    push( @stack, { protein_group=>$protein});
    if( %attrs ) {
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "group_number") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "\nGroup number: $val\n";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "protein_name") {
                $name = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "AccessionNumber: $name\n";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "probability") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "probability=$val\t";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "percent_coverage") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "percent_coverage=$val\t";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "unique_stripped_peptides") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "unique_stripped_peptides=$val\n";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "peptide_sequence") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "peptide_sequence=$val\t";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "charge") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "charge=$val\t";
            }
        }
        while( my( $key, $value ) = each( %attrs )) {
            if ($key eq "initial_probability") {
                $val = $value;
                open FILE, ">>Results.txt" or die "Cannot open Results.txt !!";
                print FILE "initial_probability=$val\n";
            }
        }
    }
}
As you will have noticed, I am only looking for a few pieces of information. However, I am seeing redundant output, particularly for the probability attribute, which appears on both the protein_group tag and the protein tag. In addition, I would like to distinguish the protein_name attribute belonging to the protein tag from the protein_name attribute belonging to the indistinguishable_protein tag. Do you see how I could do this?
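To make the question concrete, here is a rough sketch of the direction I imagine: since the Start handler receives the element name as its second argument, the handler could look up which attributes to report for that element instead of testing attribute names alone. The %wanted table, the @lines array and the output format are my own assumptions for illustration, as is the choice to report probability only for protein; the inline sample is trimmed from the file above so the sketch runs on its own.

```perl
use strict;
use warnings;
use XML::Parser;

# Attributes to report, keyed by element name (hypothetical table).
# Because the lookup is per element, "probability" on <protein_group> is
# ignored here, and protein_name on <protein> is kept apart from
# protein_name on <indistinguishable_protein>.
my %wanted = (
    protein_group             => [ 'group_number' ],
    protein                   => [ qw(protein_name probability percent_coverage unique_stripped_peptides) ],
    indistinguishable_protein => [ 'protein_name' ],
    peptide                   => [ qw(peptide_sequence charge initial_probability) ],
);

my @lines;    # collected output, one entry per element of interest

sub handle_start {
    my ( $expat, $element, %attrs ) = @_;
    my $names = $wanted{$element} or return;    # skip other elements
    my @fields = map  { "$_=$attrs{$_}" }
                 grep { defined $attrs{$_} } @$names;
    push @lines, "$element: " . join( "\t", @fields );
}

my $parser = XML::Parser->new( Handlers => { Start => \&handle_start } );

# Trimmed sample; for the real file, use
# $parser->parsefile("./interact-prot.xml") instead.
my $xml = <<'XML';
<protein_group group_number="2" probability="1.00">
<protein protein_name="UniRef100_Q8IUK7" probability="1.00" percent_coverage="3.8" unique_stripped_peptides="KVPQVSTPTLVEVSR">
<indistinguishable_protein protein_name="UniRef100_P02768"/>
<peptide peptide_sequence="KVPQVSTPTLVEVSR" charge="2" initial_probability="1.00"/>
</protein>
</protein_group>
XML

$parser->parse($xml);
print "$_\n" for @lines;
```

Written this way, each line of output starts with the element it came from, and Results.txt would only need to be opened once, before parsing, rather than in every branch of the handler.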
Thank you in advance for your help, and sorry for the length of the thread.
See you!