Importing a .txt file into a pandas DataFrame: header problem

**pwetzou** · 31/01/2017, 11h19

Hello everyone,

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tab-separated file read with explicit column names
data = pd.read_csv('G:/work/essai2.txt', delimiter="\t", header=0, index_col=[0], names=["x", "y"])
print(data)

print(data['x'])

I would like to build a horizontal bar chart from a .txt file.

Since the file contains one column of strings and one column of integers, I think a DataFrame is better suited than an array.
I would like to put the contents of each column into a Series so that I can plot them.

However, for a reason I can't figure out, I am unable to import my data and its header correctly.
The "x" and "y" headers are not placed on the same level, so I cannot select my columns.

Could you shed some light on this simple problem?
Note: I have rechecked my source file and tried it both as .txt and as .csv; I don't see any stray whitespace.

[Attached image: 1.png]

Thanks in advance,
Regards,
G.

**pwetzou** · 31/01/2017, 11h32

I just realized that the problem comes from the index_col argument.

If I remove it, my data comes out in order, row by row.
However, isn't there a way to make Python understand that I want to hide the index, but that my first row does contain my headers?

Because if I do this:

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Same file, read without index_col and without names
data = pd.read_csv('G:/work/essai2.txt', delimiter="\t", header=0)
print(data)

print(data['y'])
print(data['x'])

data_x = data['x']
print(data_x)

I get both my index column and my 'x' column...
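
For what it's worth, the difference between the two attempts can be reproduced on a small sample. This is only a sketch: the `raw` string below is invented and assumes the real file has an `x`/`y` header row followed by tab-separated rows. It suggests that `index_col=[0]` moves the first column into the index, which would explain why `'x'` can no longer be selected as a column.

```python
import io
import pandas as pd

# Made-up stand-in for essai2.txt: a header row, then tab-separated rows.
raw = "x\ty\napple\t5\npear\t3\nplum\t8\n"

# As in the first attempt: 'names' replaces the header row and
# index_col=[0] moves the first column ('x') into the index.
with_index = pd.read_csv(io.StringIO(raw), delimiter="\t", header=0,
                         index_col=[0], names=["x", "y"])
print(list(with_index.columns))      # columns that remain selectable
print(list(with_index.index.names))  # name(s) carried by the index

# Without index_col, both 'x' and 'y' stay ordinary columns and a
# default integer index is created alongside them.
plain = pd.read_csv(io.StringIO(raw), delimiter="\t", header=0)
print(list(plain.columns))
print(plain["x"].tolist())
```

In the second form the extra column printed on the left is the default integer index, not data from the file.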

**Julien N** · 31/01/2017, 14h03

Hi,

I believe a DataFrame always has both columns and an index, whether they are the defaults or ones you supplied. When the table is displayed (via print), the index is displayed as well. I don't think there is a way to keep it from being shown.
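
One caveat, as far as display alone goes: `DataFrame.to_string(index=False)` returns the table text without the index column, although the index itself is still part of the DataFrame. A minimal sketch, with made-up data standing in for the file:

```python
import pandas as pd

# Made-up data standing in for the two columns of the file.
data = pd.DataFrame({"x": ["apple", "pear", "plum"], "y": [5, 3, 8]})

# The DataFrame keeps its index; it is only left out of the rendered text.
text = data.to_string(index=False)
print(text)

# The index is still there afterwards.
print(list(data.index))
```

This only changes what is printed; selecting or plotting a column still uses the index.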

If you leave out the header argument altogether, like this:

Code:

data = pd.read_csv('G:/work/essai2.txt', delimiter="\t")

You should get the same result, because by default read_csv infers the header from the first row when no names are passed.
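
A quick way to check this on a small sample (the `raw` string is invented for illustration and is not the real file):

```python
import io
import pandas as pd

raw = "x\ty\napple\t5\npear\t3\n"  # made-up sample, not the real file

# Header given explicitly as the first row...
explicit = pd.read_csv(io.StringIO(raw), delimiter="\t", header=0)
# ...versus the default, where the header is inferred.
inferred = pd.read_csv(io.StringIO(raw), delimiter="\t")

# True when both DataFrames hold the same columns, index and values.
print(explicit.equals(inferred))
```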

J

**pwetzou** · 31/01/2017, 14h12

Thanks for the reply.

Indeed, I get the same result.

Now my question:
- If I store my column 0 (x) in a pandas Series in order to plot it, will the index show up in the Series as well?

I'll test this on the side.
G.

**pwetzou** · 31/01/2017, 14h52

All right, I worked it out:

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('G:/work/essai2.txt', delimiter="\t")

print(data)
# Alternative: alphabetical sort on 'x' first (pandas detects strings in that column)
# data2 = data.sort_values(by=['x', 'y'], ascending=[True, False])
# Sort by 'y' from smallest to largest (the file is ordered largest to smallest).
# sort_values replaces the older sort_index(by=...) form, which recent pandas versions no longer accept.
data2 = data.sort_values(by=['y', 'x'], ascending=[True, False])
print(data2)

data_x = data2['x'].astype(str)  # pandas Series of strings from the 'x' column (str(series) would give one single string)
data_y = data2['y']  # pandas Series of integers from the 'y' column
data_index = data2.iloc[:, 0]  # the same 'x' column, selected by position with iloc
data_index2 = list(data2.index)  # plain list of the index labels of the whole table; a list supports slicing

freq_series = pd.Series(data_y)
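
As a side note, here is a small sketch of what this two-key sort does, written with `sort_values` on invented data. The rows keep their original index labels after sorting, which is why the index no longer reads 0, 1, 2, ... in `data2`.

```python
import pandas as pd

# Made-up data; two rows share the same 'y' to show the tie-break on 'x'.
data = pd.DataFrame({"x": ["plum", "apple", "pear", "fig"],
                     "y": [8, 5, 3, 5]})

# 'y' ascending first, then 'x' descending among equal 'y' values.
data2 = data.sort_values(by=["y", "x"], ascending=[True, False])
print(data2)

# Index labels travel with their rows.
print(list(data2.index))

# Optional: renumber the rows 0..n-1 after sorting.
renumbered = data2.reset_index(drop=True)
print(list(renumbered.index))
```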

Now all that is left is to set up the chart: I would like my 'x' values as the labels on the Y axis, and my 'y' values on the X axis, from the smallest on the left to the largest on the right.
I would like to base the rest on this template:

Code:

plt.figure(figsize=(8, 4))  # adjust the figure size / proportions
ax = freq_series.plot(kind='barh', label='Filing\nApplications', color='orange')
ax.set_title("blabla")
ax.set_xlabel("Years")
ax.set_ylabel('blabla')
# ax.set_xticklabels(occ)  # 'occ' is not defined anywhere yet; for barh the X axis is numeric
ax.set_yticklabels(data2['x'])  # tick labels for the Y axis: one 'x' string per bar

rects = ax.patches

# Rotate the X tick labels
for tick_label in ax.get_xmajorticklabels():
    tick_label.set_rotation(30)
    tick_label.set_verticalalignment("top")

# Write each bar's value next to it (adapted for horizontal bars)
def autolabel(rects, ax):
    # Get the X-axis extent to calculate the label position from.
    (x_left, x_right) = ax.get_xlim()
    x_width = x_right - x_left

    for rect in rects:
        width = rect.get_width()  # for a horizontal bar, the width is the value

        # Fraction of the axis width taken up by this bar
        p_width = (width / x_width)

        # If the bar nearly fills the axis, put the label inside it;
        # otherwise, put it just past the end of the bar.
        if p_width > 0.95:  # arbitrary; 95% looked good to me.
            label_position = width - (x_width * 0.05)
        else:
            label_position = width + (x_width * 0.01)

        ax.text(label_position, rect.get_y() + rect.get_height() / 2.,
                '%d' % int(width),
                ha='left', va='center')

autolabel(rects, ax)
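
A more compact variant of the same idea, as a sketch only: if the 'x' column is used as the index of the Series, pandas labels the horizontal bars with it directly. The data below is invented, and the `Agg` backend is chosen here just so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for the file's two columns.
data = pd.DataFrame({"x": ["plum", "apple", "pear"], "y": [8, 5, 3]})

# Sort by value, then use 'x' as the index so it labels the bars.
freq_series = data.sort_values(by="y").set_index("x")["y"]

fig, ax = plt.subplots(figsize=(8, 4))
freq_series.plot(kind="barh", ax=ax, color="orange")
ax.set_xlabel("y")
ax.set_ylabel("x")

# Write each value just past the end of its bar.
for rect in ax.patches:
    value = rect.get_width()  # bar length is the plotted value for barh
    ax.text(value, rect.get_y() + rect.get_height() / 2,
            " %d" % value, ha="left", va="center")

fig.canvas.draw()  # render once so the tick labels are final
tick_labels = [t.get_text() for t in ax.get_yticklabels()]
print(tick_labels)
```

Call `fig.savefig(...)` or `plt.show()` (with an interactive backend) to look at the result.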

**Julien N** · 31/01/2017, 18h14

data_x = data2['x'].astype(str)  # pandas Series of strings from the 'x' column
data_y = data2['y']  # pandas Series of integers from the 'y' column
data_index = data2.iloc[:, 0]  # the same 'x' column, selected by position with iloc
data_index2 = list(data2.index)  # plain list of the index labels of the whole table; a list supports slicing

freq_series = pd.Series(data_y)

I'm not quite sure what you are trying to do here. Keep in mind that a DataFrame is made up of Series. So when you write:

Code:

freq_series = pd.Series(data_y)

this is the same as:

Code:

freq_series  = data2['y']
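
A small check of that equivalence, on invented data (the real `data2` comes from the file):

```python
import pandas as pd

# Made-up stand-in for the sorted DataFrame.
data2 = pd.DataFrame({"x": ["pear", "apple", "plum"], "y": [3, 5, 8]})

data_y = data2["y"]               # selecting a column already gives a Series
freq_series = pd.Series(data_y)   # wrapping it again changes nothing

print(type(data_y).__name__)
print(freq_series.equals(data_y))
# The column carries the DataFrame's index along with its values.
print(freq_series.index.equals(data2.index))
```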

If I understand where you are heading, you want to transform your data so that you can do this:

Code:

ax = freq_series.plot(kind='barh', label='Filing\nApplications', color='orange')

If that is the case, could you give us an excerpt of the raw data and an excerpt of the processed data in freq_series, so that we can picture things a little better?

J
