Ecrire bit à bit des données de taille variable dans un byte[] ?

**StringBuilder** · 19/07/2016, 19h11

Bonjour,

Ça doit faire 15 ans que je me suis cassé les dents pour la première fois sur un bête TP d'école de programmation, et en voulant vérifier si j'étais toujours aussi bête, je me rends compte que oui : je n'y arrive toujours pas.

Je souhaite mettre en place, histoire de relever le défit, une compression de fichier à l'aide du codage de Huffman.

Aucun souci pour construire l'arbre. A mon avis, ça pique les yeux niveau performances, mais bon, ça marche, c'est déjà ça.

En revanche, une fois que j'ai construit mon dictionnaire, comment stocker dans un byte[] (ou un Stream) mes données compressées ?

Ma phrase en entrée :

"la vie, c'est comme une boîte de chocolats : on ne sait jamais sur quoi on va tomber"

On dictionnaire :

c:1111
m:1110
b:11011
l:11010
::110011
?:110010
':110001
,:110000
:10
h:011111
j:011110
q:011101
d:011100
s:0110
t:0101
a:0100
u:00111
v:001101
r:001100
n:00101
i:00100
o:0001
e:0000

Ok...

Mais maintenant, comment obtenir :

1101001001000110100100000011000010 [...]

En effet, quand je vais récupérer le "l", je vais obtenir 0x11010.
=> Déjà pour le mettre à sa place dans le premier byte, je vais devoir le décaller vers la gauche de 3 bits : comment le savoir à l'avance ?
Ensuite, quand je vais récupérer de mon arbre le "a", je vais obtenir "0100", donc 0x100... Comment savoir que j'ai perdu le 0 de poids fort ?

Dois-je passer par une structure {byte, byte} avec d'un côté "0100" et de l'autre côté "4" de façon à savoir que je dois bien avoir 4 bits, y compris le 0 de poids fort ? Une autre solution ?

Mais ensuite, quand je vais arriver dans mon byte contenant déjà le "l" : 11010..., comment je vais savoir découper mon "0100" en deux pour compléter le premier byte et en entamer un suivant ? 11010010 0....... ?

Là je sèche. Je n'ai que des usines à gaz en tête... Soit des trucs assez peu consommateurs en mémoire, mais bouffeurs de cycles CPU, soit l'inverse...

**StringBuilder** · 20/07/2016, 16h48

Bon, finalement, la nuit portant conseil, j'ai fini par y arriver

En revanche, je me demande bien comment stocker de façon efficace le dictionnaire...

En gros, j'ai une référence "root" (point d'entrée de mon arbre) de ce type :

Code csharp :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
 
    /// <summary>
    /// Represents an Huffman tree node.
    /// It will always have two child nodes or zero.
    /// If it has child nodes, then it doesn't have a valude, otherwise it does.
    /// </summary>
    public class HuffmanNode : IComparable
    {
        private HuffmanNode[] nodes;
        private byte data;
        private int weight;
 
        /// <summary>
        /// Creates a valued node.
        /// </summary>
        /// <param name="data">Value</param>
        /// <param name="weight">Number of occurences of the value in the input document</param>
        public HuffmanNode(byte data, int weight)
        {
            nodes = null;
            Data = data;
            Weight = weight;
        }
 
        /// <summary>
        /// Create a parent node.
        /// </summary>
        /// <param name="left">Left node</param>
        /// <param name="right">Right node</param>
        public HuffmanNode(HuffmanNode left, HuffmanNode right)
        {
            nodes = new HuffmanNode[2];
            Left = left;
            Right = right;
        }
 
        /// <summary>
        /// Node's data
        /// </summary>
        public byte Data
        {
            get
            {
                return data;
            }
            set
            {
                data = value;
            }
        }
 
        /// <summary>
        /// Node's weight (weight of the data, or sum of the  children nodes' weight).
        /// </summary>
        public int Weight
        {
            get
            {
                if (Left != null && Right != null)
                {
                    return Left.Weight + Right.Weight;
                }
                else
                {
                    return weight;
                }
            }
            set
            {
                weight = value;
            }
        }
 
        /// <summary>
        /// Left child node
        /// </summary>
        public HuffmanNode Left
        {
            get
            {
                if (nodes == null)
                {
                    return null;
                }
                else
                {
                    return nodes[0];
                }
            }
            set
            {
                nodes[0] = value;
            }
        }
 
        /// <summary>
        /// Right child node
        /// </summary>
        public HuffmanNode Right
        {
            get
            {
                if (nodes == null)
                {
                    return null;
                }
                else
                {
                    return nodes[1];
                }
            }
            set
            {
                nodes[1] = value;
            }
        }

Comment stocker ça sans perdre plein de place ?

En effet, le plus "simple" serait de stocker chaque feuille sous forme :
byte data
uint code
byte length

Mais cela me fait 6 octets par byte présents dans le dictoinnaire.

Ce qui équivaut, pour la chaine "toto va à la plage" à 10 * 6 = 60 octets pour stocker le dictoinnaire... alors qu'une fois encodé, tout tiens en 4 ou 5 octets, et que la chaîne d'origine fait 18 octets...

D'autant que dans ce cas :
- le code n'a aucun intérêt à être stocké sur 4 octets, puisqu'il ne dépasse pas 4 ou 5 bits au max
- lenght n'a pas besoin non plus d'être sur un byte entier puisque 5 s'encode sur 3 bits

Sauf que si je me retrouve avec un document contenant chaque valeur possible de byte, et avec des fréquences très variées, je peux potentiellement avoir des codes très longs (même si j'ai un doute quant au fait qu'il faille réellement un uint (ou plus ?) pour stocker les codes les plus longs).

**StringBuilder** · 20/07/2016, 18h37

Bon ben finalement je me suis un peu aidé d'un topic de stackoverflow pour ce dernier point...

Ça m'a permis aussi de passer par des classes "BitReader" et "BitWriter", ce qui a largement simplifié mon code.

En revanche, avant j'écrivais directement plusieurs bits d'un coup, alors qu'avec ces classes je fais une boucle inutile... Pas le courage d'optimiser ça pour le moment, le principal c'est que ça marche

Code scharp :

Sélectionner tout - Visualiser dans une fenêtre à part

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
 
using System;
using System.Collections.Generic;
 
using System.IO;
 
namespace MagicHuffman
{
    /// <summary>
    /// Represents an Huffman tree node.
    /// It will always have two child nodes or zero.
    /// If it has child nodes, then it doesn't have a value, otherwise it does.
    /// </summary>
    public class HuffmanNode : IComparable
    {
        private HuffmanNode[] nodes;
        private byte data;
        private int weight;
 
        /// <summary>
        /// Creates a valued node.
        /// </summary>
        /// <param name="data">Value</param>
        /// <param name="weight">Number of occurences of the value in the input document</param>
        public HuffmanNode(byte data, int weight)
        {
            nodes = null;
            Data = data;
            Weight = weight;
        }
 
        /// <summary>
        /// Create a parent node.
        /// </summary>
        /// <param name="left">Left node</param>
        /// <param name="right">Right node</param>
        public HuffmanNode(HuffmanNode left, HuffmanNode right)
        {
            nodes = new HuffmanNode[2];
            Left = left;
            Right = right;
        }
 
        /// <summary>
        /// Node's data
        /// </summary>
        public byte Data
        {
            get
            {
                if (nodes != null)
                {
                    throw new InvalidOperationException("This node doesn't have a value as it has children.");
                }
                return data;
            }
            set
            {
                if (nodes != null)
                {
                    throw new InvalidOperationException("This node can't have a value as it has children.");
                }
                data = value;
            }
        }
 
        /// <summary>
        /// Node's weight (weight of the data, or sum of the children nodes' weight).
        /// </summary>
        public int Weight
        {
            get
            {
                if (nodes != null)
                {
                    return Left.Weight + Right.Weight;
                }
                else
                {
                    return weight;
                }
            }
            set
            {
                if (nodes != null)
                {
                    throw new InvalidOperationException("This node weight can't be set as it has children.");
                }
                weight = value;
            }
        }
 
        /// <summary>
        /// Left child node
        /// </summary>
        public HuffmanNode Left
        {
            get
            {
                if (nodes == null)
                {
                    return null;
                }
                else
                {
                    return nodes[0];
                }
            }
            set
            {
                if (nodes == null)
                {
                    throw new InvalidOperationException("This node can't have children as it is valued.");
                }
                nodes[0] = value;
            }
        }
 
        /// <summary>
        /// Right child node
        /// </summary>
        public HuffmanNode Right
        {
            get
            {
                if (nodes == null)
                {
                    return null;
                }
                else
                {
                    return nodes[1];
                }
            }
            set
            {
                if (nodes == null)
                {
                    throw new InvalidOperationException("This node can't have children as it is valued.");
                }
                nodes[1] = value;
            }
        }
 
        /// <summary>
        /// Compares the weight of the node
        /// </summary>
        /// <param name="obj">Another HuffmanNode instance</param>
        /// <returns>Comparision of the weight of the two nodes</returns>
        /// <exception cref="ArgumentException">obj must be a HuffmanNode</exception>
        public int CompareTo(object obj)
        {
            if (obj is HuffmanNode)
            {
                return this.Weight.CompareTo((obj as HuffmanNode).Weight);
            }
            throw new ArgumentException("Object is not a HuffmanNode");
        }
 
        /// <summary>
        /// Serialize (for debug purpose only) the node
        /// </summary>
        /// <param name="prefix">Prefix to add (from the parent nodes)</param>
        /// <returns>String representing the serialized node</returns>
        public string ToString(string prefix)
        {
            if (nodes == null)
            {
                return string.Format("{0}:{1}\r\n", (char)Data, prefix);
            }
            else
            {
                return string.Format("{0}{1}", Left.ToString(prefix + "1"), Right.ToString(prefix + "0"));
            }
        }
 
        public void StoreNode(BitWriter writer)
        {
            if (nodes == null)
            {
                writer.WriteBit(1);
                writer.WriteByte(Data);
            }
            else
            {
                writer.WriteBit(0);
                Left.StoreNode(writer);
                Right.StoreNode(writer);
            }
        }
 
        public static HuffmanNode ReadNode(BitReader reader)
        {
            if (reader.ReadBit() == 1)
            {
                return new HuffmanNode(reader.ReadByte(), 0);
            }
            else
            {
                HuffmanNode left = ReadNode(reader);
                HuffmanNode right = ReadNode(reader);
                return new HuffmanNode(left, right);
            }
        }
    }
 
    public class BitWriter
    {
        private int bitPosition;
        private Stream output;
        private byte currentByte;
 
        public BitWriter(Stream s)
        {
            output = s;
            bitPosition = 0;
            currentByte = 0;
        }
 
        public void WriteBit(byte bit)
        {
            currentByte |= (byte)(bit << 7 - bitPosition);
            bitPosition++;
            if (bitPosition == 8)
            {
                output.WriteByte(currentByte);
                currentByte = 0;
                bitPosition = 0;
            }
        }
 
        public void WriteBits(uint bits, byte length)
        {
            // On doit pouvoir optimiser ça
            for (byte i = 0; i < length; ++i)
            {
                WriteBit((byte)((bits >> 31 - i) & 0x1));
            }
        }
 
        public void WriteByte(byte value)
        {
            WriteBits(((uint)value << 24), 8);
        }
 
        public void WriteHuffmanCode(HuffmanCode code)
        {
            WriteBits(code.Bits, code.Length);
        }
 
        public void Flush()
        {
            if (bitPosition > 0)
            {
                output.WriteByte(currentByte);
            }
            output.Flush();
        }
    }
 
    public class BitReader
    {
        private byte bitPosition;
        private Stream input;
        private byte currentByte;
 
        public BitReader(Stream s)
        {
            input = s;
            input.Seek(0, SeekOrigin.Begin);
            bitPosition = 0;
            currentByte = (byte)input.ReadByte();
        }
 
        public byte ReadBit()
        {
            if (bitPosition == 8)
            {
                currentByte = (byte)input.ReadByte();
                bitPosition = 0;
            }
            return (byte)((currentByte >> 7 - bitPosition++) & 0x1);
        }
 
        public byte ReadByte()
        {
            byte value = 0;
            for (byte i = 0; i < 8; ++i)
            {
                value |= (byte)(ReadBit() << 7 - i);
            }
            return value;
        }
 
        public uint ReadBits(byte length)
        {
            uint res = 0;
            // On doit pouvoir optimiser ça
            for (byte i = 0; i < length; ++i)
            {
                res |= (uint)ReadBit() << (31 - i);
            }
            return res;
        }
 
    }
 
    /// <summary>
    /// Represents a "huffman code" (ie. 01101) and its length in order to make difference between 001 and 1.
    /// </summary>
    public struct HuffmanCode
    {
        /// <summary>
        /// Code
        /// </summary>
        public uint Bits;
 
        /// <summary>
        /// Length of the code
        /// </summary>
        public byte Length;
    }
 
    /// <summary>
    /// Use this class to compress and decompress a document using Huffman encoding
    /// </summary>
    public class Huffman
    {
        /// <summary>
        /// Buffer size to be read/write from the streams. Can be freely tuned.
        /// </summary>
        private const int BUFFER_SIZE = 4096; // 4 Ko
        private const int LAST_BIT_POS = BUFFER_SIZE << 3;
 
        /// <summary>
        /// Top of the Huffman tree (used to create the dictionary and to decode)
        /// </summary>
        private HuffmanNode root = null;
 
        /// <summary>
        /// Dictionary used to encode.
        /// A SortedDictionary<T> may be used here, but I guess a simple array (with holes) is faster.
        /// </summary>
        private HuffmanCode[] dictionary = new HuffmanCode[256];
 
        /// <summary>
        /// Number of bytes in the uncompressed document
        /// </summary>
        private uint DocumentLength = 0;
 
        /// <summary>
        /// Instanciate a new empty huffman tree. It can be used only for decompression.
        /// </summary>
        public Huffman()
        {
        }
 
        /// <summary>
        /// Instanciate a new Huffman tree from a source document.
        /// </summary>
        /// <param name="s">Source document</param>
        public Huffman(Stream s)
        {
            s.Seek(0, SeekOrigin.Begin);
 
            // Create a temporary dictionary to enumerate any byte value and compute its weight
            List<HuffmanNode> bytes = new List<HuffmanNode>(256);
            for (int i = 0; i < 256; i++)
            {
                bytes.Add(new HuffmanNode((byte)i, 0));
            }
 
            // Document will be read by blocks
            byte[] buffer = new byte[BUFFER_SIZE];
            int read = 0;
 
            // Read all the document to enumarate each byte weight
            while ((read = s.Read(buffer, 0, BUFFER_SIZE)) > 0)
            {
                DocumentLength += (uint)read;
                for (int i = 0; i < read; ++i)
                {
                    // Increment byte weight
                    bytes[buffer[i]].Weight++;
                }
            }
 
            // Remove any unused byte
            for (int i = 255; i >= 0; --i)
            {
                if (bytes[i].Weight == 0)
                {
                    bytes.RemoveAt(i);
                }
            }
 
            // Construct the Huffman tree
            while (bytes.Count > 1)
            {
                // Sort nodes from lowest weight to highest
                bytes.Sort();
 
                // Merge the two lowest weighted nodes togethers
                HuffmanNode node = new HuffmanNode(bytes[0], bytes[1]);
                bytes.RemoveAt(1);
                bytes[0] = node;
            }
 
            // Root node of the tree is the lastest remaning node in the list
            root = bytes[0];
 
            // Destroy the List<> the is not used anymore
            bytes = null;
 
            // Build the dictionary used to encode (compress)
            BuildDictionary(root, 0, 0);
        }
 
        /// <summary>
        /// Builds a dictionary used to compress a document using the Huffman tree
        /// </summary>
        /// <param name="node">Parent node</param>
        /// <param name="code">Prefix code from the parent node</param>
        /// <param name="depth">Code length (depth of the node in the tree)</param>
        private void BuildDictionary(HuffmanNode node, uint code, byte depth)
        {
            if (node.Left != null)
            {
                // The node has children : build dictionary from both children
                ++depth;
                BuildDictionary(node.Left, (code << 1) | 0x1, depth);
                BuildDictionary(node.Right, (code << 1), depth);
            }
            else
            {
                // The node has no child : build dictionary from both children
                dictionary[node.Data] = new HuffmanCode() { Bits = code << (32 - depth), Length = depth };
            }
        }
 
        /// <summary>
        /// Compresses the input stream to the output stream using huffman encoding
        /// </summary>
        /// <param name="input">Source document</param>
        /// <param name="output">Compressed document</param>
        public void Compress(Stream input, BitWriter output)
        {
            // Step 1 : Store document length
            output.WriteBits(DocumentLength, 32);
 
            // Step 2 : Store dictionary
            root.StoreNode(output);
 
            byte[] inputbuffer = new byte[BUFFER_SIZE];
            int read = 0;
 
            input.Seek(0, SeekOrigin.Begin);
            while ((read = input.Read(inputbuffer, 0, BUFFER_SIZE)) > 0)
            {
                for (int i = 0; i < read; ++i)
                {
                    output.WriteHuffmanCode(dictionary[inputbuffer[i]]);
                }
            }
 
            output.Flush();
        }
 
        /// <summary>
        /// Uncompresses input stream to output stream using huffman encoding
        /// </summary>
        /// <param name="input">Compressed document</param>
        /// <param name="output">Original document</param>
        public void DeCompress(BitReader input, Stream output)
        {
            // Step 1 : Load document length
            DocumentLength = input.ReadBits(32);
 
            // Step 2 : Load dictionary
            root = HuffmanNode.ReadNode(input);
 
            //byte[] inputbuffer = new byte[BUFFER_SIZE + 4];
            byte[] outputbuffer = new byte[BUFFER_SIZE];
            //int bitPosition = 0;
            int bytePosition = 0;
            int totalLength = 0;
 
            //input.Seek(0, SeekOrigin.Begin);
            //read = input.Read(inputbuffer, 0, BUFFER_SIZE + 4);
            while (totalLength < DocumentLength)
            {
                outputbuffer[bytePosition++] = GetByteFromBits(input);
                totalLength++;
                if (bytePosition == BUFFER_SIZE)
                {
                    output.Write(outputbuffer, 0, BUFFER_SIZE);
                    bytePosition = 0;
                }
            }
            if (bytePosition > 0)
            {
                output.Write(outputbuffer, 0, bytePosition);
            }
        }
        /// <summary>
        /// Gets a byte from bits using the instance's huffman dictionary
        /// </summary>
        /// <param name="reader">BitReader where to read bits from</param>
        /// <returns>Decoded byte from the dictionary</returns>
        public byte GetByteFromBits(BitReader reader)
        {
            HuffmanNode currentNode = root;
 
            while (currentNode.Left != null)
            {
                if (reader.ReadBit() == 1)
                {
                    currentNode = currentNode.Left;
                }
                else
                {
                    currentNode = currentNode.Right;
                }
            }
            return currentNode.Data;
        }
 
        /// <summary>
        /// Serialize (for debug only) the huffman tree of the instance
        /// </summary>
        /// <returns></returns>
        public override string ToString()
        {
            return root.ToString(string.Empty);            
        }
    }
}

Ecrire bit à bit des données de taille variable dans un byte[] ?

C#

Vue hybride

Discussions similaires

Partager

Partager