In the JPEG algorithm, precision is lost not only in 3) quantization: 1) the RGB-to-YCbCr conversion and 2) the DCT cause the same problem. The document I read claims that 1) and 2) are lossless, however.
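A quick numpy check of why 1) is not lossless in practice: the RGB/YCbCr transform is invertible in floating point, but JPEG stores YCbCr channels as 8-bit integers, and that rounding already loses information. This is a sketch with the JFIF (BT.601) coefficients; the test data is random, not from any dataset used here.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JFIF RGB -> YCbCr, kept in float (no rounding)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299    * r + 0.587    * g + 0.114    * b
    cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
    cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

def ycbcr_to_rgb(ycc):
    y, cb, cr = ycc[..., 0], ycc[..., 1] - 128.0, ycc[..., 2] - 128.0
    r = y + 1.402    * cr
    g = y - 0.344136 * cb - 0.714136 * cr
    b = y + 1.772    * cb
    return np.stack([r, g, b], axis=-1)

rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(1000, 3)).astype(np.float64)

# In float the transform is (numerically) invertible...
float_err = np.abs(ycbcr_to_rgb(rgb_to_ycbcr(rgb)) - rgb).max()

# ...but storing YCbCr as uint8, as JPEG does, already loses information.
quantized = np.round(rgb_to_ycbcr(rgb)).clip(0, 255)
int_err = np.abs(np.round(ycbcr_to_rgb(quantized)).clip(0, 255) - rgb).max()
```

So the "lossless" claim only holds before the channels are stored as integers.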
Finish the naive version of the model
Feb 4
Visualize the model
Debug the model
ToTensor scales the input by 1/255. I am concerned about the weights of the quantization matrix.
The quantization matrix does not change much; it may already be stuck in a local minimum. Let's give it some random weights.
Another question: how can I tell the tool which matrix is "better" (one that minimizes the division results in some sense)?
Feb 14
The gradient was abnormal: round() and clamp() were killing my gradient, so I rewrote them. Besides, I scaled all operations back to the range [0, 1]. This is legal because rgb2ycbcr and the DCT are linear operations, and quantization gives the same results as long as I also put the quantization matrix into the range [0, 1].
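For reference, the usual rewrite of round() is the straight-through estimator: round in the forward pass, pass the gradient through unchanged in the backward pass. A minimal sketch with hand-rolled forward/backward functions (no autograd framework assumed; in PyTorch the same trick is `x + (x.round() - x).detach()`):

```python
import numpy as np

def round_ste_forward(x):
    # Forward: true rounding (derivative is zero almost everywhere).
    return np.round(x)

def round_ste_backward(grad_output):
    # Backward: pretend round() was the identity, so the gradient
    # flows through instead of vanishing.
    return grad_output

x = np.array([0.2, 0.7, 1.4])
y = round_ste_forward(x)                  # rounded values
g = round_ste_backward(np.ones_like(x))   # gradient passes through as-is
```

clamp() gets the same treatment: clamp in the forward pass, identity (or a masked identity) gradient in the backward pass.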
Backward propagation is working normally; the evidence is that all quantization terms vanish to 1, which eliminates all differences between the images before and after JPEG compression. This raises another problem: regularization.
I tried to regularize the quantization terms. However, the quantization matrix goes to the other extreme: either 1 for small row and column indexes, or 255 for large row and column indexes.
Another point confusing me: the regularization term differs hugely in magnitude from the normal loss (e.g., 1000 vs. 1), yet the loss still seems to converge. One possible explanation is that addition makes the gradient computations independent of each other, so the other weights still move to the proper places as expected. Maybe I should freeze all parameters except the JPEG layer.
Feb 20
No obvious difference between the JPEG version and the non-compressed version (10 epochs, ResNet, top-1).
The compression rate is stable for a certain qtable. We can therefore train our own qtables (which should also give a similar plot).
| quality | mean size | compression |
|---------|-------------|-------------|
| 20 | 21000 bytes | 7.5x |
| 50 | 40000 bytes | 3.7x |
<img src=figures/jenna20.png width=350>
<img src=figures/jenna50.png width=350>
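Numbers like the ones above can be reproduced along these lines with Pillow (a sketch; the gradient-plus-noise image below is a stand-in for the jenna photo, so the exact byte counts will differ):

```python
import io
import numpy as np
from PIL import Image

def jpeg_size(img, quality):
    """Bytes of `img` after JPEG encoding at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return len(buf.getvalue())

# Stand-in image (horizontal gradient + noise); not the original photo.
rng = np.random.default_rng(0)
arr = (np.linspace(0, 255, 256)[None, :].repeat(256, axis=0)
       + rng.normal(0, 20, (256, 256))).clip(0, 255).astype(np.uint8)
img = Image.fromarray(np.stack([arr] * 3, axis=-1))

raw_bytes = arr.size * 3                  # uncompressed 8-bit RGB size
ratio20 = raw_bytes / jpeg_size(img, 20)  # compression ratio at quality 20
ratio50 = raw_bytes / jpeg_size(img, 50)  # compression ratio at quality 50
```

Lower quality gives smaller files and therefore a higher ratio, matching the table.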
Experiments are conducted to see: (a) whether the "compression leads to accuracy loss" problem exists; (b) if it does, whether we can design a better qtable with the same compression rate but higher (or equal) accuracy.
Setup:
Dataset of 3 classes, 1384 images for training and 142 for validation. Is the validation set too small?
No large accuracy difference between low quality and high quality (variation from 94 to 96). Training with low-quality data can fix everything.
Perhaps we can use larger validation set.
Or solve the test-quality issue: train a network with high-quality images; testing then gets affected. (But why not just train with low-quality images? How about fine-tuning?)
Switch to other tasks, not image classification.
Validation uses the qtable as integers, which opens a gap between training and validation accuracy. Sometimes we get 0.33 accuracy for validation. Do we truly need an integer qtable?
Semantic segmentation for MIT ADE20K dataset, encoder resnet50dilated, decoder ppm_deepsup, epoch 20.
No fake uncompression: a) downsampling cannot simulate uncompressed input; b) annotation becomes tricky; c) by default, the input gets randomly scaled; why?
<img src=figures/orig.png width=350>
| uncompressed | JPEG, quality = 20 |
|--------------|--------------------|
| 62.5% | 53.6% |
| <img src=figures/uncomp.png width=350> | <img src=figures/jpeg20.png width=350> |
Classification
Compression rate difference between uncompressed images and fake uncompressed images (ResNet, epochs = 25, 1/L1 regularization, factor = 0.1; different datasets are tricky).
| setting | compressed size |
|---------|-----------------|
| uncomp (jenna), jpeg 20 | 21000 bits, 7.5x compression |
| fake uncomp (comp jenna, downsampled), my jpeg 20 | 7300 bits, 5x compression |
Observations
The gap of roughly 2x in compression rate is too large. Fake uncompression may not be a good approximation of uncompressed images, especially when our focus is the compression rate.
Quantization table may differ from application to application (my jpeg on different classification tasks).
Design questions
How to initialize?
Initialize with the standard qtable vs. random values. The performance of the latter is really bad. Here is the training result on the 3-class classification task.
How to design regularization term?
Observe how the qtable changes from the same initialization: a) all terms increase, then the first frequency component increases more slowly and eventually decreases; in zig-zag order, terms 2-3 get affected next, and then 4-6. b) Chrominance and luminance behave differently.
1/L1 regularization may not simulate the compression rate well. Integrate real compression? Will the batch size change the training behavior?
What if the high frequency component hypothesis is wrong?
Train with the CNN fixed and observe the behavior of the qtable. It always decreases starting from the high-frequency components!
April 12
Compressed file size vs. $\sum_{i=0}^{63} F_i / Q_i$
<img src=figures/rate_qtable.png width=350>
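A sketch of how the x-axis proxy could be computed, assuming $F_i$ are the DCT coefficients of an 8x8 block (taken in absolute value) and $Q_i$ the qtable entries. The DCT matrix is built by hand so nothing beyond numpy is needed:

```python
import numpy as np

# Orthonormal 8x8 DCT-II basis matrix.
N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
D[0, :] = np.sqrt(1.0 / N)

def qtable_proxy(block, qtable):
    """sum_i |F_i| / Q_i for one 8x8 pixel block (rate proxy)."""
    F = D @ (block - 128.0) @ D.T   # level shift, then 2-D DCT
    return np.sum(np.abs(F) / qtable)

rng = np.random.default_rng(0)
block = rng.integers(0, 256, (8, 8)).astype(np.float64)
flat_q = np.full((8, 8), 16.0)      # hypothetical flat qtable
proxy = qtable_proxy(block, flat_q)
```

The proxy is linear in $1/Q_i$, so doubling the whole qtable halves it, which is the monotone relation the plot above checks against actual file sizes.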
Fixing some errors and system problems:
remove abs in normalization
A strip unit test shows precision problems in Python when values get too small. The fix I use is to mask components with $std < 10^{-5}$ to zero. (Changed to the mean of absolute values.)
Setting the learning rate to 0.001 fixes the 96% accuracy problem. Setting the regularization factor to 0.05 gives 96.0% accuracy (as good as without JPEG). Masking the lower half of the qtable gives 95.5% accuracy. Setting the regularization factor to 0.3 gives 95.3% accuracy, and the compressed figure looks quite different.
Problems:
Cannot reproduce the >90% accuracy result without a pretrained model. I hope to initialize without pretrained parameters, since the starting point may affect where the qtable converges. (Train with a simple model.)
A checkerboard test shows it's hard to distinguish the "importance" of the nonzero qtable terms; there is no magnitude difference. (Explain the checkerboard.)
April 24
To know whether the current network is able to learn the importance of the qtable, we need to test it on a purpose-built dataset.
One way to do this is to create an 8x8 frequency table as the DCT of some input 8x8 pixels, where only the center 4x4 entries are a) vertically or b) horizontally increasing. The other entries are uniformly random.
However, it turns out that applying the inverse DCT to randomly generated frequencies cannot produce pixels in the range 0 to 255, and shifting and scaling the pixels doesn't help: it changes the frequency distribution.
The other approach I tried was to randomly generate 8x8 pixels and transform them into an 8x8 frequency table. Then I sort the center 4x4 frequencies, transform them back with the inverse DCT-II, and round to integers. I also apply the DCT-II again to check whether rounding to integers preserves the sorting, and the results look normal.
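The second approach could be sketched like this (helper names are mine; the clip to [0, 255] is an extra safeguard not mentioned above):

```python
import numpy as np

# Orthonormal 8x8 DCT-II basis matrix.
N = 8
k, n = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
D[0, :] = np.sqrt(1.0 / N)

def make_sample(rng):
    """Random 8x8 block whose center 4x4 DCT coefficients are sorted."""
    pixels = rng.integers(0, 256, (N, N)).astype(np.float64)
    F = D @ pixels @ D.T                                     # forward DCT-II
    F[2:6, 2:6] = np.sort(F[2:6, 2:6], axis=None).reshape(4, 4)  # sort center
    back = D.T @ F @ D                                       # inverse (D orthogonal)
    return np.round(back).clip(0, 255).astype(np.int64)

rng = np.random.default_rng(0)
sample = make_sample(rng)
```

Since sorting only permutes the center coefficients, the block's energy is unchanged, which is why the inverse transform tends to stay in the valid pixel range, unlike fully random frequencies.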
| Setup (10 epochs) | Accuracy |
|-------------------|----------|
| mask frequencies except the 4x4 center, no normalization | 100% |
| mask frequencies except the 4x4 center, with normalization | 99.5% |
| no masking, no normalization (qtable seems random) | 97.5% |
| no masking, with normalization (qtable seems random) | 96.5% |
| quantization table fixed, no masking | 99.5% |
Without regularization, start with a medium-level qtable and learn the convergence rate of the center.
Look at the math work! With a simple network (jpeg layer + linear layer).
June 10th
Fixed point
word length: word length of each fixed-point number
fraction length: fractional length of each fixed-point number
Floating point
mantissa: a signed (positive or negative) digit string of a given length in a given base.
exponent: determines the magnitude of the number.
| rounding | exp = 5, man = 2 | exp = 2, man = 1 |
|----------|------------------|------------------|
| nearest | 94.6% | 91.0% |
| stochastic | 94.7% | 92.5% |
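For concreteness, a sketch of the fixed-point side with word length `wl` and fraction length `fl` (nearest rounding; the function name and signature are mine, not qpytorch's API):

```python
import numpy as np

def fixed_point_quantize(x, wl, fl):
    """Round x to a signed fixed-point grid: wl total bits, fl fractional bits."""
    scale = 2.0 ** fl
    lo = -(2.0 ** (wl - 1)) / scale         # most negative representable value
    hi = (2.0 ** (wl - 1) - 1) / scale      # most positive representable value
    return np.clip(np.round(x * scale) / scale, lo, hi)

# wl=8, fl=4: grid step 1/16, representable range [-8.0, 7.9375].
q = fixed_point_quantize(np.array([0.3, -9.0, 1.23]), wl=8, fl=4)
```

Stochastic rounding would replace `np.round` with a round-up probability proportional to the fractional remainder; only the rounding rule changes, not the grid.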
Limited knowledge we can borrow from qpytorch:
double ldexp(double value, int exp); //return $value*2^{exp}$
Adaptive quantization of neural networks (no source code, sad)
It generalizes the quantization problem:
$\min\limits_{W} N_Q(W) = \sum_i N_q(\omega_i)$
subject to $L(W) \leq \overline{l}$,
where $N_q(\omega_i)$ is the minimum number of bits required to represent $\omega_i$ in its fixed-point format.
Given $\omega_i^q \leftarrow \mathrm{round}(2^{N_q^i} \omega_i)/2^{N_q^i}$ (assuming $\omega_i \in (0,1)$), it can be deduced that $N_Q(W) \leq -\sum_{i=1}^{n} \log_2 \tau_i$.
Under a specific loss function, we can solve for $\tau$ with the KKT conditions and then learn $N_Q(W)$. (The only problem is that I don't fully understand the proof starting from the KKT conditions...)
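A quick numeric check of the rounding rule above: rounding to $N$ fractional bits puts $\omega$ on a grid of spacing $2^{-N}$, so the error is at most half a grid step, $2^{-(N+1)}$.

```python
import numpy as np

def quantize(w, n_bits):
    # round(2^N * w) / 2^N: keep n_bits fractional bits of w.
    return np.round(w * 2.0 ** n_bits) / 2.0 ** n_bits

rng = np.random.default_rng(0)
w = rng.uniform(0, 1, 1000)
for n in (2, 4, 8):
    err = np.abs(w - quantize(w, n)).max()
    assert err <= 2.0 ** -(n + 1)  # nearest-rounding error bound
```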
Relaxed quantization for discretized neural network
Algorithm 1 is currently being tested. I understand each step, except how $\epsilon$ is deduced.
It works (the visualization seems fine) for the first few trials but soon produces NaN.
How to develop unit tests...
July 1st
Questions about the algorithm:
- $r$ is $[-2^b - 0.5, \ldots, 2^b - 0.5]$. It is not symmetric. Why?
- Why does $\sum z_i g_i$ give back the original value?
- $\sigma$: what does the value mean? Is it set to $\alpha / 3$ all the time, or only at initialization?
Resizing: statically (no extra storage needed when recovering) or dynamically.
When the bit width $b$ is small enough (e.g. 2), it gives a lot of grids with black frames.
Training and testing output very different JPEGs.
Hard to tell the difference between training with alpha and without alpha.
| bits # | train Jpeg | train cnn only |
|--------|------------|----------------|
| 8 | 95.8% | |
| 5 | 95.5% | 95.8% |
| 3 | 87.5% | 86.6% |
July 16th
We look at the explorable space and find that with sorted or slightly perturbed qtables, we can get a "better" compression rate at the same accuracy. It's a good indication that there is a lot more room for qtable designers.
Problems:
Choice of algorithm
Runtime
Objective function: 1/(comp_rate * acc) reflects a single point, not a curve.
July 31st
Let's switch to object detection, since the dataset is designed for it. Before that, though, I have doubts about the compression rates we get. The graph may indicate that the optimal points from the previous results will change with a different dataset.
Actually, when the image is large, compression may not affect learning accuracy, in the sense that 8x8 blocks do not affect the overview of an image that large (6000x4000 pixels). We didn't notice the problem for classification because the data I compressed had already been downsampled.
Standard JPEG-compressed images for object detection.
Visualization for images compressed with different quality factors:
quality 5
quality 15
quality 30
On one hand, compression by itself doesn't make a figure hard to distinguish, but downscaling combined with compression does.
On the other hand, to store a "small" image, compression alone is never enough, though it can be an important factor. So I believe it is persuasive to work with smaller images at a better resolution (600x400 pascalraw and ImageNetV2), which actually simulates real database generation well.
Aug 8th
Compression distribution
It seems the compression rate most likely follows a skewed normal distribution, though I don't fully understand the mean and variance here.
Another concern is that it's hard to do cross-validation / bucket tests for ImageNetV2, since it only has 10 images per class. (Cropping?)
ImageNetV2 (jpeg)
quality 10
quality 15
Also, for ImageNetV2 with different input scales, we would like to use the JPEG size as the baseline. This simulates well the real situation that BMP is not a usual storage format for neural-network datasets.
ImagenetV2 datasets - libjpeg compression rate vs. accuracy.
The compression rate is calculated as (original .jpeg file size / compressed .jpg file size) because of the wide range of file sizes. This is also fair, since in neural-network applications people only store images as PNG or JPEG files, not BMPs.
The compression rate doesn't vary a lot, though I need to check with chris later. Error-bar calculation for the compression rate:
ImageNetV2 is a nice dataset to work on because of its efficiency. However, we may not be able to do all the cross-validations on accuracy.
Aug 24th
After recreating the ImageNetV2 dataset from the flickr API and generating our own sorted random qtables to compress it, it turns out that the random results are not necessarily better than standard JPEG, but they can get close.
The MCMC results only degrade, probably because the table is changed each time by pure random numbers in the range [-step, step]. The chance of producing a better quantization table seems slim.
Genetic algorithm version 1 can reach a point similar to one standard JPEG point. This is because the fitness function is simply set to $Acc \cdot CompRate$, while at the beginning the population is initialized with 15 sorted random qtables.
In genetic algorithm version 2, I set the fitness function to $\frac{Rate}{f(Acc)}$, where $f(Acc)$ is a cubic polynomial such that $f(Acc) \approx Rate$. In this case, the results form a curve rather than a single point, but do not yet beat the standard qtable.
Aug 28th
Methods I tried so far:
- MCMC: randomly pick some coefficients of the qtable and apply random changes in the range [-7, 7]. Too few changes in the end; no better than standard JPEG.
- MCMC: randomly pick some coefficients of the qtable and replace each with the average of 5 values given the box [[0,1,0],[1,1,1],[0,1,0]]. Eventually they all become one value; no better than standard JPEG.
- GA: p% of parent1 and (1-p)% of parent2. The population gradually becomes the same qtable. No matter how many times I tried, GA always converges to one point after some trials.
Comments for GA.
Usually at the start, because the generated population is random, a larger compression rate usually comes with better accuracy. Then no matter how we pick parents, the selected parents always have a large compression rate. (How do we constrain the compression rate to be around 20?)
There is also a reason these easily generated qtables produce large-compression-rate / low-accuracy pairs: when every coefficient of the qtable is large, most components of the 8x8 DCT block simply vanish. It is easy to get small components rather than large ones. (Redesign the sorted_random_qtable_generate() function!!!)
A Literature Review on Quantization Table Design for the JPEG Baseline Algorithm:
Rate-Distortion Approach: may get stuck at different local minima
Human Visual System Approach: not our topic
Genetic Algorithm: a) a normal GA should work, so why doesn't mine? b) In "Identification of the best quantization table using genetic algorithms", why didn't I see any compression-rate-related computation in the fitness? c) Pareto dominance to plot a curve rather than a single point, but the experiment cost could be huge. d) Knowledge-based GA (on crossover and mutation). e) Does it count to use/combine existing approaches? Are they too old (mostly before 2013, while the latest KBGA is from 2016)?
Differential evolution is said to be better than GA, but how do I get the paper?
Particle Swarm Optimization, Firefly algorithms
Dec 1st
Why do we need to focus on deep learning accuracy rather than PSNR?
Sorted random
Bayesian
Do "good" qtables generalize to other test sets? How about retraining?
Amir's comments and similar ones I received at the workshop: add JPEG as a separate layer and train it. Though I've tried this before, they offered a different angle on how to perform the quantization.
Instead of multiplying the quantization back, the DCT coefficients could first be normalized and then only quantized. Then we have
x = linear_transform(input)
round(weight * x + bias)
So the weight doesn't need to be a qtable!
One problem is that at some point we should recover it back, because the DCT does not keep enough spatial information. A similar issue exists with the YCbCr2RGB conversion.
We don't have a high-resolution training set. We might have to stick with ImageNet.
We may even abandon JPEG. Other image storage methods like PNG all work.
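A numeric sketch of the quantize-and-recover idea, with `weight`/`bias` as in the pseudocode above (the scalar values are made up; recovery inverts the affine map, so the error per coefficient is bounded by half a rounding step divided by the weight):

```python
import numpy as np

def quantize(x, weight, bias):
    # Learned affine map followed by rounding (the JPEG-layer forward pass).
    return np.round(weight * x + bias)

def recover(q, weight, bias):
    # Invert the affine map; error <= 0.5 / |weight| per coefficient.
    return (q - bias) / weight

rng = np.random.default_rng(0)
x = rng.normal(0, 50, (8, 8))      # stand-in for DCT coefficients
weight, bias = 0.25, 0.0           # hypothetical learned scalars
x_hat = recover(quantize(x, weight, bias), weight, bias)
err = np.abs(x - x_hat).max()
```

With a standard qtable, `weight` would be `1/Q` per coefficient; here it is free to be anything learnable, which is exactly the point.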