Domain Specific Units¶

Many researchers pointed out that DCNNs progressively compute more powerful feature detectors as depth increases REF{Mallat2016}. The authors from REF{Yosinski2014} and REF{Hongyang2015} demonstrated that feature detectors that are closer to the input signal (called low level features) are base features that resemble Gabor features, color blobs, edge detectors, etc. On the other hand, features that are closer to the end of the neural network (called high level features) are considered to be more task specific and carry more discriminative power.

In first insights section we observed that the feature detectors from \(\mathcal{D}^s\) (VIS) have some discriminative power over all three target domains we have tested; with VIS-NIR being the “easiest” ones and the VIS-Thermal being the most challenging ones. With such experimental observations, we can draw the following hypothesis:

Note

Given \(X_s=\{x_1, x_2, ..., x_n\}\) and \(X_t=\{x_1, x_2, ..., x_n\}\) being a set of samples from \(\mathcal{D}^s\) and \(\mathcal{D}^t\), respectively, with their correspondent shared set of labels \(Y=\{y_1, y_2, ..., y_n\}\) and \(\Theta\) being all set of DCNN feature detectors from \(\mathcal{D}^s\) (already learnt), there are two consecutive subsets: one that is domain textbf{dependent}, \(\theta_t\), and one that is domain textbf{independent}, \(\theta_s\), where \(P(Y|X_s, \Theta) = P(Y|X_t, [\theta_s, \theta_t])\). Such \(\theta_t\), that can be learnt via back-propagation, is so called Domain Specific Units.

A possible assumption one can make is that \(\theta_t\) is part of the set of low level features, directly connected to the input signal. In this paper we test this assumption. Figure bellow presents a general schematic of our proposed approach. It is possible to observe that each image domain has its own specific set of feature detectors (low level features) and they share the same face space (high level features) that was previously learnt using VIS.

Our approach consists in learning \(\theta_t\), for each target domain, jointly with the DCNN from the source domain. In order to jointly learn \(\theta_t\) with \(D_s\) we propose two different architectural arrangements described in the next subsections.

Siamese DSU¶

In the architecture described below, \(\theta_t\) is learnt using Siamese Neural Networks REF{Chopra2005}. During the forward pass, Figure (a), a pair of face images, one for each domain (either sharing the same identity or not), is passed through the DCNN. The image from the source domain is passed through the main network (the one at the top in Figure (a)) and the image from the target domain is passed first to its domain specific set of feature detectors and then amended to the main network. During the backward pass, Figure (b), errors are backpropagated only for \(\theta^t\). With such structure only a small subset of feature detectors are learnt, reducing the capacity of the joint model. The loss \(\mathcal{L}\) is defined as:

\(\mathcal{L}(\Theta) = 0.5\Bigg[ (1-Y)D(x_s, x_t) + Y \max(0, m - D(x_s, x_t))\Bigg]\), where \(m\) is the contrastive margin, \(Y\) is the label (1 when \(x_s\) and \(x_t\) belong to the same subject and 0 otherwise) and \(D\) is defined as:

\(D(x_s, x_t) = || \phi(x_s) - \phi(x_t)||_{2}^{2}\), where \(\phi\) are the embeddings from the jointly trained DCNN.

Results¶

Warning

Decribe the results from the paper

Understanding the Domain Specific Units¶

In this section we break down the covariate distribution of data points sensed in different image modalities layer by layer using tSNEs.

With these plots we expect to observe how data from different image modalities and different identities are organized along the DCNN transformations.

For each image domain we present:

The covariate distribution using the base network as a reference (without any adaptation) in the left column.
The covariate distribution using the best DSU adapted network (for each database) in the right column.

For this analysis we make all the plots using the Inception Resnet v2 as a basis.

Pola Thermal¶

For this analysis, the columns on the right are generated using the \(\theta_{t[1-4]}\) DSU.

Pixel level distribution¶

Below we present the tSNE covariate distribution using the pixels as input. Blue dots represent VIS samples and red dots represent Thermal samples. It’s possible to observe that images from different image modalities do cluster, which is an expected behaviour.

Conv2d_1a_3x3 (\(\theta_{t[1-1]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-1}\)). We can observe that in the very first layer the identities are clustered for both, adapted and non adapted, DCNNs. Moreover, the image modalities form two “big” clusters.

../_images/THERMAL_NOadapt_1-4_1_flat.png

Conv2d_3b_1x1 (\(\theta_{t[1-2]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-2}\)). We can observe that in the very first layer the identities are clustered for both, adapted and non adapted, DCNNs. Moreover, the image modalities form two “big” clusters.

../_images/THERMAL_NOadapt_1-4_2_flat.png

Conv2d_4a_3x3 (\(\theta_{t[1-4]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-4}\)). We can observe that in the very first layer the identities are clustered for both, adapted and non adapted, DCNNs. This is the last adapted layer for this setup and the image modalities are still organized in two different clusters, which is a behaviour that, at first glance, is not expected.

../_images/THERMAL_NOadapt_1-4_4_flat.png

Mixed_5b (\(\theta_{t[1-5]}\))¶

From now, the layers are not DSU adapted. Below we can observe the same behaviour as before. Modalities are clustered in two “big” clusters and inside of these clusters, the identities are clustered.

../_images/THERMAL_NOadapt_1-4_5b_flat.png

../_images/THERMAL_adapt_1-4_5b_flat.png

Mixed_6a (\(\theta_{t[1-6]}\))¶

Below we can observe the same behaviour as before. Modalities are clustered in two “big” clusters and inside of these clusters, the identities are clustered.

../_images/THERMAL_NOadapt_1-4_6a_flat.png

../_images/THERMAL_adapt_1-4_6a_flat.png

Mixed_7a¶

Below we can observe the same behaviour as before. Modalities are clustered in two “big” clusters and inside of these clusters, the identities are clustered.

../_images/THERMAL_NOadapt_1-4_7a_flat.png

../_images/THERMAL_adapt_1-4_7a_flat.png

Conv2d_7b_1x1¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster.

../_images/THERMAL_NOadapt_1-4_7b_flat.png

../_images/THERMAL_adapt_1-4_7b_flat.png

PreLogitsFlatten¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster. We can use this layer as the final embedding.

../_images/THERMAL_NOadapt_1-4_prelog_flat.png

../_images/THERMAL_adapt_1-4_prelog_flat.png

Final Embedding¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster. We can use this layer as the final embedding.

../_images/THERMAL_NOadapt_1-4_emb_flat.png

../_images/THERMAL_adapt_1-4_emb_flat.png

CUFSF¶

For this analysis, the columns on the right is generated using the \(\theta_{t[1-5]}\) DSU.

Pixel level distribution¶

Below we present the tSNE covariate distribution using the pixels as input. Blue dots represent VIS samples and red dots represents Thermal samples. It’s possible to observe that images from different image modalities do cluster, which is an expected behaviour.

Conv2d_1a_3x3 (\(\theta_{t[1-1]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-1}\)). We can observe that in the very first layer the identities are clustered (of course they are clustered, we have only one sample per identity/modality) for both, adapted and non adapted, DCNNs. Moreover, the image modalities form two “big” clusters.

Conv2d_3b_1x1 (\(\theta_{t[1-2]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-2}\)). We can observe that in the very first layer the identities are clustered (of course they are clustered, we have only one sample per identity/modality) for both, adapted and non adapted, DCNNs. Moreover, the image modalities form two “big” clusters.

Conv2d_4a_3x3 (\(\theta_{t[1-4]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-4}\)). We can observe that in the very first layer the identities are clustered (of course they are clustered, we have only one sample per identity/modality) for both, adapted and non adapted, DCNNs.

Mixed_5b (\(\theta_{t[1-5]}\) DSU adapted)¶

From now, the layers are not DSU adapted. Below we can observe the same behaviour as before. This is the last adapted layer for this setup and the image modalities are still organized in two different clusters, which is a behaviour that, at first glance, is not expected.

Mixed_6a (\(\theta_{t[1-6]}\))¶

Below we can observe the same behaviour as before. Modalities are clustered in two “big” clusters and inside of these clusters, the identities are clustered.

../_images/CUFSF_NOadapt_1-5_6a_flat.png

Mixed_7a¶

Below we can observe the same behaviour as before. Modalities are clustered in two “big” clusters and inside of these clusters, the identities are clustered.

../_images/CUFSF_NOadapt_1-5_7a_flat.png

Conv2d_7b_1x1¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster.

../_images/CUFSF_NOadapt_1-5_7b_flat.png

PreLogitsFlatten¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster. We can use this layer as the final embedding.

../_images/CUFSF_NOadapt_1-5_prelog_flat.png

../_images/CUFSF_adapt_1-5_prelog_flat.png

Final Embedding¶

In the left tSNE (non DSU), we can observe the same behaviour as before. However, in the tSNE on the right we can observe that images from the same identities, but different image modalities start to cluster. We can use this layer as the final embedding.

../_images/CUFSF_NOadapt_1-5_emb_flat.png

CUHK-CUFS¶

For this analysis, the columns on the right is generated using the \(\theta_{t[1-5]}\) DSU.

Pixel level distribution¶

Below we present the tSNE covariate distribution using the pixels as input. Blue dots represent VIS samples and red dots represents Sketch samples. It’s possible to observe that images from different image modalities do cluster, which is an expected behaviour.

Conv2d_1a_3x3 (\(\theta_{t[1-1]}\) DSU adapted)¶

Below we present the tSNE covariate distribution using the output of the first layer as input (\(\theta_{1-1}\)). We can observe that from this layer, in both cases (left and right), images from different image modalities belongs to the same cluster. It’s not possible to use a linear classifier to classify both modalities. Moreover, the identities from different image modalities seems to form small clusters for some cases. For information, this database has only ONE pair of images sensed in both modalities.