A brief introduction to Embedding

Embedding is a technique widely used in Recommender Systems. The first time I met it, it confused me for a long time. Now let me take 5 minutes to tell you what Embedding is. It will be just a brief introduction without going into details.

In the field of Computer Vision, all the data you get is labelled images. It is easy to encode an image in a computer, for example as RGB values. However, for a Recommender System, what you get is 'user id', 'historical behavior', 'item name', 'item category'. Your data may look like this:

| User id | Historical behavior (what he bought) | Item name (recommended item) | Item category |
|---------|--------------------------------------|------------------------------|---------------|
| Jack    | Sweater, Shoes, Watch, Banana        | Apple                        | Fruit         |
| John    | Banana, Apple, Watch                 | Sweater                      | Clothes       |
| Steven  | Computer, Credit Card                | Banana                       | Fruit         |
| Bob     | Sweater, Book, Pen, Banana           | Shoes                        | Clothes       |

(We may also need the category information for the items in 'historical behavior' in a real Recommender System.)

But string values cannot be fed to a neural net. What can we do? A simple way is one-hot encoding.

After one-hot encoding, the item names may look like this:

| Item name | One-hot encoding |
|-----------|------------------|
| Apple     | 1000             |
| Sweater   | 0100             |
| Banana    | 0010             |
| Shoes     | 0001             |

Well, it seems one-hot encoding has already solved the problem. So why should we introduce embedding?

The most important reason is that one-hot encoding ignores the inner relations between values. Such inner relations are also called high-order information. For example, "Apple" is obviously related to "Banana" but not to "Shoes". Yet "Apple" · "Banana" = (1,0,0,0) · (0,0,1,0) = 0 and "Apple" · "Shoes" = (1,0,0,0) · (0,0,0,1) = 0. One-hot encoding cannot tell the relation between "Apple" and "Banana" apart from the relation between "Apple" and "Shoes".
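To make this concrete, here is a minimal Python sketch (the tiny vocabulary and helper function are assumptions just for this illustration) showing that every pair of distinct one-hot vectors has a dot product of zero, so no notion of similarity survives:

```python
import numpy as np

# A tiny item vocabulary, assumed only for this illustration.
vocab = ["Apple", "Sweater", "Banana", "Shoes"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot encode a word over the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Every pair of different items looks equally (un)related: the dot product is 0.
print(one_hot("Apple") @ one_hot("Banana"))  # 0.0
print(one_hot("Apple") @ one_hot("Shoes"))   # 0.0
```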

Thus, we introduce a single fully-connected (FC) layer to learn such inner relations. This fully-connected layer is exactly the embedding.

What is the difference between an embedding and an FC layer?

From the view of the network, they are exactly the same. But since the input of this FC layer is one-hot encoded, we do not need to actually implement an FC layer, which would be costly. Instead, we implement a look-up table. The original string input looks up its vector in this table via its index in the word dictionary. The look-up table is trainable and behaves just like an FC layer. So the truth is: there is no mysterious "embedding", just one-hot input to an FC layer.
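Here is a minimal numpy sketch of that equivalence (the toy sizes are assumptions for illustration): multiplying a one-hot vector by the FC weight matrix simply selects one row of it, which is exactly what a table look-up returns.

```python
import numpy as np

vocab_size, embed_dim = 4, 3                 # assumed toy sizes
W = np.random.randn(vocab_size, embed_dim)   # trainable FC weights / look-up table

item_id = 2                                  # e.g. the index of "Banana" in the dictionary
one_hot = np.zeros(vocab_size)
one_hot[item_id] = 1.0

fc_output = one_hot @ W                      # full matrix multiplication
lookup_output = W[item_id]                   # direct table look-up

print(np.allclose(fc_output, lookup_output))  # True
```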

How do we embed sequence information like historical behavior?

One way is to embed the sequence input into a sequence of outputs; your network then has to be modified to handle inputs of varying sequence length. Another way is to fuse the sequence input into a single vector via a mean or something similar.
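A minimal sketch of the second option, mean pooling, assuming a toy look-up table and made-up item ids:

```python
import numpy as np

vocab_size, embed_dim = 10, 3                    # assumed toy sizes
table = np.random.randn(vocab_size, embed_dim)   # the embedding look-up table

# Historical behavior as a sequence of item ids (any length).
history_ids = [1, 4, 7, 2]

# Embed each item, then fuse the sequence into one vector by mean pooling.
history_vectors = table[history_ids]                    # shape: (4, embed_dim)
user_history_embedding = history_vectors.mean(axis=0)   # shape: (embed_dim,)

print(user_history_embedding.shape)                     # (3,)
```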

The final question is: how do we build an embedding look-up table?

The first step is to collect the word dictionary. You have to know how many distinct words you have.

The second step is one-hot encoding. You may hash the input first so that you can reduce the number of values to be encoded.
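A minimal sketch of such a hashing trick (the bucket count is an arbitrary assumption): every raw string is mapped to one of a fixed number of buckets, so the table size stays bounded at the cost of occasional collisions.

```python
import hashlib

num_buckets = 1000  # assumed table size; collisions are possible but kept rare

def hashed_id(word):
    """Map an arbitrary string to a stable bucket id in [0, num_buckets)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

print(hashed_id("Apple"), hashed_id("Banana"))
```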

The final step is to construct a trainable look-up table. Its initial values are random.
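Putting the steps together, here is a minimal sketch that uses PyTorch's nn.Embedding as the trainable, randomly initialized look-up table (the vocabulary and embedding size are assumed toy values):

```python
import torch
import torch.nn as nn

# Step 1: the word dictionary (an assumed toy vocabulary).
vocab = ["Apple", "Sweater", "Banana", "Shoes"]
index = {word: i for i, word in enumerate(vocab)}

# Final step: a trainable look-up table, randomly initialized.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Looking items up by their indices in the dictionary.
item_ids = torch.tensor([index["Apple"], index["Banana"]])
vectors = embedding(item_ids)   # shape: (2, 8), trained together with the rest of the net
print(vectors.shape)
```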

A supplement on embedding: mathematically, an embedding multiplies the input vector by a short, wide matrix (few rows, many columns), which reduces the dimensionality of the data; it can also be understood as extracting the hidden factors behind the data. Since data vectors are usually sparse (in recommender systems), especially after one-hot encoding, only a few entries of the matrix multiplication actually need to be computed. This is why frameworks implement embedding (a fully-connected layer without bias) or embedding lookup (obtaining the embedding vector by directly indexing into a table, so only addressing is needed). The so-called word vector is essentially one-hot + embedding. One-hot encoding severs the connections between words and makes them completely independent (all one-hot vectors are mutually orthogonal), and as the bag of words grows, the one-hot encoding grows with it. Word vectors obtain a dense representation through dimensionality reduction, and at the same time give the originally independent vectors relations with each other.