One-hot encoding (also called dummy variables) is a method of converting a categorical variable into several binary columns, where a 1 indicates that a row belongs to that category. From a machine learning perspective, however, it is often a poor way to encode categorical variables.
It adds many dimensions, and in general lower dimensionality is better. For example, if a column represents US states (such as California and New York), a one-hot scheme adds about 50 extra dimensions.
This not only inflates the dimensionality of the dataset but also carries little information: a handful of 1s scattered among a large number of 0s. The resulting sparsity makes optimization difficult, especially for neural networks, whose optimizers can easily wander into the wrong part of a space full of empty dimensions.
To make matters worse, the sparse columns are linearly related to one another. This means that any one of them can be predicted from the others, which can lead to multicollinearity problems in high dimensions.
An ideal dataset consists of features whose information is independently valuable, and one-hot encoding creates exactly the opposite. Of course, if there are only three or four classes, one-hot encoding may not be a bad choice. But depending on the relative size of the dataset, alternatives may be worth exploring.
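As a concrete illustration, here is a minimal sketch of one-hot encoding with pandas; the column and category values are hypothetical. Note that `pd.get_dummies` accepts `drop_first=True`, which removes one redundant column per variable and avoids the exact linear dependence described above (the "dummy variable trap"):

```python
import pandas as pd

# Hypothetical data: a single categorical "state" column.
df = pd.DataFrame({"state": ["CA", "NY", "TX", "CA", "NY"]})

# One binary column per category; a full 50-state column would
# add roughly 50 dimensions to the dataset.
one_hot = pd.get_dummies(df, columns=["state"])
print(one_hot)

# drop_first=True drops one category, which is implied whenever all
# remaining columns are 0, removing the perfect linear dependence.
one_hot_reduced = pd.get_dummies(df, columns=["state"], drop_first=True)
print(one_hot_reduced)
```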
Target encoding represents a categorical column effectively while taking up only a single feature. Also known as mean encoding, it replaces each value in the column with the mean target value for that category. This expresses the relationship between the categorical variable and the target more directly, and it is a very popular technique (especially in Kaggle competitions).
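Here is a minimal sketch of mean (target) encoding with pandas, using hypothetical column names. Each category is replaced by the mean of the target within that category:

```python
import pandas as pd

# Hypothetical data: a categorical feature and a binary target.
df = pd.DataFrame({
    "state":  ["CA", "NY", "CA", "TX", "NY", "CA"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean target value per category.
means = df.groupby("state")["target"].mean()

# Replace each category with its mean target value:
# a single numeric column instead of many binary ones.
df["state_encoded"] = df["state"].map(means)
print(df)
```

In practice these means are usually computed on the training fold only (or with out-of-fold averages), since computing them over the full dataset leaks the target into the features.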
This encoding method has some drawbacks. First, it makes it harder for the model to learn relationships between a mean-encoded variable and other variables: it draws similarities between categories only through their relationship with the target, which has both advantages and disadvantages.
The encoding is also quite sensitive to the y variable, which affects the model's ability to extract the encoded information.
Because every value in a category is replaced by the same number, the model may tend to overfit to the encoded values it has seen (for example, treating 0.8 as something completely different from 0.79). That is the consequence of treating values on a continuous scale as if they were hard categories. It is therefore important to carefully monitor the y variable for outliers.
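To illustrate this sensitivity to outliers in y, the following hypothetical sketch shows how a single extreme target value shifts a category's encoded mean, and with it every row in that category:

```python
import pandas as pd

# Hypothetical data: one extreme target value in category "A".
df = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "y":        [1.0, 1.2, 100.0, 1.1, 0.9, 1.0],  # 100.0 is an outlier
})

# The outlier drags A's encoded value far from its typical level,
# so every row labeled "A" inherits the distortion.
print(df.groupby("category")["y"].mean())
# A ≈ 34.07, versus 1.0 for B, even though most A targets are near 1.0
```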