1. A deep dive into the principles and practice of FFM
Based on the Meituan technical team's article "In-depth FFM: Principles and Practice".
The FM and FFM models were proposed in recent years. Because they still deliver strong performance when the data volume is large and the features are sparse, they have repeatedly done well in CTR-prediction competitions held by major companies. While building its DSP, the Meituan technical team explored FM and FFM models for CTR and CVR prediction and achieved good results.
- After one-hot encoding, most feature values in the sample data are zero, i.e. the features are sparse.
- One-hot encoding also greatly enlarges the feature space (see the short sketch after this list).
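A minimal sketch of this effect on a made-up two-column DataFrame (the column names and values are only for illustration):

```python
import pandas as pd

# Hypothetical categorical sample.
df = pd.DataFrame({
    'gender': ['male', 'female', 'male'],
    'city':   ['beijing', 'shanghai', 'shenzhen'],
})

# One-hot encoding: every distinct value becomes its own 0/1 column, so the
# feature space grows with the number of distinct values, and in any single
# row most of those columns are 0 (i.e. the data becomes sparse).
one_hot = pd.get_dummies(df, columns=['gender', 'city'])
print(one_hot)
```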
At the same time, by observing a large number of samples it can be seen that the correlation between certain features and the label increases once those features are combined. Here, FFM is mainly used to estimate in-site CTR and CVR, that is, a user's predicted click-through rate on an item and the post-click conversion rate.
Both the CTR and CVR prediction models are trained offline and then used for online prediction. The two models use similar features, mainly of three types: user features, item features, and user-item matching features. User features include basic information such as age, gender, occupation, interests, category preference and browsed/purchased categories, as well as statistics such as the user's recent clicks, purchases and spending. Item features include category, sales volume, price, rating, historical CTR/CVR and so on. User-item matching features mainly include browsed/purchased category match, browsed/purchased merchant match, interest preference match, and so on.
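For concreteness, one (user, item) sample could be imagined as the following raw feature record before any encoding; all field names and values here are hypothetical:

```python
# Hypothetical raw features for a single (user, item) pair.
sample = {
    # user features
    'age': 28, 'gender': 'female', 'recent_clicks': 12,
    # item features
    'item_category': 'food', 'price': 45.0, 'hist_ctr': 0.031,
    # user-item matching features
    'browsed_category_match': 1, 'merchant_match': 0,
}
```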
To use the FFM method, every feature must be converted into the format "field_id:feat_id:value", where field_id is the number of the field the feature belongs to, feat_id is the feature number, and value is the feature's value.
Numerical features are easy to handle: each one is simply given its own field number, e.g. the user's comment score or an item's historical CTR/CVR. Categorical features must first be one-hot encoded into numerical form; all features generated from one categorical feature belong to the same field, and their values can only be 0 or 1, e.g. the user's gender or age, or the item's category id. There is also a third type of feature, such as the categories a user has browsed/purchased: there are multiple category ids, and a numerical value measures how many items the user browsed or purchased in each category. Such features are handled like categorical features, except that the value is not 0 or 1 but the browse/purchase count. After obtaining field_id as described above, the transformed features are numbered sequentially to obtain feat_id, and the value is obtained in the same way as before.
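To illustrate the three cases, the following sketch builds one sample line by hand; the field numbers, feature numbers and values are all made up:

```python
# Hypothetical field/feature numbering:
#   field 0 - user comment score        (numerical: one field, one feature id)
#   field 1 - user gender (one-hot)     (categorical: one feature id per value, value is 0/1)
#   field 2 - browsed-category counts   (vector: one feature id per category id, value is a count)
features = [
    '0:0:4.5',   # comment score = 4.5
    '1:2:1',     # gender one-hot: feature id 2 is active
    '2:7:3',     # browsed 3 items in the category mapped to feature id 7
]
label = 1
print('%s %s' % (label, ' '.join(features)))   # -> "1 0:0:4.5 1:2:1 2:7:3"
```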
2. Code example
The example code is taken from wangru8080/gbdt-lr.
In that repository, FFM is trained with the libffm library. The code only provides the method for constructing the data input (FFMFormat); once the input format has been built, libffm can be used directly for training.
The training format required by libffm is somewhat special:
label field_id:feature_id:value field_id:feature_id:value field_id:feature_id:value ...
- field_id represents the id number of each feature field
- feature_id represents the id number of each feature value (sequential numbering or hash encoding can be used)
- value: if the feature field is not a continuous feature, value = 1; if it is a continuous feature, value = the feature's actual value
For data in a pandas DataFrame:
```
label  category_feature  continuous_feature  vector_feature
=====  ================  ==================  ==============
0      x                 1.1                 1 2
1      y                 1.2                 3 4 5
0      x                 2.2                 6 7 8 9
```
This example only involves category_feature, continuous_feature and vector_feature columns.
In wangru8080/gbdt-lr, the data-conversion code is:
```python
def FFMFormat(df, label, path, train_len, category_feature=[], continuous_feature=[]):
    nrows = df.shape[0]
    train = open(path + 'train.ffm', 'w')
    test = open(path + 'test.ffm', 'w')
    feature_index = 0          # global feature id counter
    feat_index = {}            # feature name -> feature id
    for i in range(nrows):
        feats = []
        field_index = 0        # each column is its own field
        # categorical columns: one feature id per distinct value, value is always 1
        for feat in category_feature:
            t = feat + '_' + str(df[feat][i])
            if t not in feat_index:
                feat_index[t] = feature_index
                feature_index += 1
            feats.append('%s:%s:%s' % (field_index, feat_index[t], 1))
            field_index += 1
        # continuous columns: a single feature id per column, value is the raw value
        for feat in continuous_feature:
            if feat not in feat_index:
                feat_index[feat] = feature_index
                feature_index += 1
            feats.append('%s:%s:%s' % (field_index, feat_index[feat], df[feat][i]))
            field_index += 1
        print('%s %s' % (df[label][i], ' '.join(feats)))
        # the first train_len rows go to train.ffm (with label), the rest to test.ffm
        if i < train_len:
            train.write('%s %s\n' % (df[label][i], ' '.join(feats)))
        else:
            test.write('%s\n' % ' '.join(feats))
    train.close()
    test.close()
```
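A hypothetical call on a DataFrame shaped like the example above (the column names, path and train_len are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'label': [0, 1, 0],
    'category_feature': ['x', 'y', 'x'],
    'continuous_feature': [1.1, 1.2, 2.2],
})

# The first two rows are written to ./train.ffm, the remaining row to ./test.ffm.
FFMFormat(df, label='label', path='./', train_len=2,
          category_feature=['category_feature'],
          continuous_feature=['continuous_feature'])
```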
The leaf-node indices output by LightGBM are discrete values, so they can be treated as categorical features when constructing the FFM input.
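A minimal sketch of that idea, assuming LightGBM's scikit-learn API and pre-existing X_train / y_train; the hyperparameters and column names below are arbitrary:

```python
import lightgbm as lgb
import pandas as pd

# X_train and y_train are assumed to already exist.
gbm = lgb.LGBMClassifier(n_estimators=50, num_leaves=31)
gbm.fit(X_train, y_train)

# pred_leaf=True returns, for every sample, the index of the leaf it falls
# into in each tree: one discrete value per tree.
leaf_idx = gbm.predict(X_train, pred_leaf=True)

# Each tree becomes one categorical column, so these columns can be passed
# to FFMFormat through the category_feature argument.
leaf_df = pd.DataFrame(leaf_idx,
                       columns=['leaf_%d' % i for i in range(leaf_idx.shape[1])])
```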
3. Kaggle: Pandas to libffm
website: https://www.kaggle.com/mpearmain/pandas-to-libffm
```python
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.datasets import make_classification

'''
Another CTR comp and so i suspect libffm will play its part, after all it is an atomic bomb for this kind of stuff.
A sci-kit learn inspired script to convert pandas dataframes into libFFM style data.
The script is fairly hacky (hey thats Kaggle) and takes a little while to run a huge dataset.
The key to using this class is setting up the features dtypes correctly for output (ammend transform to suit your needs)

Example below
'''


class FFMFormatPandas:
    def __init__(self):
        self.field_index_ = None
        self.feature_index_ = None
        self.y = None

    def fit(self, df, y=None):
        self.y = y
        df_ffm = df[df.columns.difference([self.y])]
        if self.field_index_ is None:
            self.field_index_ = {col: i for i, col in enumerate(df_ffm)}
        if self.feature_index_ is not None:
            last_idx = max(list(self.feature_index_.values()))
        if self.feature_index_ is None:
            self.feature_index_ = dict()
            last_idx = 0
        for col in df.columns:
            vals = df[col].unique()
            for val in vals:
                if pd.isnull(val):
                    continue
                name = '{}_{}'.format(col, val)
                if name not in self.feature_index_:
                    self.feature_index_[name] = last_idx
                    last_idx += 1
            self.feature_index_[col] = last_idx
            last_idx += 1
        return self

    def fit_transform(self, df, y=None):
        self.fit(df, y)
        return self.transform(df)

    def transform_row_(self, row, t):
        ffm = []
        if self.y != None:
            ffm.append(str(row.loc[row.index == self.y][0]))
        if self.y is None:
            ffm.append(str(0))

        for col, val in row.loc[row.index != self.y].to_dict().items():
            col_type = t[col]
            name = '{}_{}'.format(col, val)
            if col_type.kind == 'O':
                ffm.append('{}:{}:1'.format(self.field_index_[col], self.feature_index_[name]))
            elif col_type.kind == 'i':
                ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], val))
        return ' '.join(ffm)

    def transform(self, df):
        t = df.dtypes.to_dict()
        return pd.Series({idx: self.transform_row_(row, t) for idx, row in df.iterrows()})


########################### Lets build some data and test ############################

train, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                               n_redundant=2, n_classes=2, random_state=42)

train = pd.DataFrame(train, columns=['int1', 'int2', 'int3', 's1', 's2'])
train['int1'] = train['int1'].map(int)
train['int2'] = train['int2'].map(int)
train['int3'] = train['int3'].map(int)
train['s1'] = round(np.log(abs(train['s1'] + 1))).map(str)
train['s2'] = round(np.log(abs(train['s2'] + 1))).map(str)
train['clicked'] = y

ffm_train = FFMFormatPandas()
ffm_train_data = ffm_train.fit_transform(train, y='clicked')

print('Base data')
print(train[0:10])
print('FFM data')
print(ffm_train_data[0:10])
```
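The resulting Series holds one libffm-formatted string per row, so it can be written to a text file (the file name below is just an example) and then passed to libffm for training:

```python
# Write one sample per line in the format libffm expects.
with open('train.ffm', 'w') as f:
    f.write('\n'.join(ffm_train_data) + '\n')
```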