RFM model for data analysis

Posted by Cosizzle on Fri, 24 Dec 2021 04:15:58 +0100

I Mean RFM model algorithm
Read the corresponding data from the csv file

data=pd.read_csv('./dataset.csv',encoding='ISO-8859-1')

#Read customer information in 2014

data_14=data[data['Order-year']==2014]
data_14

2. Get the corresponding column

data_14 = data_14[['CustomerID','OrderDate','Sales']]
data_14

CustomerID is the user id
OrderDate is the order date
Sales is the sales amount

3. Copy data (to avoid changing the original data when modifying)

customerdf=data_14.copy()
customerdf

4. Set CustomerID as index

customerdf.set_index('CustomerID',drop=True,inplace=True)

5 add transaction times field (convenient for later calculation of F)

customerdf['orders']=1
customerdf

6. Make a pivot table
Last purchase time purchase times total purchase amount

rfmdf=customerdf.pivot_table(
    index=['CustomerID'],
    values=['OrderDate','orders','Sales'],
    aggfunc={
        'OrderDate':'max',
        'orders':'sum',
        'Sales':'sum'
    }
)

aggfunc calculates the maximum value of OrderDate for the setting of each field, orders is the sum of the number of orders, and Sales is the total Sales amount of each user
7. R F M of each user
R can be calculated by subtracting the last purchase time of each user from the same standard. If the standard is the same, the R standard is the same

rfmdf['R']=(rfmdf.OrderDate.max()-rfmdf.OrderDate).dt.days
rfmdf.rename(columns={'Sales':'M','orders':'F'},inplace=True)
rfmdf

rfmdf.OrderDate.max() is the maximum value of the order placing date in the whole year of 2014. This value can be selected arbitrarily. If a value is selected, the calculated value of each column is the standard value
8 user labeling algorithm

Judgment result return string.So that it can be spliced later,Integer numbers cannot be spliced
def rfm_func(x):
#The difference from the mean value is set to 0 and 1
    res=x.apply(lambda x:'1'if x>0 else '0')
    label=res.R+res.F+res.M
    d={
        '011':'Important value customers', 
        '111':'Important call back customers',
        '001':'Important deep ploughing customers',
        '101':'Important retention customers',
        '010':'Potential customers',
        '110':'General maintenance customer',
        '000':'New customers',
        '100':'Lost customers'
    }
    result=d[label]
    return result

9. Connect RFMS, make difference between the average value of each RFM, and then label each user with a function

rfmdf['label']=rfmdf[['R','F','M']].apply(lambda x:x-x.mean()).apply(rfm_func,axis=1)

10. Calculate the number of each customer type

rfmdf.groupby('label').count()

11. Plot each customer type as abscissa and the value of customer type, i.e. quantity, as ordinate

rfmdf.label.value_counts().plot.bar(figsize=(20,8),fontsize=15)

II. RFM scoring algorithm
Note: the mean value algorithm reduces the error of the maximum and minimum values, so that the whole statistical error becomes larger
The scoring algorithm is to score each user according to a certain interval, so as to improve the overall statistical accuracy

1. Basic information of statistical table

customer_grade_df = rfmdf[['R','F','M']]
customer_grade_df.describe()

2. Set scoring range

#1.1 score the range of F value. The larger the F value, the higher the score
section_list_F=[0,5,10,15,20,50]
grade_F=pd.cut(customer_grade_df['F'],bins=section_list_F,labels=[1,2,3,4,5])
grade_F
customer_grade_df['F_S']=grade_F.values
#1.2 score the range of M value. The larger the M value, the higher the score
section_list_M=[0,500,1000,5000,10000,30000]
grade_M=pd.cut(customer_grade_df['M'],bins=section_list_M,labels=[1,2,3,4,5])
grade_M
customer_grade_df['M_S']=grade_M.values
#1.3 score the range of R value. The smaller the R value, the higher the score
section_list_R=[-1,32,93,186,270,365]
grade_R=pd.cut(customer_grade_df['R'],bins=section_list_R,labels=[5,4,3,2,1])
grade_R
customer_grade_df['R_S']=grade_R.values
customer_grade_df

Note: the interval of each RFM is set according to the maximum and minimum values in the above table. Be sure to include the values of the field, otherwise the whole statistical value will be error

3. View the basic information of the table

customer_grade_df.describe()
#You can find R_S,F_S,M_S is not a floating point number, but a category type 
#Calculation cannot be performed and conversion is required. This is caused by the cut function of pandas library
customer_grade_df.info()

4. Modify the category type to int64

rfm_score_grade=pd.DataFrame(customer_grade_df[['R_S','F_S','M_S']],dtype=np.int64)
rfm_score_grade.dtypes

5. Label the user according to the scoring situation (the same as the mean algorithm)

def rfm_score_func(x):
    level=x.apply(lambda x: '1'if x>=0 else '0')
    label = level.R_S+level.F_S+level.M_S
    d={
        '011':'Important value customers', 
        '111':'Important call back customers',
        '001':'Important deep ploughing customers',
        '101':'Important retention customers',
        '010':'Potential customers',
        '110':'General maintenance customer',
        '000':'New customers',
        '100':'Lost customers'
    }
    result=d[label]
    return result
rfm_score_grade['RFM']=rfm_score_grade[['R_S','F_S','M_S']].apply(lambda x:x-x.mean()).apply(rfm_score_func,axis=1 )
rfm_score_grade

6. Calculate the number of customers of each type

rfm_score_grade.groupby('RFM').count()

7. Drawing

rfm_score_grade.RFM.value_counts().plot.bar(figsize=(20,8),fontsize=16)

Three analogy mean algorithm chart and scoring algorithm chart, it can be seen that the mean error is large

Topics: Data Analysis

Programmer Think

RFM model for data analysis

Hot Topics