1. Project background and analysis
1) Project background
Online shopping has become an indispensable part of daily life. Using behavior data from the Taobao mobile app, this project analyzes user behavior through relevant indicators in order to uncover user behavior patterns.
2) Data and field description
The data set used in this article covers one month of Taobao App (mobile) user behavior, from 2014-11-18 to 2014-12-18. It contains 12,256,906 records in 6 columns:
- user_id: user identity
- item_id: item id
- behavior_type: user behavior type (including four behaviors of clicking, collecting, adding to shopping cart, and paying, respectively represented by numbers 1, 2, 3, and 4)
- user_geohash: geographic location
- item_category: category id (category to which the product belongs)
- time: the time when the user behavior occurred
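To make the schema concrete, here is a toy frame imitating the six fields above (all values invented) with the numeric behavior codes decoded into readable labels:

```python
import pandas as pd

# Toy rows imitating the schema described above (values are invented)
df = pd.DataFrame({
    "user_id": ["10001", "10001", "10002"],
    "item_id": ["53616", "53616", "87236"],
    "behavior_type": ["1", "3", "4"],
    "user_geohash": [None, None, "9q8yy"],
    "item_category": ["4756", "4756", "1863"],
    "time": ["2014-12-01 10", "2014-12-01 11", "2014-12-02 20"],
})

# Decode behavior codes: 1=click, 2=collect, 3=add to cart, 4=pay
type_dict = {"1": "click", "2": "collect", "3": "add_to_cart", "4": "pay"}
df["behavior"] = df["behavior_type"].map(type_dict)
print(df[["behavior_type", "behavior"]])
```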
3) Dimensions of analysis
- Analysis of traffic indicators
- User Behavior Analysis
- Funnel Churn Analysis
- RFM Analysis of User Value
4) Common analysis methods for e-commerce
5) What is funnel analysis?
“Funnel analysis” is a process-oriented data analysis method: it tracks users through each stage of a flow from start to finish, reflecting the state of user behavior at every step, and it is an important model for measuring the conversion rate at each stage.
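As a minimal illustration of the idea (stage counts invented), the two conversion metrics used later in this article, overall conversion versus the first stage and step-by-step conversion, can be computed like this:

```python
# Hypothetical stage counts for a three-step funnel
stages = {"click": 10000, "collect_or_cart": 500, "pay": 100}

counts = list(stages.values())
# Overall conversion: each stage divided by the first stage
overall = [c / counts[0] for c in counts]
# Step-by-step conversion: each stage divided by the previous stage
stepwise = [1.0] + [counts[i] / counts[i - 1] for i in range(1, len(counts))]
print(overall)   # [1.0, 0.05, 0.01]
print(stepwise)  # [1.0, 0.05, 0.2]
```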
2. Import related libraries
```python
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Set the seaborn drawing style
sns.set(style="darkgrid", font_scale=1.5)
# Display Chinese labels
mpl.rcParams["font.family"] = "SimHei"
# Display the minus sign correctly
mpl.rcParams["axes.unicode_minus"] = False
# Running the code can produce a lot of warning output (e.g. version reminders);
# suppress it with the following line
warnings.filterwarnings("ignore")
```
3. Data preview, data preprocessing
```python
# Note: dtype=str reads all fields as strings
df = pd.read_csv("taobao.csv", dtype=str)
df.shape
df.info()
df.sample(5)
```
The results are as follows:
1) Calculate the missing rate
```python
# The geographic location column has too many missing values to fill in,
# so that column will be deleted first
df.apply(lambda x: sum(x.isnull()) / len(x), axis=0)
```
The results are as follows:
2) Delete the geographic location column
```python
df.drop(["user_geohash"], axis=1, inplace=True)
```
3) Process the time column and split the column into a date column and an hour column
```python
df["date"] = df.time.str[0:-3]
df["hour"] = df.time.str[-2:]
df.sample(5)
```
The results are as follows:
4) Change the time and date columns into standard date format, and change the hour column into int format
```python
df["date"] = pd.to_datetime(df["date"])
df["time"] = pd.to_datetime(df["time"])
df["hour"] = df["hour"].astype(int)
df.dtypes
```
The results are as follows:
5) Arrange the data in ascending order according to the time column
```python
df.sort_values(by="time", ascending=True, inplace=True)
df.head()
```
The results are as follows:
6) Delete the original index and regenerate a new index
```python
df.reset_index(drop=True, inplace=True)
df.head()
```
The results are as follows:
Knowledge point: Pay attention to the use of the parameter drop in reset_index().
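A quick contrast of the two settings of `drop` on a toy frame: with the default, the old index is preserved as a new `index` column; with `drop=True` it is discarded entirely.

```python
import pandas as pd

s = pd.DataFrame({"a": [10, 20, 30]}, index=[7, 3, 5])

kept = s.reset_index()               # old index becomes a new "index" column
dropped = s.reset_index(drop=True)   # old index is discarded entirely

print(kept.columns.tolist())     # ['index', 'a']
print(dropped.columns.tolist())  # ['a']
print(dropped.index.tolist())    # [0, 1, 2]
```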
7) Use the describe() function to view the distribution of the data. Note the use of the include parameter here.
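By default `describe()` only summarizes numeric columns; the `include` parameter widens the report. A small sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["a", "b", "a"], "hour": [10, 21, 10]})

num_stats = df.describe()                    # numeric columns only, by default
all_stats = df.describe(include="all")       # every column, mixed statistics
obj_stats = df.describe(include=["object"])  # string (object) columns only

# Object columns report count / unique / top / freq instead of mean and std
print(obj_stats.loc["unique", "user_id"])    # 2
```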
⑤ Daily ARPPU analysis: it represents the average revenue per paying user. ARPPU = total revenue / APA; since the data set contains no amount field, it is approximated here by [number of payments / number of paying users].
```python
total_custome = df[df["behavior_type"] == "4"].groupby(["date", "user_id"])["behavior_type"].count() \
    .reset_index().rename(columns={"behavior_type": "total"})
total_custome.head()
total_custome2 = total_custome.groupby("date")["total"].sum() / \
    total_custome.groupby("date")["total"].count()
total_custome2.head(10)
# Draw as follows
x = len(total_custome2.index.astype(str))
y = total_custome2.index.astype(str)
plt.plot(total_custome2.values)
plt.xticks(range(0, 30, 7), [y[i] for i in range(0, x, 7)], rotation=90)
plt.title("Per capita consumption per day")
plt.tight_layout()
plt.savefig("Per capita consumption per day", dpi=300)
plt.show()
```
The results are as follows:
The drawing is as follows:
⑥ Daily ARPU analysis: it represents the average revenue per active user. ARPU = total revenue / AU; it is approximated here by [number of payments / number of active users].
```python
df["operation"] = 1
aa = df.groupby(["date", "user_id", "behavior_type"])["operation"].count() \
    .reset_index().rename(columns={"operation": "total"})
aa.head(10)
aa1 = aa.groupby("date").apply(
    lambda x: x[x["behavior_type"] == "4"]["total"].sum() / x["user_id"].nunique()
)
aa1.head(10)
# Draw as follows
x = len(aa1.index.astype(str))
y = aa1.index.astype(str)
plt.plot(aa1.values)
plt.xticks(range(0, 30, 7), [y[i] for i in range(0, x, 7)], rotation=90)
plt.title("Consumption times of active users per day")
plt.tight_layout()
plt.savefig("Consumption times of active users per day", dpi=300)
plt.show()
```
The results are as follows:
The drawing is as follows:
⑦ Payment rate: PUR = APA / AU, approximated here by [number of paying users / number of active users]
```python
rate = aa.groupby("date").apply(
    lambda x: x[x["behavior_type"] == "4"]["total"].count() / x["user_id"].nunique()
)
rate.head(10)
# Draw as follows
x = len(rate.index.astype(str))
y = rate.index.astype(str)
plt.plot(rate.values)
plt.xticks(range(0, 30, 7), [y[i] for i in range(0, x, 7)], rotation=90)
plt.title("Payment rate analysis")
plt.tight_layout()
plt.savefig("Payment rate analysis", dpi=300)
plt.show()
```
The results are as follows:
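Setting the full data set aside, the paying-users / active-users ratio can be sketched on a toy per-day log (values invented; here distinct paying users are counted directly with `nunique`, which gives the same result as counting the per-user payment rows above):

```python
import pandas as pd

# Toy behavior log for one day: one row per (user, behavior) action
aa = pd.DataFrame({
    "date": ["d1"] * 4,
    "user_id": ["u1", "u2", "u3", "u1"],
    "behavior_type": ["1", "1", "4", "4"],
})

# Payment rate per day = number of paying users / number of active users
rate = aa.groupby("date").apply(
    lambda x: x[x["behavior_type"] == "4"]["user_id"].nunique() / x["user_id"].nunique()
)
print(rate["d1"])  # u1 and u3 paid out of 3 active users -> 2/3
```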
⑧ Analysis of repurchase situation (repurchase rate)
```python
# Count distinct purchase dates per paying user; two or more counts as a repurchase
re_buy = df[df["behavior_type"] == "4"].groupby("user_id")["date"].apply(lambda x: x.nunique())
print(len(re_buy))
re_buy[re_buy >= 2].count() / re_buy.count()
```
The results are as follows:
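The `nunique`-based repurchase definition above is easy to verify on a toy payment log (values invented): a user counts as a repurchaser only when they bought on at least two distinct dates.

```python
import pandas as pd

# Toy payment log: u1 buys on two distinct days (one day twice), u2 on one day
buys = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u1"],
    "date": ["2014-12-01", "2014-12-05", "2014-12-03", "2014-12-05"],
})

# Distinct purchase days per user; >= 2 distinct days counts as a repurchase
re_buy = buys.groupby("user_id")["date"].nunique()
repurchase_rate = re_buy[re_buy >= 2].count() / re_buy.count()
print(repurchase_rate)  # 1 repurchaser out of 2 paying users -> 0.5
```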
3) Funnel analysis
```python
df_count = df.groupby("behavior_type").size().reset_index() \
    .rename(columns={"behavior_type": "link", 0: "number"})
type_dict = {"1": "Click", "2": "Collect", "3": "Add to cart", "4": "Pay"}
df_count["link"] = df_count["link"].map(type_dict)
a = df_count.iloc[0]["number"]
b = df_count.iloc[1]["number"]
c = df_count.iloc[2]["number"]
d = df_count.iloc[3]["number"]
funnel = pd.DataFrame({"link": ["Click", "Collect and add to cart", "Pay"],
                       "number": [a, b + c, d]})
funnel["overall conversion rate"] = [i / funnel["number"][0] for i in funnel["number"]]
funnel["single conversion rate"] = np.array([1.0, 2.0, 3.0])
for i in range(0, len(funnel["number"])):
    if i == 0:
        funnel["single conversion rate"][i] = 1.0
    else:
        funnel["single conversion rate"][i] = funnel["number"][i] / funnel["number"][i - 1]

# Draw as follows
import plotly.graph_objs as go

trace = go.Funnel(
    y=["Click", "Collect and add to cart", "Pay"],
    x=[funnel["number"][0], funnel["number"][1], funnel["number"][2]],
    textinfo="value+percent initial",
    marker=dict(color=["deepskyblue", "lightsalmon", "tan"]),
    connector={"line": {"color": "royalblue", "dash": "solid", "width": 3}})
data = [trace]
fig = go.Figure(data)
fig.show()
```
The results are as follows:
The drawing is as follows:
Result analysis: since collecting and adding to the shopping cart both signal purchase intent, and their relative order is not fixed, we merge them into a single stage. The funnel chart and funnel table show that the conversion rate from browsing to purchase intent (collect + add to cart) is only 5%, and the conversion rate from browsing to actual purchase is only 1%. Looking at the single-step conversion rate, however, conversion from purchase intent to actual purchase reaches 20%. This indicates that the stage from browsing to collecting/adding to cart is the key link for improving the metrics.
4) Customer value analysis (RFM analysis)
```python
from datetime import datetime

# Days since the last purchase (R)
recent_buy = df[df["behavior_type"] == "4"].groupby("user_id")["date"] \
    .apply(lambda x: datetime(2014, 12, 20) - x.sort_values().iloc[-1]).reset_index() \
    .rename(columns={"date": "recent"})
recent_buy["recent"] = recent_buy["recent"].apply(lambda x: x.days)
recent_buy[:10]
# Number of purchases (F)
buy_freq = df[df["behavior_type"] == "4"].groupby("user_id")["date"].count().reset_index() \
    .rename(columns={"date": "freq"})
buy_freq[:10]
# Merge the two columns
rfm = pd.merge(recent_buy, buy_freq, on="user_id")
rfm[:10]
# Score R and F: smaller recency scores higher, larger frequency scores higher
r_bins = [0, 5, 10, 15, 20, 50]
f_bins = [1, 30, 60, 90, 120, 900]
rfm["r_score"] = pd.cut(rfm["recent"], bins=r_bins, labels=[5, 4, 3, 2, 1], right=False)
rfm["f_score"] = pd.cut(rfm["freq"], bins=f_bins, labels=[1, 2, 3, 4, 5], right=False)
for i in ["r_score", "f_score"]:
    rfm[i] = rfm[i].astype(float)
rfm.describe()
# Compare each score with its mean (values taken from describe() above)
rfm["r"] = np.where(rfm["r_score"] > 3.943957, "high", "low")
rfm["f"] = np.where(rfm["f_score"] > 1.133356, "high", "low")
# Concatenate the r and f flags
rfm["value"] = rfm["r"].str[:] + rfm["f"].str[:]
rfm.head()


# Label users by their r/f combination
def trans_labels(x):
    if x == "highhigh":
        return "Important value customer"
    elif x == "lowhigh":
        return "Important recall customer"
    elif x == "highlow":
        return "Important deepening customer"
    else:  # "lowlow"
        return "Important retention customer"


rfm["labels"] = rfm["value"].apply(trans_labels)
# Count the number of users for each label
rfm["labels"].value_counts()
```
The results are as follows:
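The pd.cut-based R/F scoring above can be traced end-to-end on three invented users (the bins follow the code above; the mean is computed from the toy data instead of the hard-coded thresholds, which came from describe() on the full data set):

```python
import pandas as pd
import numpy as np

# Three invented users: days since last purchase (R) and purchase count (F)
rfm = pd.DataFrame({"recent": [2, 12, 30], "freq": [5, 70, 200]})

# Smaller recency -> higher R score; larger frequency -> higher F score
r_bins = [0, 5, 10, 15, 20, 50]
f_bins = [1, 30, 60, 90, 120, 900]
rfm["r_score"] = pd.cut(rfm["recent"], bins=r_bins, labels=[5, 4, 3, 2, 1], right=False).astype(float)
rfm["f_score"] = pd.cut(rfm["freq"], bins=f_bins, labels=[1, 2, 3, 4, 5], right=False).astype(float)

# Split at the mean of each score, then combine into a two-flag segment key
rfm["r"] = np.where(rfm["r_score"] > rfm["r_score"].mean(), "high", "low")
rfm["f"] = np.where(rfm["f_score"] > rfm["f_score"].mean(), "high", "low")
rfm["value"] = rfm["r"] + rfm["f"]
print(rfm["value"].tolist())  # ['highlow', 'lowlow', 'lowhigh']
```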