1. Basic Information
- Title: Customer Acquisition via Display Advertising Using Multi-Armed Bandit Experiments
- Authors: Eric M. Schwartz, Eric T. Bradlow, Peter S. Fader
- Journal: Marketing Science, 2017, Vol. 36(4), pp. 500-522
- Main idea: The paper casts customer acquisition through online display advertising as a multi-armed bandit (MAB) problem and proposes a hierarchical, attribute-based, batched MAB policy. Using a large-scale field experiment and counterfactual simulations, it compares this policy with a range of other MAB policies, analyzes how the policies perform in different situations, and discusses its contributions to marketing practice and theory, its limitations, and directions for future research.
The following Python script is a generic illustration of three classic bandit policies on a simulated Bernoulli bandit; it does not implement the paper's TS-HGLM policy.

```python
# Multi-armed bandit (MAB) demo: epsilon-greedy vs. UCB1 vs. Thompson sampling
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # fixed seed so runs are reproducible


class Bandit:
    """Bernoulli bandit: arm k returns 1 with hidden probability probs[k], else 0."""

    def __init__(self, k):
        self.k = k
        self.probs = rng.random(k)

    def pull(self, arm):
        return int(rng.random() < self.probs[arm])


class EpsilonGreedy:
    """Explore a uniformly random arm with probability epsilon; otherwise exploit."""

    def __init__(self, k, epsilon):
        self.k = k
        self.epsilon = epsilon
        self.counts = np.zeros(k)
        self.values = np.zeros(k)  # running mean reward of each arm

    def select_arm(self):
        if rng.random() < self.epsilon:
            return int(rng.integers(self.k))
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class UCB:
    """UCB1: try every arm once, then pick the largest mean + sqrt(2 ln t / n_k)."""

    def __init__(self, k):
        self.k = k
        self.counts = np.zeros(k)
        self.values = np.zeros(k)
        self.total_counts = 0

    def select_arm(self):
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:
            return int(untried[0])
        bonus = np.sqrt(2 * np.log(self.total_counts) / self.counts)
        return int(np.argmax(self.values + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total_counts += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling with a uniform Beta(1, 1) prior on each arm."""

    def __init__(self, k):
        self.k = k
        self.successes = np.zeros(k)
        self.failures = np.zeros(k)

    def select_arm(self):
        # One posterior draw per arm; play the arm whose draw is largest
        samples = rng.beta(self.successes + 1, self.failures + 1)
        return int(np.argmax(samples))

    def update(self, arm, reward):
        if reward == 1:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def run_experiment(bandit, agent, steps):
    """Let the agent play `steps` rounds and return the reward earned at each step."""
    rewards = np.zeros(steps)
    for step in range(steps):
        arm = agent.select_arm()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        rewards[step] = reward
    return rewards


# Parameters
k = 10
steps = 1000

# All three agents face the same bandit (same hidden arm probabilities)
bandit = Bandit(k)

epsilon_greedy_agent = EpsilonGreedy(k, epsilon=0.1)
epsilon_greedy_rewards = run_experiment(bandit, epsilon_greedy_agent, steps)

ucb_agent = UCB(k)
ucb_rewards = run_experiment(bandit, ucb_agent, steps)

thompson_sampling_agent = ThompsonSampling(k)
thompson_sampling_rewards = run_experiment(bandit, thompson_sampling_agent, steps)

# Plot cumulative reward for each policy
plt.figure(figsize=(12, 8))
plt.plot(np.cumsum(epsilon_greedy_rewards), label='Epsilon-Greedy')
plt.plot(np.cumsum(ucb_rewards), label='UCB')
plt.plot(np.cumsum(thompson_sampling_rewards), label='Thompson Sampling')
plt.xlabel('Steps')
plt.ylabel('Cumulative Reward')
plt.legend()
plt.show()
```
2. Research Background and Problem Statement
- Experimentation and the Learn-Earn Trade-off in Marketing
- Firms often face a trade-off between learning and earning in marketing. The traditional approach is to test first and then roll out the winner, but firms should instead mix learning and earning at the same time. Experiments are common in online advertising, yet existing MAB methods cannot fully handle the richness of this setting.
- This paper focuses on the advertiser's resource-allocation problem in online display advertising. It optimizes allocation by learning sequentially about ad performance, tests the effectiveness of the proposed MAB policy in a live randomized controlled trial, and, from a methodological perspective, proposes a new MAB policy.
- Formulating the MAB Problem
- The advertiser's problem is cast as an MAB problem with three distinguishing features: attribute-based actions, batched decisions, and heterogeneity across contexts (websites) in both expected rewards and attribute importance. The policy's objective is to maximize the expected total number of customers acquired; the paper also sets out the relevant notation and concepts.
- Solving the underlying dynamic program exactly suffers from the curse of dimensionality. Some of its conditions can be relaxed, but the assumptions under which an index solution (such as the Gittins index) is optimal do not hold in this application. Thompson sampling (TS) is a flexible, practical MAB method with several theoretical advantages, so the paper combines TS with a hierarchical generalized linear model (HGLM) to build its MAB policy.
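The objective can be written compactly. In generic notation assumed here (not necessarily the paper's symbols), let $m_{jkt}$ be the impressions policy $\pi$ assigns to ad $k$ on website $j$ in batch $t$, $M_{jt}$ the impressions available on website $j$ in that batch, $p_{jk}$ the unknown conversion rate, and $y_{jkt}$ the resulting conversions:

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T}\sum_{j=1}^{J}\sum_{k=1}^{K} y_{jkt}\right]
\quad\text{s.t.}\quad y_{jkt} \mid m_{jkt} \sim \operatorname{Binomial}\!\left(m_{jkt},\, p_{jk}\right),
\qquad \sum_{k=1}^{K} m_{jkt} = M_{jt}.
```

Because the allocation $m_{jkt}$ may depend only on data observed before batch $t$, every batch spent on a weaker ad buys information about $p_{jk}$ at the cost of conversions, which is the learn-earn trade-off in formal terms.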
3. Experimental Design and Data Collection
- Field Experiment Setup
- The field experiment was run with a U.S. financial services company over 62 days and involved multiple ad concepts, ad sizes, and media placements. It produced 532 observational units recording impressions, clicks, and conversions.
- The analysis shows that conversion rates vary more across media placements than across ads, that conversion rates are correlated across placements, and that ad concepts interact with media placements.
- Formulating the Problem as an MAB
- The paper defines variables for ads, impressions, and conversions. Ad conversion rates are unknown and possibly correlated. The MAB problem is attribute-based, decisions are batched, and the importance of ad attributes may differ across websites; a generic generalized linear model (GLM) links ad attributes to conversion rates.
- The optimization problem is to choose the policy that maximizes the expected total number of acquired customers. Solving it as a dynamic program is difficult, which motivates TS as an alternative.
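As a minimal sketch of an attribute-based GLM, assume each ad is a (concept, size) pair coded with an intercept plus dummy main effects, and that each website j has its own coefficient vector beta_j passed through an inverse-logit link. The function names and the coding scheme are hypothetical; the paper's exact specification may differ.

```python
import itertools
import numpy as np


def build_design_matrix(n_concepts, n_sizes):
    """Dummy-code every (concept, size) ad: intercept + concept and size main effects.

    Concept 0 and size 0 are the baselines. Returns (ads, X) with X of shape
    (n_concepts * n_sizes, n_concepts + n_sizes - 1).
    """
    ads = list(itertools.product(range(n_concepts), range(n_sizes)))
    X = np.zeros((len(ads), n_concepts + n_sizes - 1))
    for row, (concept, size) in enumerate(ads):
        X[row, 0] = 1.0                              # intercept
        if concept > 0:
            X[row, concept] = 1.0                    # concept dummy
        if size > 0:
            X[row, n_concepts - 1 + size] = 1.0      # size dummy
    return ads, X


def site_conversion_rates(X, betas):
    """Inverse-logit GLM: p[j, k] = 1 / (1 + exp(-x_k' beta_j)).

    betas holds one row of coefficients per website, so attribute effects
    can differ across websites.
    """
    return 1.0 / (1.0 + np.exp(-(betas @ X.T)))
```

For example, giving concept 1 a positive coefficient on one website and a negative one on another reproduces, in miniature, the concept-by-placement interaction the field data show.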
4. Proposed MAB Policy
- Combining Thompson Sampling with a Hierarchical Generalized Linear Model
- The ad-allocation problem is framed as a hierarchical, attribute-based, batched MAB problem, and the proposed policy combines an HGLM with TS. The HGLM is a logistic regression with website-specific parameters that captures unobserved heterogeneity across websites; TS encodes model uncertainty by sampling from the posterior distribution.
- The paper details the TS-HGLM conversion model and how TS sets each website's ad-allocation probabilities: each ad's allocation weight is the posterior probability that it is the best ad on that website, and this probability is approximated by simulating draws from the posterior.
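A minimal sketch of that simulation step, assuming posterior draws of one website's coefficients are already available (`thompson_allocation` is a hypothetical helper): each ad's allocation weight is the share of draws in which it has the highest conversion rate.

```python
import numpy as np


def thompson_allocation(X, beta_draws):
    """Share of posterior draws in which each ad has the highest conversion rate.

    X:          (K, P) ad-attribute design matrix.
    beta_draws: (S, P) posterior draws of one website's coefficients.
    Returns a length-K vector of allocation weights that sums to 1.
    """
    # The inverse logit is monotone, so the best ad per draw can be read off
    # the linear predictor without transforming to probabilities.
    eta = beta_draws @ X.T                      # (S, K)
    best = np.argmax(eta, axis=1)               # winning ad in each draw
    counts = np.bincount(best, minlength=X.shape[0])
    return counts / beta_draws.shape[0]
```

The next batch's impressions on that website could then be split in proportion to these weights; ads whose superiority is still uncertain keep receiving some traffic, which is how TS keeps exploring.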
- Model Estimation and Updating
- Because the experiment was large-scale and ran in real time, posterior draws were obtained from a Laplace approximation based on restricted maximum likelihood estimation, and at each update the model was re-estimated on all data available so far. The paper also notes that TS is compatible with any model and can be paired with a wide range of model specifications.
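The sketch below illustrates the Laplace idea on a simplified, non-hierarchical binomial-logit model with a Gaussian prior, not the paper's REML-based HGLM fit: Newton's method finds the posterior mode, the inverse of the negative Hessian at the mode serves as the covariance, and posterior draws come from that multivariate normal. The function names and the default prior variance are assumptions.

```python
import numpy as np


def laplace_posterior(X, conversions, impressions, prior_var=10.0, n_iter=50, tol=1e-8):
    """Laplace (Gaussian) approximation to a binomial-logit posterior.

    Prior: beta ~ N(0, prior_var * I). Returns (mode, covariance).
    """
    P = X.shape[1]
    beta = np.zeros(P)
    prior_prec = np.eye(P) / prior_var

    def neg_hessian(b):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        w = impressions * p * (1.0 - p)          # binomial information weights
        return X.T @ (X * w[:, None]) + prior_prec

    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (conversions - impressions * p) - prior_prec @ beta
        step = np.linalg.solve(neg_hessian(beta), grad)   # Newton step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(neg_hessian(beta))


def posterior_draws(mode, cov, n_draws, rng):
    """Sample coefficient vectors from the Gaussian approximation."""
    return rng.multivariate_normal(mode, cov, size=n_draws)
```

Rows of `X` here are aggregated cells (for instance ad by website by period) with their impression and conversion counts; refitting on all rows after each batch mirrors the "re-estimate on all available data" step described above.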
5. Alternative MAB Policies
- Gittins Index
- The paper reviews applications of the Gittins index in marketing and management science; under specific assumptions, it is the optimal solution to the MAB problem. The counterfactual analysis tests several versions of it using a closed-form approximation, and the paper describes its structure.
- Upper Confidence Bound Policies
- The paper introduces the origins and theoretical basis of upper confidence bound (UCB) policies, which are widely studied in reinforcement learning. It describes the UCB1 algorithm and its variant UCB-tuned in detail, along with their applications and extensions under different conditions, including attribute-based problems and different correlation structures among actions.
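For 0/1 outcomes such as conversions, the two indices can be sketched as follows. This follows the standard forms of UCB1 and UCB-tuned, where for Bernoulli rewards the sample variance reduces to mean * (1 - mean); the function names are hypothetical, and every ad is assumed to have at least one impression.

```python
import numpy as np


def ucb1_index(successes, pulls, t):
    """UCB1: empirical mean + sqrt(2 ln t / n_k), with t the total number of pulls."""
    means = successes / pulls
    return means + np.sqrt(2.0 * np.log(t) / pulls)


def ucb_tuned_index(successes, pulls, t):
    """UCB-tuned: the exploration bonus is scaled by min(1/4, variance upper bound)."""
    means = successes / pulls
    # Bernoulli sample variance plus its own confidence term
    v = means * (1.0 - means) + np.sqrt(2.0 * np.log(t) / pulls)
    return means + np.sqrt(np.log(t) / pulls * np.minimum(0.25, v))
```

Since min(1/4, V) never exceeds 1/4, the UCB-tuned bonus is always smaller than the UCB1 bonus, which matters with low conversion rates: the variance term is tiny, so UCB-tuned explores far less.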
- Simpler Heuristics
- The paper also evaluates simpler, less theoretically grounded heuristics: test-rollout policies, the greedy policy, and the epsilon-greedy policy, describing each one's decision rule and characteristics.
6. Field Experiment Results
- Experiment Implementation
- The large-scale MAB field experiment was carried out with the bank and its online media-buying agency. 80% of impressions went to the TS-HGLM policy group, whose allocation was updated roughly once a week; the remaining 20% formed a control group that used a balanced (equal) allocation policy.
- Result Comparison
- Comparing the two groups' overall acquisition rates, the TS-HGLM policy raised the overall acquisition rate by 8% relative to the static balanced design. This corresponds to about 240 additional new customers and a lower cost per acquisition (CPA). The paper also summarizes cumulative conversion rates and click-through rates by ad concept and ad size.
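The headline comparison is simple arithmetic on the two groups' counts. The sketch below (hypothetical function names; no data from the paper) computes the relative lift in acquisition rate, the extra acquisitions relative to applying the control group's rate to the policy group's impressions, and cost per acquisition.

```python
def relative_lift(conv_policy, imp_policy, conv_control, imp_control):
    """Relative improvement of the policy group's acquisition rate over the control's."""
    rate_policy = conv_policy / imp_policy
    rate_control = conv_control / imp_control
    return rate_policy / rate_control - 1.0


def extra_customers(conv_policy, imp_policy, conv_control, imp_control):
    """Policy-group acquisitions minus what the control rate would have produced."""
    return conv_policy - imp_policy * (conv_control / imp_control)


def cost_per_acquisition(spend, conversions):
    """CPA = media spend / number of acquisitions."""
    return spend / conversions
```

If both groups pay the same price per impression, CPA is inversely proportional to the acquisition rate, so under that equal-price assumption an 8% higher rate means a CPA about 1 - 1/1.08 ≈ 7.4% lower.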
7. Replicating the Field Experiment via Simulation
- Simulation Setup
- To capture uncertainty in the performance of the two policies actually implemented (TS-HGLM and balanced), the field experiment was replicated by simulation. "True" conversion rates were defined nonparametrically, under the assumption that a policy changes only how impressions are allocated, not the ads' true conversion rates.
- Simulated conversions were generated as binomial successes given the allocation weights each policy recommended. The main performance metric was the overall conversion rate, and the variability of each policy's performance was quantified with a posterior predictive p-value.
- Result Analysis
- The improvement actually achieved by the implemented TS-HGLM policy would be an outlier in the predictive distribution of the simulated balanced policy, and TS-HGLM outperformed the balanced policy in every simulated world, which supports the validity of the counterfactual simulations.
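A minimal sketch of this simulation logic, assuming the per-unit "true" conversion rates and the impressions a policy assigns to each unit are given as arrays (the paper derives the true rates nonparametrically from the field data; these function names are hypothetical): conversions are drawn as binomial successes, and the p-value is the share of simulated worlds whose overall rate is at least the observed one.

```python
import numpy as np


def simulate_overall_rates(true_rates, impressions, n_sims, rng):
    """Overall conversion rate in each simulated world.

    Conversions in each unit ~ Binomial(impressions[k], true_rates[k]).
    """
    conversions = rng.binomial(impressions, true_rates, size=(n_sims, len(true_rates)))
    return conversions.sum(axis=1) / impressions.sum()


def predictive_p_value(simulated, observed):
    """Share of simulated outcomes at least as large as the observed one."""
    return float(np.mean(simulated >= observed))
```

Feeding in the balanced policy's equal allocation and comparing against the rate TS-HGLM actually achieved gives the kind of outlier check described above: a p-value near zero means the observed result is very unlikely under the balanced policy.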
8. Counterfactual Policy Simulations Based on the Field Experiment Data
- Evaluating the Model Component of the MAB Policy
- A set of MAB policies that all use the TS allocation rule were compared, with models of differing complexity, such as a homogeneous regression (TS-GLM) and a latent-class regression (TS-LCGLM); box plots show the distribution of each policy's performance.
- The results show that partial pooling across websites matters: in average improvement, TS-HGLM beat some pooled TS policies that treat websites as homogeneous, but it was outperformed by the TS-BB-unpooled policy. The paper also discusses the strengths and weaknesses of partially pooled versus unpooled models and how they perform at different sample sizes.
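The pooling trade-off can be illustrated with a deliberately simple Beta-Binomial stand-in for one ad across websites (not the paper's HGLM or latent-class models). Complete pooling gives every website the same rate; no pooling uses each website's own counts under a Beta(1, 1) prior; partial pooling shrinks each website toward the pooled rate through a shared Beta prior whose strength, `prior_strength` in pseudo-impressions, is a hypothetical tuning value.

```python
import numpy as np


def pooled_unpooled_partial(conversions, impressions, prior_strength=1000.0):
    """Posterior-mean conversion rates for one ad across websites, three ways.

    conversions, impressions: length-J arrays, one entry per website.
    Returns (complete_pooling, no_pooling, partial_pooling) arrays.
    """
    pooled_rate = conversions.sum() / impressions.sum()
    complete = np.full(len(conversions), pooled_rate)
    unpooled = (conversions + 1.0) / (impressions + 2.0)     # Beta(1, 1) per site
    # Shared Beta(a, b) prior centered on the pooled rate
    a = prior_strength * pooled_rate
    b = prior_strength * (1.0 - pooled_rate)
    partial = (conversions + a) / (impressions + a + b)
    return complete, unpooled, partial
```

Websites with little data are pulled strongly toward the pooled rate, while data-rich websites keep estimates close to their own observed rates. When every website has plenty of impressions, that shrinkage buys little, which is one way to read why unpooled policies could do well here.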
- Evaluating the Allocation-Rule Component of the MAB Policy
- A series of alternative allocation rules were evaluated: standard heuristics (greedy and epsilon-greedy), policies with known optimality properties (UCB and the Gittins index), and test-rollout policies with initial test periods of different lengths.
- The unpooled versions of the Gittins and UCB policies, as well as the unpooled greedy and epsilon-greedy policies, outperformed the TS-HGLM policy. The unpooled test-rollout policy performed well under certain conditions, and its performance was sensitive to the level at which a winner is declared and to the length of the test period.
- Key findings of the counterfactual policy simulations: the choice of model affects policy performance more than the choice of allocation rule; unpooled policies generally outperform pooled policies; and partially pooled models differ in the variability of their performance.
9. Sensitivity to Changes in the Problem Setting
- Changes in Update Timing, Batch Size, and Incidence Rate
- Rerunning the policies above with daily instead of weekly updates showed that the results are fairly robust to batch size. Smaller batches had a stronger effect on unpooled and winner-take-all policies and a smaller effect on TS-based policies.
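To see how batch size enters, the sketch below runs a generic Beta-Bernoulli Thompson sampler (a simplified stand-in for the paper's TS-HGLM, with a hypothetical function name) that updates only between batches. Smaller `batch_size` values correspond to more frequent updates, such as daily rather than weekly.

```python
import numpy as np


def batched_thompson(true_rates, total_impressions, batch_size, n_draws=1000, seed=0):
    """Beta-Bernoulli Thompson sampling that updates only between batches.

    Each batch is split across ads in proportion to the estimated posterior
    probability that each ad is best. Returns the overall conversion rate.
    """
    rng = np.random.default_rng(seed)
    K = len(true_rates)
    succ = np.zeros(K)
    fail = np.zeros(K)
    total_conv = 0
    remaining = total_impressions
    while remaining > 0:
        m = min(batch_size, remaining)
        # Estimate P(ad k is best) from posterior draws, one column per ad
        draws = rng.beta(succ + 1.0, fail + 1.0, size=(n_draws, K))
        weights = np.bincount(draws.argmax(axis=1), minlength=K) / n_draws
        alloc = rng.multinomial(m, weights)          # impressions per ad
        conv = rng.binomial(alloc, true_rates)       # conversions per ad
        succ += conv
        fail += alloc - conv
        total_conv += conv.sum()
        remaining -= m
    return total_conv / total_impressions
```

Running it with one batch that covers all impressions shows the no-learning baseline, roughly the balanced rate; comparing several moderate batch sizes probes the kind of robustness the paper reports.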
- Changing the Goal: Optimizing Clicks Instead of Conversions
- Reanalyzing the MAB policies with clicks as the outcome showed that optimizing click-through rate yields a 12% lower acquisition rate than optimizing conversions directly, because the ads that convert best are not necessarily the ones that are clicked most. The paper also examines how each policy performs when optimizing clicks and how strongly each is affected.
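The mechanism can be shown with a toy calculation. Assuming every conversion follows a click (a simplification; the names and numbers are hypothetical), the acquisition rate per impression is CTR times the post-click conversion rate, so the ad with the best CTR need not be the ad with the best acquisition rate.

```python
import numpy as np


def best_ads(ctr, conv_given_click):
    """Best ad by click-through rate and by acquisitions per impression.

    Assumes conversions happen only after clicks, so the acquisition rate
    per impression is ctr * conv_given_click.
    Returns (best_by_clicks, best_by_conversions, acquisition_rates).
    """
    acquisition_rate = ctr * conv_given_click
    return int(np.argmax(ctr)), int(np.argmax(acquisition_rate)), acquisition_rate
```

With CTRs of 0.4% and 0.2% and post-click conversion rates of 5% and 20%, the click-optimal ad converts at half the rate of the conversion-optimal one, so a policy chasing clicks would lose half the acquisitions in that toy case.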
10. Discussion and Conclusion
- Research Contributions
- The paper casts the online advertiser's problem in an MAB framework, proposes a policy that combines a hierarchical regression model with a randomized allocation rule, demonstrates its effectiveness in a field experiment, and uses counterfactual simulations to compare the performance of alternative policies, offering guidance for marketing practice.
- Research Limitations and Future Research Directions
- The authors acknowledge limitations of the field experiment and simulations: they do not capture every aspect of the customer acquisition funnel, do not use individual-level data or account for repeated ad exposures, and do not consider a finite time horizon or population size. Future research could address these limitations, for example by incorporating ad attribution modeling, considering dynamic treatment regimes, and studying complex interactions in media buying and planning.