Abstract
Parsing complex acoustic scenes involves an intricate interplay between bottom-up, stimulus-driven cues arising from salient elements in the scene and top-down, goal-directed mechanisms that shift attention to particular parts of it. Here, we present a framework for exploring the interaction between these two processes in a simulated cocktail party setting. The model shows improved digit recognition in a multi-talker environment when given the goal of tracking the source uttering the highest-valued digit. This work highlights the relevance of both data-driven and goal-driven processes in tackling real-world multi-talker, multi-source sound analysis.