The use of spiking neuromorphic sensors with state-of-the-art deep networks is currently an active area of research. Still relatively unexplored are the preprocessing steps needed to transform the spikes from these sensors and the types of network architectures that can achieve high accuracy with them. This paper discusses several methods for preprocessing the spiking data from these sensors for use with various deep network architectures. The outputs of these preprocessing methods are evaluated using different networks, including a deep fusion network composed of Convolutional Neural Networks and Recurrent Neural Networks, to jointly solve a recognition task using the MNIST (visual) and TIDIGITS (audio) benchmark datasets. With only 1000 visual input spikes from a spiking hardware retina, a particular trained fusion network achieves a classification accuracy of 64.5%, which increases to 98.31% when the visual input is combined with inputs from a spiking hardware cochlea.