[Context and motivation] Traditionally, requirements are documented in natural language text. However, several approaches promote the use of rich-media requirements descriptions. In addition to text, such multimodal requirements can be enriched with images, audio, or even video. [Question/Problem] The transcription and automated analysis of multimodal information is an important open question that has not yet been sufficiently addressed by the Requirements Engineering (RE) community. Therefore, in this research preview paper we sketch how we plan to tackle the research challenges of multimodal requirements analysis, focusing in particular on automating the analysis process. [Principal idea/results] In our recent research we have started to gather and manually analyze multimodal requirements. Furthermore, we have worked on concepts that provide initial support for the analysis of multimodal information. The purpose of the planned research is to combine and extend our recent work into an approach that supports the automatic analysis of multimodal requirements. [Contribution] In this paper we give a preview of the planned work. We present our research goal, discuss research challenges, and outline an early conceptual solution.