Propensity score matching is increasingly used in the medical literature, yet the choice of matching algorithm, reporting quality, and estimands are often not discussed. We evaluated the impact of propensity score matching algorithms using a recent clinical dataset and three commonly used outcomes. The resulting estimands for different strengths of treatment effect were compared in a neutral comparison study based on a thoroughly designed simulation study. Different algorithms yielded different levels of balance after matching. Full matching, genetic matching with replacement, and nearest neighbor matching with caliper all achieved good balance, but with the caliper more than one fifth of the treated units were discarded. Average marginal treatment effect estimates were least biased with genetic or nearest neighbor matching, both with replacement, and with full matching. Double adjustment consistently yielded conditional treatment effect estimates that were closer to the true values. The choice of matching algorithm affected both covariate balance after matching and treatment effect estimates. Genetic matching with replacement yielded better covariate balance than all other matching algorithms. A literature review of the British Medical Journal and its subjournals revealed frequent use of propensity score matching; however, the use of different matching algorithms before treatment effect estimation was reported in only one of 21 studies. Propensity score matching is a methodology for causal treatment effect estimation from observational data; however, the methodological difficulties and low reporting quality in applied medical research need to be addressed.
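To illustrate why a caliper can discard treated units, the following is a minimal sketch of greedy 1:1 nearest neighbor matching without replacement on already estimated propensity scores. It is not the implementation used in the study: the function name `nn_caliper_match`, the toy scores, and the caliper expressed in absolute propensity score units are all assumptions made for illustration (in practice a caliper is often defined on the logit of the propensity score in standard deviation units).

```python
import numpy as np


def nn_caliper_match(ps_treated, ps_control, caliper):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Hypothetical helper for illustration only. `caliper` is assumed to be
    the maximum allowed absolute difference in propensity score.
    Returns (pairs, discarded): matched (treated_idx, control_idx) pairs
    and the indices of treated units left unmatched.
    """
    ps_treated = np.asarray(ps_treated, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    available = np.ones(len(ps_control), dtype=bool)  # controls not yet used

    pairs, discarded = [], []
    for t, p in enumerate(ps_treated):
        dist = np.abs(ps_control - p)
        dist[~available] = np.inf  # already matched controls are excluded
        c = int(np.argmin(dist)) if dist.size else -1
        if c >= 0 and dist[c] <= caliper:
            pairs.append((t, c))
            available[c] = False
        else:
            # No unused control lies within the caliper: the unit is dropped.
            discarded.append(t)
    return pairs, discarded


# Toy propensity scores (made up for this sketch).
ps_treated = [0.30, 0.55, 0.90]
ps_control = [0.28, 0.50, 0.52, 0.60]

pairs, discarded = nn_caliper_match(ps_treated, ps_control, caliper=0.05)
share_discarded = len(discarded) / len(ps_treated)
print(pairs, discarded, share_discarded)
```

In this toy example the treated unit with score 0.90 has no control within the caliper and is dropped, so the matched sample no longer represents all treated units. This is the kind of trade-off between balance and retained sample that the caliper result above refers to.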