Despite the increasing use of citation-based metrics for research evaluation purposes, we do not know yet which metrics best deliver on their promise to gauge the significance of a scientific paper or a patent. We assess 17 network-based metrics by their ability to identify milestone papers and patents in three large citation datasets. We find that traditional information-retrieval evaluation metrics are strongly affected by the interplay between the age distribution of the milestone items and age biases of the evaluated metrics. Outcomes of these metrics are therefore not representative of the metrics’ ranking ability. We argue in favor of a modified evaluation procedure that explicitly penalizes biased metrics and allows us to reveal metrics’ performance patterns that are consistent across the datasets. PageRank and LeaderRank turn out to be the best-performing ranking metrics when their age bias is suppressed by a simple transformation of the scores that they produce, whereas other popular metrics, including citation count, HITS and Collective Influence, produce significantly worse ranking results.