Abstract
To understand how children cope with the enormous variation in linguistic structures worldwide, developmental paths need to be studied in a sufficiently varied sample of languages. Because each study requires very large and expensive longitudinal corpora (about one million words, five to seven years of development), the relevant sample must be chosen strategically. We propose to base the choice on the results of a clustering algorithm (fuzzy clustering) applied to typological databases. The algorithm establishes a sample that maximizes the typological differences between languages. As a case study, we apply the algorithm to a dozen typological variables known to have an impact on acquisition, concerning such issues as the presence and nature of agreement and case marking, word order, degrees of synthesis, polyexponence and inflectional compactness of categories, syncretism, and the existence of inflectional classes. The results make it possible to derive small samples that are maximally diverse. As a side result, we also note that while the clustering algorithm allows diversity to be maximized for sampling purposes, the resulting clusters themselves are far from discrete and therefore do not reflect a natural partition into basic language types.
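The sampling idea can be illustrated with a minimal sketch. It assumes plain fuzzy c-means as the fuzzy clustering variant and a simple selection rule (one language per cluster, the one with the highest membership); the abstract does not specify either, so both are assumptions. The function names `fuzzy_c_means` and `pick_sample`, the language labels, and the feature codings are hypothetical and do not come from any typological database.

```python
import math
import random


def fuzzy_c_means(rows, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Plain fuzzy c-means on a list of numeric feature vectors.

    Returns (centers, memberships). memberships[i][k] is the degree to which
    row i belongs to cluster k; each row of memberships sums to 1.
    """
    rng = random.Random(seed)
    n, dim = len(rows), len(rows[0])
    # Random initial memberships, normalized per row.
    u = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(raw)
        u.append([v / total for v in raw])
    centers = []
    for _ in range(n_iter):
        # Each center is the membership-weighted mean of all rows.
        centers = []
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            wsum = sum(w)
            centers.append(
                [sum(w[i] * rows[i][j] for i in range(n)) / wsum for j in range(dim)]
            )
        # Memberships decrease with distance to each center.
        shift = 0.0
        for i in range(n):
            inv = [
                max(math.dist(rows[i], c), 1e-12) ** (-2.0 / (m - 1.0))
                for c in centers
            ]
            total = sum(inv)
            new = [v / total for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:
            break
    return centers, u


def pick_sample(memberships, names):
    """Assumed selection rule: per cluster, take the highest-membership language."""
    n_clusters = len(memberships[0])
    return [
        names[max(range(len(names)), key=lambda i: memberships[i][k])]
        for k in range(n_clusters)
    ]


# Hypothetical toy codings (1 = feature present); not real typological data.
names = ["lang_A", "lang_B", "lang_C", "lang_D"]
features = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
centers, memberships = fuzzy_c_means(features, n_clusters=2)
sample = pick_sample(memberships, names)
```

In a sketch like this, the per-language maximum membership (`max(row)` for each row of `memberships`) gives a rough indication of how crisp the clusters are: values well below 1 would correspond to the non-discrete clusters mentioned above.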