ZORA (Zurich Open Repository and Archive)

Workflow analysis of data science code in public GitHub repositories

Ramasamy, Dhivyabharathi; Sarasua, Cristina; Bacchelli, Alberto; Bernstein, Abraham (2022). Workflow analysis of data science code in public GitHub repositories. Empirical Software Engineering, 28:7.

Abstract

Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted dataset and (b) the dataset labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
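The transition analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each notebook is reduced to an ordered list with exactly one step label per code cell (the abstract notes that a cell may carry more than one step), and the label names and the function name `transition_probabilities` are hypothetical. A first-order Markov chain is estimated by counting how often one step is immediately followed by another and normalising each row of counts.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate first-order Markov transition probabilities.

    `sequences` is an iterable of lists, one list per notebook, holding the
    step label of each code cell in order (assumption: one label per cell).
    """
    counts = defaultdict(Counter)
    for steps in sequences:
        # Pair every cell with the cell that directly follows it.
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1

    probs = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        # Normalise the outgoing counts so each row sums to 1.
        probs[current] = {step: n / total for step, n in followers.items()}
    return probs


# Hypothetical labels for two toy notebooks (not data from the paper).
notebooks = [
    ["load", "preprocess", "model", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "explore", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
```

In this toy input, a transition such as `model` back to `preprocess` is the kind of backward step that would indicate an iterative workflow; with real annotations the same table could be inspected row by row.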

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 03 Faculty of Economics > Department of Informatics; 08 Research Priority Programs > Digital Society Initiative
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Scope: Discipline-based scholarship (basic research)
Language: English
Date: 2022
Deposited On: 22 Nov 2022 07:09
Last Modified: 28 Dec 2024 02:37
Publisher: Springer
ISSN: 1382-3256
OA Status: Hybrid
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.1007/s10664-022-10229-z
Official URL: https://link.springer.com/article/10.1007/s10664-022-10229-z
Other Identification Number: merlin-id:22968
Content: Published Version
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)

Statistics

Citations

4 citations in Web of Science®
4 citations in Scopus®

Downloads

75 downloads since deposited on 22 Nov 2022
40 downloads in the past 12 months