The Value of Publicly Available, Textual and Non-textual Information For Startup Performance Prediction

Can publicly available, web-scraped data be used to identify promising business startups at an early stage? To answer this question, we use such textual and non-textual information about the names of Danish firms and their addresses as well as their business purpose statements (BPSs) supplemented by core accounting information along with founder and initial startup characteristics to forecast the performance of newly started enterprises over a five years' time horizon. The performance outcomes we consider are involuntary exit, above-average employment growth, a return on assets of above 20 percent, new patent applications and participation in an innovation subsidy program. Our first key finding is that our models predict startup performance with either high or very high accuracy with the exception of high returns on assets where predictive power remains poor. Our second key finding is that the data requirements for predicting performance outcomes with such accuracy are low. To forecast the two innovation-related performance outcomes well, we only need to include a set of variables derived from the BPS texts while an accurate prediction of startup survival and high employment growth needs the combination of (i) information derived from the names of the startups, (ii) data on elementary founder-related characteristics and (iii) either variables describing the initial characteristics of the startup (to predict startup survival) or business purpose statement information (to predict high employment growth). These sets of variables are easily obtainable since the underlying information is mandatory to report upon business registration. The substantial accuracy of our predictions for survival, employment growth, new patents and participation in innovation subsidy programs indicates ample scope for algorithmic scoring models as an additional pillar of funding and innovation support decisions.


INTRODUCTION
Identifying promising startups is a formidable task for investors, creditors and policy makers alike. Even though each group often has a wealth of information available when deciding about a possible involvement in a particular startup, this information must be processed quickly, which in turn implies that simple heuristics become highly valuable (Baum and Wally 2001; Eisenhardt 1989; Kirsch et al. 2009). In addition, investors and creditors aim at identifying promising startups early and therefore increasingly use algorithmic scoring models (Corea 2018; Diffey 2019; Palmer 2017). More generally, uncertainties in the ex-ante evaluation of business opportunities are fundamental to the theory and the empirical testing of entrepreneurial strategy (Ahuja et al. 2005; Amit et al. 1990; Dencker and Gruber 2015; Nikiforou et al. 2019; Oriani and Sobrero 2008).
That publicly available information can be used to effectively measure entrepreneurial success and hence to reduce uncertainties is demonstrated in seminal work by Guzman and Stern (2015; G/S hereafter). G/S use information on firm and founder names, geographical location as well as an indicator for a startup holding a patent at the time of foundation to measure entrepreneurial success defined as either an IPO or an acquisition at the ZIP-code level. In this paper, we study how well the initial G/S variables in combination with other publicly available and similarly conveniently obtainable data can predict a broad range of performance outcomes: involuntary exit, high employment growth, a return on assets of above 20 percent, new patent applications and, as a more inclusive indicator of innovative activity, participation in an innovation subsidy program.
Our set of performance predictors comprises (i) the initial G/S firm name variables, (ii) an extended set of variables derived from firm names, (iii) basic founder characteristics such as gender and previous founding experience as well as business success, (iv) initial startup characteristics like industry affiliation, initial assets and profits as well as address information and (v) variables generated from firms' business purpose statements (BPSs). BPSs are required by most US states and most European countries as an integral part of the business formation documents. They are, e.g., mandatory for corporations worldwide where they are also referred to as "articles of organization", "articles of incorporation" or "certificate of incorporation".
We base our analysis on the population of Danish firms started as incorporated companies between 2012 and 2014, 55914 firms in total, whose data we web-scrape from government websites. To assess the changes in forecasting accuracy that our extended lists of potential predictors cause, we employ simple logit models for our five performance outcomes and calculate the respective areas under the receiver-operator curve (AUC) as our main measure of prediction accuracy. The AUC is a frequently applied forecast performance statistic for binary firm performance models (Agarwal and Taffler 2008; Åstebro and Winter 2012; Chava and Jarrow 2004). We assess the contribution of each set of explanatory variables since not all data may be publicly available in all countries and since there are differences in their ease of use.
Our key findings are that (i) our models predict all performance outcomes with high accuracy with the exception of high return on assets, (ii) the data needed to generate our precise forecasts are both easily obtainable and straightforward to apply in simple empirical models and (iii) prediction accuracy can be substantially improved by including variables beyond the ones initially suggested by G/S. Predicting our two innovation-related performance indicators with high accuracy only requires the set of variables we derive from the BPSs. Combining the BPS variables with the initial G/S variables even leads to predictions of very high accuracy for new patent applications.
Accurately predicting involuntary exit and high employment growth is more data demanding as both involve the combination of three different sets of variables. A satisfactory prediction of involuntary exit and high employment growth needs the basic G/S variables in combination with the set of founder characteristics. On top of these two sets of variables, predicting involuntary exit involves the set of initial startup characteristics while predicting high employment growth entails the additional inclusion of the BPS-derived set of variables. Importantly, the basic G/S variables, the founder characteristics and the data derived from the BPSs are likely to be easily accessible since they are mandatory to report to the authorities upon business registration. We hence not only demonstrate that it is possible to accurately predict startup success, we also show that the data required to generate such accurate predictions may in fact be readily available from public sources. This is of particular interest given a global trend towards the opening of business register data to the public. Initiatives like the "Open Government Partnership", with its explicit goal to ease the access to public data, are gaining more and more traction and now include 79 countries worldwide (https://www.opengovpartnership.org/). Data sets similar to ours hence are or will soon be available in many other countries.
Our paper unfolds as follows: we first present our data, then introduce our empirical methods, subsequently discuss our empirical results and finally conclude.

DATA
Our core data is generated and collected by the Danish Business Authority (DBA), an administrative unit under the authority of the Danish Ministry of Business. We track all firms started between 2012 and 2014 over a period of five years. The data comprises the universe of 55914 firms registered as limited liability companies (LLCs), joint stock corporations or a new form of an LLC called "iværksætterselskab" (IVS) whose main difference to a standard LLC is that it does not come with capital requirements and hence in effect without liabilities on the part of the owners. The DBA data also provide us with the company names and addresses, NACE Rev. 2 industry codes, starting dates, total assets, profits, the number of employees as well as the names and person identifiers of their founders. In addition, the DBA data contain the BPSs since firms are obliged to report their business purpose as part of their general charters.
Business purpose statements are mandatory by the Danish Law of Corporated Firms which provides firms with substantial leeway in their eventual formulation as there is no word count limit and the BPSs only need to loosely describe a startup's activity. As a consequence, many BPSs are very generic ("The purpose of this firm is to do trading.") while others are very specific. 1 We shall make use of this heterogeneity in our empirical analysis.

Dependent variables
We consider five alternative performance variables: (i) involuntary exit, (ii) high employment growth, (iii) a return on assets of above 20 percent, (iv) at least one patent after foundation and (v) participation in an innovation subsidy program; variables that, except for the last one, are commonly used in management and economics. New business survival is very widely studied (Audretsch and Mahmood 1995; Cassar 2014; Chava and Jarrow 2004; Gimmon and Levie 2010). Visitin and Pittino (2014) as well as Wennberg et al. (2011) consider employment growth as a main performance outcome. Return on assets is considered by Morgan et al. (2009) as well as Cornett and Tehranian (1992) while patents are standard indicators for innovative activity (Blundell et al. 1995; Griliches 1990; Kaiser et al. 2015, 2019). However, not all inventions are patented and not all inventions can be patented (Arundel and Kabla 1998). We therefore consider participation in an innovation subsidy program as an additional and broader indicator of innovative activity. All Danish innovation subsidy programs are competitive and reviewed, which in turn implies that the program sponsors assessed that the applicant firm exceeds the quality threshold for the respective subsidization program.

1 E.g., "The company's purpose is to design and develop, manufacture and assemble switchboards, steering and control boards, PLC/PC/SRO solutions, automation and pre-finished projects for use by fitters, OEM/system manufacturers and the industry in general at a quality and at a price that entails that customers, suppliers and other stakeholders regard the company as an attractive and professional partner."

We measure all performance variables within the first five years after establishment, except for return on assets which we measure within the first three years after foundation due to a substantial increase in missing information over a five-year time horizon: many firms that started in 2014 have not yet submitted their fifth-year financial report. We define involuntary exits as closures due to bankruptcy and compulsory dissolution enforced by the regulatory authorities due to non-compliance with administrative requirements. This definition includes neither voluntary exits nor dissolution after a merger or an acquisition, which would count as business success (Bates 2005; Detienne and Wennberg 2014; G/S). Employment figures are provided in categories of 0, 1, 2-4, 5-9, 10-199 and more than 199 employees. We term startups that increase employment by at least two categories as "high employment growth" businesses since each category implies a doubling of the number of employees. Our final two performance measures refer to innovative performance: patents and participation in an innovation subsidy program. Our patent application data originates in the "PatStat" database provided by the European Patent Office to which researchers at Copenhagen Business School have attached the unique Danish identifiers which allow us to combine our data sets (Kaiser et al. 2015, 2019). It includes all patents filed at the European Patent Office or the World Intellectual Property Organization that involve at least one Danish applicant or inventor.
We have data on the universe of Danish innovation support schemes, collected by the Danish Ministry of Higher Education and Science, at our disposal. 2
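The "high employment growth" definition above can be illustrated with a minimal sketch. The category labels and the function name are hypothetical and chosen for illustration only; the sketch assumes that both the initial and the later size category are observed.

```python
# Employment size categories as reported in the data, in ascending order.
# The string labels are illustrative assumptions, not the original coding.
CATEGORIES = ["0", "1", "2-4", "5-9", "10-199", "200+"]
RANK = {label: i for i, label in enumerate(CATEGORIES)}


def is_high_growth(initial: str, later: str) -> bool:
    """Return True if employment rose by at least two size categories."""
    return RANK[later] - RANK[initial] >= 2
```

Under these assumptions, a startup moving from "1" to "5-9" employees counts as a high employment growth business, while a move from "1" to "2-4" does not.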

Explanatory variables
We relate our five performance variables to our five sets of explanatory variables, i.e. the basic G/S variables, the extended G/S firm name variables, founder characteristics, startup characteristics and BPS information, as well as combinations thereof.
(i) The G/S variables: Our first set of explanatory variables follows Guzman and Stern (2015). Their model to predict startup performance contains dummy variables for (i) the firm name being eponymous (i.e. it reflects one of the founder's names), (ii) the firm name being short or long, (iii) the geographical location appearing in the firm name (specified as a dummy for any geographical location like a city, village or region appearing in the firm name and another dummy for the terms "Denmark", "Danish" or "Dan" in the firm name), (iv) the legal form (dummies for corporations and IVSs with LLCs as base category), (v) the geographic region the firm is residing in and (vi) the startup holding at least one patent at the time of foundation.
(ii) The extended G/S variables: We extend this basic set of variables derived from the firm names by dummy variables for the firm name containing (i) a "proper" word which we define based on the dictionary of Danish words as a proxy for the firm name containing information on what the firm actually does (like "baking", "consulting" or "plumbing"), (ii) the terms "holding", "capital", "invest" or "share" in the firm name to identify holding companies as well as (iii) a female name and (iv) a male name. We in addition include (v) a founder name index since social psychology and economics suggest that person names constitute strong indicators of a person's background (Fryer et al. 2004; Gerhards and Hans 2009; Goldstein and Stecklov 2016; Mehrabian 1997). To account for potential information contained in founder names, we build a "name index" by calculating the name-specific average performance of firms started by founders with a focal given name. We find, for example, that 15 percent of the founders named "Ulrich" face a forced exit within the first five years while this is the case for ten percent of the founders with the given name "Johan". For solo founders named Ulrich this generates an index of 0.15, for founders named Johan it is 0.1. For team foundations we take the averages across the set of founder names.
(iii) The human capital variables: As a third set of variables we employ information on the startups' founders at the time of business foundation. These include dummy variables for (i) at least one founder being a legal entity, (ii) at least one founder having a female first name, (iii) at least one founder having a male first name, (iv) the startup being founded by a team (i.e. more than one person founder), (v) the five number-of-employees categories described above with this information being missing as the comparison group, (vi) one of the founders having previously founded between one and three other firms, (vii) one of the founders having previously founded more than three other firms and (viii) one of the founders having previously experienced an involuntary exit.
Firm size at startup has been shown to be highly correlated with post-entry performance (Arora and Nandkumar 2011; Bonardo et al. 2011; Brüderl et al. 1992).

(iv) The firm characteristics: Our fourth set of explanatory variables concerns itself with the characteristics of the startup. Financial information has long been used as a predictor for business performance (Altman 1984; Brüderl et al. 1992; Dambolena and Khoury 1980; Huyghebaert et al. 2000; Laitinen 1992). We account for total assets and total profits in the first year. Both variables are missing for half of our observations, a "sparsity of data" problem that is very common in big datasets. Following Gelman and Hill (2007), we set the missing values of these explanatory variables to zero and, in order to distinguish genuine zeros from the artificially created ones, introduce an additional dummy for such a replacement having taken place.
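The construction of the name index can be sketched as follows. This is a minimal illustration under stated assumptions rather than the original implementation: the function names and the input layout (pairs of given name and a 0/1 outcome) are hypothetical, and founders whose names were not observed when the rates were computed are simply skipped.

```python
from collections import defaultdict
from statistics import mean


def name_rates(founders):
    """Name-specific mean outcome.

    founders: iterable of (given_name, outcome) pairs, outcome in {0, 1}.
    """
    by_name = defaultdict(list)
    for name, outcome in founders:
        by_name[name].append(outcome)
    return {name: mean(outcomes) for name, outcomes in by_name.items()}


def name_index(team, rates):
    """Average the name-specific rates over the founders of one startup.

    Names without an observed rate are skipped (an assumption); the result
    is None if no founder name is known.
    """
    known = [rates[name] for name in team if name in rates]
    return mean(known) if known else None
```

With rates of 0.15 for "Ulrich" and 0.1 for "Johan", a solo founder named Ulrich receives an index of 0.15 and a team of one Ulrich and one Johan receives 0.125.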
Since information on total assets is missing in all cases where information on profits is missing as well, we only need to include a single indicator for such a replacement having taken place.
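The treatment of missing financial information described above can be sketched as follows. The function and key names are hypothetical, and missing values are assumed to be encoded as None.

```python
def fill_missing(assets, profits):
    """Replace missing financials (None) by zero and flag the replacement.

    A single indicator suffices because assets and profits are missing
    for the same observations.
    """
    missing = assets is None or profits is None
    return {
        "assets": 0.0 if assets is None else assets,
        "profits": 0.0 if profits is None else profits,
        "financials_missing": int(missing),
    }
```

The indicator lets the model distinguish a firm that genuinely reports zero assets or profits from one whose figures were set to zero because they are unavailable.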
We operationalize total profits by using quantile dummies while we take the natural logarithm of total assets.
Our data contains detailed address information and we use this text as data to create indicators for the business history of each address and for the address being shared with other firms.
Specifically, we include a dummy variable for at least one involuntary exit at the respective address as well as another dummy variable for nine or more involuntary exits at the address.
These two dummy variables may serve as proxies for the overall attractiveness of the location and other characteristics associated with the given address. Since many corporations co-reside with their associated holding companies, we control for how many other firms reside at the same address by including dummies for the address being shared by 2-5, 6-10 and more than 10 other firms, with an unshared address being the base category. In addition, we account for the present address having previously been used by 1-5, 6-10, 11-100 and more than 100 firms. To account for differences across sectors (Brüderl et al. 1992; Clarysse et al. 2011), we include a set of NACE Rev. 2 one-digit sector dummy variables. A missing sector classification constitutes our base category. To more precisely account for sectoral heterogeneity without being forced to include a large set of sectoral dummy variables, we include mean industry performance for all five performance indicators, which we calculate on the basis of the Danish Industry Classification that is slightly more detailed than the NACE Rev. 2 four-digit level. 3

(v) The BPS data: Our fifth set of explanatory variables uses the BPS data. Before turning the BPS text data into explanatory variables we remove words and phrases which do not contain information relevant to our analysis, an approach called "stopping" in computational linguistics. Examples of stopwords are "the", "because", "between" or "against". In addition, we "stem" all words in the BPSs. Stemming reduces words to their roots, e.g. the words "automation" and "automated" would both be reduced to their root "autom". We use the dictionary of the Danish Language Authority as our source for stemming. After stopping and stemming we define three subsets of BPS-related variables that relate to BPS complexity, to its specificity or to its very content.
As measures of BPS complexity we consider (i) the "LIX" due to Björnsson (1968) which has found widespread application in text analysis.
It is calculated as the sum of the percentage of words of more than six letters and the average number of words per BPS in our context. The higher the LIX, the higher is the complexity of the text. We in addition use complexity-related variables measuring (ii) mean word length, (iii) BPS length and (iv) dummy variables for the quintiles of the BPS length distribution to put BPS length into perspective. To measure BPS specificity we use counts of how many times a "proper" word in a focal BPS appears in the universe of BPSs. We operationalize these counts as (v) the frequency with which the least common word in a focal BPS appears in the universe of BPSs and (vi) the frequency with which the most common word in a focal BPS appears in the universe of BPSs. We also control for the ratio of these two variables. Similar to our treatment of our firm name information we finally create the following content-related variables: (vii) a dummy for a geographic term appearing in the BPS, (viii) a dummy variable for a male name appearing in the BPS and (ix) a dummy variable for a female name appearing in the BPS. As a final subset of the BPS variables we generate (x) "wordscore indices" that measure the mean "performance" of firms' BPSs for each of our five performance indicators.
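The LIX and the two specificity counts described above can be sketched as follows. The sketch assumes that each BPS is treated as a single sentence, so that the second LIX term is simply the number of words in the statement, and it counts all words rather than only "proper" ones; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter


def lix(words):
    """LIX of one BPS, treating the whole statement as a single sentence."""
    long_share = 100.0 * sum(len(w) > 6 for w in words) / len(words)
    return long_share + len(words)


def corpus_word_counts(corpus):
    """Count how often each word appears across the universe of BPSs."""
    return Counter(word for words in corpus for word in words)


def specificity(words, counts):
    """Corpus frequency of the least and the most common word in one BPS."""
    freqs = [counts[w] for w in words]
    return min(freqs), max(freqs)
```

A low frequency of the least common word indicates that a BPS contains at least one rare, and hence specific, term, whereas a generic BPS consists only of words that appear frequently across all statements.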
The wordscore approach has been developed in political science where it has found widespread application in inferring political positions in text documents on the basis of scores for words derived from documents (Laver et al. 2003). It is perhaps best illustrated by providing an example. A share of 47.4 percent of the startups with the word "discotheque" in their BPSs face an involuntary exit while this is true for 36.4 percent of the startups with the word "delivery" in their BPSs. The wordscore associated with the word "discotheque" is defined as the word's average "performance" and hence is 0.474 while the other wordscore is 0.364. To aggregate the individual wordscores at the firm level, we take the average of the individual wordscores.
Table 1 presents descriptive statistics of our dependent and explanatory variables. It shows that involuntary exits are comparatively rare events with 17.2 percent of the firms in our data involuntarily exiting within five years of operation, a figure that is substantially lower than the 50 percent overall exits reported by e.g. Headd (2003) or Mata and Portugal (1994). In contrast to those studies, however, we focus on involuntary exits as well as firms with a legal form that requires registration and consider the universe of startups instead of merely technology-driven ones. A tenth of the firms in our data generate substantial employment growth while 17 percent achieve a high return on assets. By contrast, participation in an innovation subsidy program and taking out a new patent are both rare events. A mere 1.6 percent of our firms participate in innovation subsidy programs while only 0.3 percent apply for a new patent within their first five years of existence.
More than 40 percent of all startups are founded in the greater Copenhagen capital region, only 0.4 percent of all firms have applied for a patent at the time of foundation, about a third of the startups involve another firm as a founder, 87 percent of the startups are founded by men, more than 89 percent are solo foundations and 46 percent are founded by serial entrepreneurs which compares to a European average of 30 percent and a US average of 13 percent (Plehn-Dujowich 2010). Turning to the information contained in the BPSs, the average LIX is 54 which is considered as "difficult" by Björnsson (1968). The mean word length is 9.4 characters while the average BPS length is 41.3 characters.
The correlations between our explanatory variables are modest with our largest variance inflation factor being 2.56 which is well below the critical value of 10 (Belsley et al. 1980).

Empirical strategy
Our empirical aim is twofold: we want to analyze (i) the degree of accuracy to which publicly available data can be used to forecast business startup performance and (ii) which sets of variables, and combinations thereof, are best at predicting performance since not all variables may be equally easy to obtain. We seek to achieve our goals by successively introducing our five different sets of explanatory variables as well as their combinations in logit performance regressions and by then assessing the out-of-sample prediction accuracy of our specifications. We estimate our models on a 70 percent random sample and retain the remaining 30 percent for prediction, following G/S. We calculate our firm name indices and our BPS wordscores as well as the average industry performance index on the regression sample and extrapolate them to our holdout sample.
Our focus is on the prediction of outcomes and we therefore present the forecasting accuracy statistics only and relegate logit coefficient estimation results for our full models to Appendix A. We apply three different prediction accuracy measures: (i) the AUC, (ii) the log-likelihood value and (iii) the Bayesian Information Criterion (BIC). Our focus is on the AUC as a standard measure of forecast performance of binary models (Hand 2001). It illustrates the performance of a classification model like ours by plotting the rate of true positive outcomes against the rate of false positive outcomes at pre-specified threshold levels (the receiver-operator curve, ROC), deciles in our case as in Cooper et al. (1993). The area under this curve is a measure of predictive accuracy where an AUC of 0.5 suggests no predictive power at all while a value of 1 corresponds to perfect prediction. Bradley (1997) defines a model that corresponds to an AUC of between 0.5 and 0.6 as a "fail", values between 0.6 and 0.7 as "poor", between 0.7 and 0.8 as "fair", between 0.8 and 0.9 as "good" and values above 0.9 as "excellent". In addition, we calculate the percentage changes in the AUC compared to the specification that uses the Guzman/Stern set of variables only. Almost all our models include the basic G/S set of variables which allows us to compare the log-likelihood values of the basic G/S model to the richer models as suggested by standard textbooks (Greene 2017; Wooldridge 2016) as a second prediction accuracy statistic. Our results table displays the percentage change in the log-likelihood statistics compared to the G/S benchmark model which is equal to the relative change in the associated likelihood-ratio test statistics. These test statistics show that the variables added beyond the basic G/S ones are jointly highly statistically significant in all models; i.e., the fuller models have statistically significantly larger explanatory power than the base G/S specification.
This is why we do not provide the p-values in our results table. Our third statistic, which can be used to compare both nested and non-nested models, is the Bayesian Information Criterion (BIC), a statistic frequently used for model selection (Kass and Raftery 1995). Adding additional parameters may lead to overfitting, a problem which the BIC attempts to solve by penalizing extra parameters added to the empirical model. The preferred model is the one with the lowest BIC. Our results table displays the changes in the BIC relative to the basic G/S model along with a categorization of these changes into "not worth more than a mention" (abbreviated in the table by "none"), "positive", "strong" and "very strong" which correspond to changes between 0 and 2, 2 to 6, 6 to 10 and above 10, respectively (Jeffreys 1935; Kass and Raftery 1995).
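The holdout logic described above, applied to the BPS wordscores, can be sketched as follows. This is a minimal illustration, not the original implementation: the function names, the fixed random seed and the decision to skip words that do not appear in the regression sample are all assumptions.

```python
import random
from collections import defaultdict


def split(firms, share=0.7, seed=0):
    """Randomly split firms into a regression sample and a holdout sample."""
    shuffled = list(firms)
    random.Random(seed).shuffle(shuffled)
    cut = round(share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def word_scores(train):
    """Mean outcome of regression-sample firms whose BPS contains each word.

    train: iterable of (words, outcome) pairs with outcome in {0, 1}.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for words, outcome in train:
        for word in set(words):  # count each word once per BPS
            totals[word] += outcome
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in counts}


def firm_score(words, scores):
    """Average the wordscores of one BPS; unseen words are skipped."""
    known = [scores[w] for w in set(words) if w in scores]
    return sum(known) / len(known) if known else None
```

Computing the scores on the regression sample only and then applying them to the holdout firms keeps the outcomes of the holdout firms out of their own predictors, which would otherwise inflate measured out-of-sample accuracy.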
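The accuracy statistics discussed above can be sketched as follows. Note that the AUC below is computed exactly from all pairs of positive and negative observations (the rank-based formulation) rather than from a ROC curve evaluated at decile thresholds as in the text, so it is a swapped-in simplification; the function names are hypothetical, and the BIC uses the standard textbook definition.

```python
import math


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_label(value):
    """Verbal AUC categories as quoted in the text."""
    bounds = [(0.6, "fail"), (0.7, "poor"), (0.8, "fair"), (0.9, "good")]
    for bound, label in bounds:
        if value < bound:
            return label
    return "excellent"


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; the model with the lowest value is preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood
```

The BIC formula makes the penalty explicit: each additional parameter must raise the log-likelihood by more than half the logarithm of the number of observations to lower the criterion.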

Results
Table 2 presents our prediction outcomes. A first striking finding is that the information contained in our BPS-related variables is rich enough to alone predict the two innovation-related outcomes with "good" accuracy. Even "excellent" accuracy is achieved for new patents once the BPS data is combined with both the human capital variables and the basic G/S variables.
An "excellent" predictive performance is not obtained for any other performance outcome.
A second striking result is that all our specifications poorly predict a high return on assets.
Even though combining the initial G/S variables with the firm characteristics and the BPS information leads to a massive improvement in predictive accuracy by 22.2 percent as measured by the AUC, it still remains "poor" with an AUC of 0.686.
Involuntary exit is predicted with "good" accuracy with an AUC of 0.801 when the basic G/S variables are combined with the set of human capital variables and the startup characteristics. Predictive power can be increased by 2.7 percent if the BPS variables are added as well, leading to an AUC of 0.823. Adding even more variable sets does, however, not increase predictive power. Similarly, it also needs the combination of at least three sets of variables, the basic G/S variables, the human capital and the BPS variables, to attain "good" predictive accuracy for high employment growth. Adding variable sets beyond these three actually decreases the out-of-sample AUC, which is consistent with overfitting: the AUC itself does not penalize the number of explanatory variables, but additional parameters estimated on the regression sample need not improve predictions in the holdout sample.
Turning to the changes in the log-likelihood function as alternative prediction accuracy measures, we naturally find that the more sets of variables we include, the larger the log-likelihood values become because logit models maximize the log-likelihood functions without penalizing the number of explanatory variables. These improvements are largest for adding the set of startup characteristics and the set of BPS-related variables to the initial G/S specification, which is a finding that reinforces our initial AUC-based results. Likewise, the changes in the BIC echo these initial results as well since the addition of the set of initial startup characteristics and the BPS-related variables leads to "strong" or "very strong" reductions in the BIC.
To sum up, our models predict startup survival, high employment growth and participation in an innovation subsidy program well. They predict new patents very well but fail to predict high returns on assets with acceptable accuracy. Our results show that it is sufficient to include the BPS-related variables to generate a "good" predictive accuracy for new patents and participation in an innovation subsidy program. To get the "excellent" predictive accuracy for new patents the BPS variables need to be combined with the basic G/S variables and the founder characteristics. An accurate prediction of involuntary exit and high employment growth requires the combination of the three sets of variables where both predictions need the basic G/S variables and the human capital characteristics. Predicting involuntary exit in addition involves the inclusion of the startup characteristics while the high employment growth forecast additionally entails the BPS-generated variables. We hence find that the initial G/S variables, the human capital variables and the BPS-related variables are key contributors to startup success prediction. At the same time, these variables are particularly easy to obtain since they are primarily based on textual information on the names of the startups and their founders and therefore data that is mandatory to report upon business registration.

Robustness checks
Even though all data we use in our analysis is publicly available, not all variables may be easily obtainable in all countries. In addition, not all variables are equally simple to process. The initial G/S variables include information on whether or not a startup has applied for a patent at the time of incorporation while the set of human capital variables includes initial firm size.
Even though information on previous patenting activity is easily gathered via online searches for individual firms, it is very cumbersome to match startups to their corresponding patenting history on a broader scale. Likewise, publicly available information on startups often does not contain information on initial firm size which limits the direct applicability of our startup characteristics variables. Finally, the wordscores constitute important elements of the set of BPS-related variables. Again, while individual BPSs indeed are easily obtainable, processing the universe of BPSs may be more demanding. In our robustness checks we therefore test the extent to which leaving out the information on initial patents, initial firm size and wordscores affects prediction accuracy.
Omitting initial firm size reduces the predictive accuracy for involuntary exit and participation in an innovation subsidy program by 0.28 and 0.03 percentage points, respectively. Not surprisingly given that there is likely to be state dependence in firm size (Audretsch et al. 1999), it is more relevant for high employment growth where the average reduction is 4.65 percentage points. It also matters for applying for at least one new patent where the average reduction is 1.92 percentage points. We nevertheless obtain an "excellent" prediction accuracy for new patents and a "good" prediction accuracy for high employment growth.
Omitting initial patents from the specifications leaves the prediction accuracy for involuntary exit and high employment growth essentially unchanged. The predictive power for new patents only drops by 0.89 percentage points, despite state dependence in patenting activity being well documented (Blundell et al. 1995; Kaiser et al. 2015, 2018). It does matter, however, for predicting participation in an innovation subsidy program where the decrease is 4.83 percentage points on average and where we no longer achieve "good" predictive accuracy with a maximum AUC of 0.799, or 0.13 percentage points short of the "good" category.
Leaving out the three variables that are likely to be harder either to gather or to process hence has very little effect on prediction accuracy overall.
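The logic of these robustness checks can be illustrated with a minimal sketch: fit a logit model on the full set of variables, refit it with one variable omitted, and report the loss in out-of-sample AUC in percentage points. The sketch below is not the code used for our estimations. It runs on synthetic data, all function and variable names (`fit_logit`, `ablation`, `make_data`) are hypothetical, and the plain gradient-ascent fit is a simple stand-in for standard maximum-likelihood logit routines.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logit(X, y, lr=0.5, epochs=200):
    """Fit a logit model by batch gradient ascent on the mean log-likelihood.
    Returns [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, X):
    # Predicted probabilities for each row of X.
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]

def auc(y, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    ties count as one half."""
    pos = [s for s, yi in zip(scores, y) if yi == 1]
    neg = [s for s, yi in zip(scores, y) if yi == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def drop(X, cols):
    # Remove the columns whose indices are in `cols`.
    return [[v for j, v in enumerate(row) if j not in cols] for row in X]

def ablation(X_tr, y_tr, X_te, y_te, omit):
    """Return (full AUC, reduced AUC, loss in percentage points)."""
    full = auc(y_te, predict(fit_logit(X_tr, y_tr), X_te))
    w_red = fit_logit(drop(X_tr, omit), y_tr)
    reduced = auc(y_te, predict(w_red, drop(X_te, omit)))
    return full, reduced, 100.0 * (full - reduced)

def make_data(n, rng):
    # Synthetic data (assumption): column 0 is a strong predictor,
    # column 1 is pure noise, column 2 is a moderate predictor.
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        p = sigmoid(-0.5 + 1.5 * x[0] + 1.0 * x[2])
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y

rng = random.Random(0)
X_tr, y_tr = make_data(600, rng)
X_te, y_te = make_data(400, rng)

for name, omit in [("strong predictor", {0}), ("noise variable", {1})]:
    full, reduced, loss = ablation(X_tr, y_tr, X_te, y_te, omit)
    print(f"omit {name}: AUC {full:.3f} -> {reduced:.3f} ({loss:.2f} pp)")
```

In this stylized setting, omitting the strong predictor lowers the AUC noticeably while omitting the noise variable leaves it roughly unchanged, which mirrors the kind of comparison reported above for initial firm size, initial patents and wordscores.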

CONCLUSIONS
Publicly available data, both textual and non-textual, are becoming easily accessible in most modern economies. We show how such data can be used to predict the expected performance of newly started enterprises with substantial accuracy. Such performance predictions are of great importance to investors, creditors and policy makers alike.
Investors may not only want to assess the prospects of a business that asks for funding; they may also be interested in identifying promising startups before they even apply. Some investors have already embraced "algorithmic scoring" models (Corea 2018; Diffey 2019; Palmer 2017), and our paper indicates that such methods can be used successfully. Even though banks are unlikely to be equally proactive, they too may want to base their debt financing decisions more firmly on objective, data-driven grounds. Finally, policy makers may gain from the improved identification of promising startups in order to better gear innovation support programs towards such firms and to improve the tailoring of startup promotion programs more generally.
For our predictions, we use data on the universe of Danish firms started between 2012 and 2014 to run simple logit regressions to show that key performance outcomes such as survival, employment growth, patenting activity as well as participation in competitive and audited innovation support programs can be predicted with high accuracy using publicly available data alone. Our models essentially only require "text as data" information that startups have to report when they register: startup names, founder identities, addresses and business purpose statements. Even though including hard-to-get or hard-to-process additional information on initial firm size, initial patents and an index of the relatedness of words used in the business purpose statements to aggregate startup performance improves prediction accuracy, such more intricate data is not necessary to forecast startup performance with substantial precision. However, even our most complex model was unable to predict a return on assets of above 20 percent with even modest accuracy.
Our finding that we are able, apart from the outcome variable high return on assets, to forecast startup performance with substantial accuracy using publicly available data alone suggests that there are ample opportunities for the early identification of promising startups.
The fact that we use simple logit models, a standard workhorse in the analysis of binary outcomes, makes our approach applicable to a wide range of users.

The five sets of explanatory variables are (i) the basic G/S variables, (ii) the extended set of G/S variables, (iii) the human capital variables, (iv) the firm characteristics and (v) the BPS variables. "Val." refers to the value of the respective test statistic while ∆ refers to its percentage change relative to the basic G/S model. Changes in the log-likelihood statistic cannot be calculated for the models not including the G/S variables. "Dof" denotes the degrees of freedom of the respective estimation model.