Rapid incidence estimation from SARS-CoV-2 genomes reveals decreased case detection in Europe during summer 2020

Abstract

By October 2021, 230 million SARS-CoV-2 diagnoses had been reported. Yet, a considerable proportion of cases remains undetected. Here, we propose GInPipe, a method that rapidly reconstructs SARS-CoV-2 incidence profiles solely from publicly available, time-stamped viral genomes. We validate GInPipe against simulated outbreaks and elaborate phylodynamic analyses. Using available sequence data, we reconstruct incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia) and demonstrate how to use the method to investigate the effects of changing testing policies on case ascertainment. Specifically, we find that under-reporting was highest during summer 2020 in Europe, coinciding with more liberal testing policies at times of low testing capacities. Due to the increased use of real-time sequencing, it is envisaged that GInPipe can complement established surveillance tools to monitor the SARS-CoV-2 pandemic. In post-pandemic times, when diagnostic efforts are decreasing, GInPipe may facilitate the detection of hidden infection dynamics.


Introduction

As of August 2021, the global SARS-CoV-2 pandemic is still ongoing in most parts of the world, with 205 million reported cases worldwide. Novel vaccines of high efficacy have been developed within a year of the outbreak^1,2. At the time of writing, ~30% of the world's population had already received at least one vaccination and 15.8% were fully vaccinated. However, the distribution of vaccines is uneven and achieving global herd immunity may pose an extremely difficult, long-term task^3,4. At the same time, novel variants of concern (VOC) have emerged in high-prevalence regions^5,6, which may be able to reinfect individuals^7,8 and escape vaccine-elicited immune responses^9,10,11. For example, Manaus, Brazil, witnessed a massive second wave of infections¹², despite the fact that ~80% of its population had already experienced an infection at the onset of the second wave⁵.

Because of the evolutionary versatility of SARS-CoV-2 and difficulties in global vaccine distribution, some experts expect that the virus may not be eliminated globally¹³. Even without adaptation to vaccines in the future, it has been postulated that SARS-CoV-2 may resurge^14,15 and surveillance may have to be maintained into the mid 2020s to monitor virus spread and evolution¹⁴.

Currently, the gold standard of SARS-CoV-2 surveillance is diagnostic testing via polymerase chain reaction (PCR) or antigen-based rapid diagnostic testing (RDT). Diagnostic test results currently define infection case reports, which are used to survey epidemiological dynamics and to define thresholds for travel bans and non-pharmaceutical measures. Inevitably, case reporting data are affected by test coverage, which changes when testing policies are adapted. While RDT enables point-of-care diagnosis and is less costly than PCR testing^16,17, gathering and reporting of test results still requires a sophisticated infrastructure, which is difficult to establish and maintain in many developing countries¹⁸. Independent and complementary sources of information, such as social media reports^19,20 or wastewater analysis^21,22, have been used early on to complement our knowledge of the pandemic dynamics. In addition, many regions of the world sequence SARS-CoV-2 genomes to track virus evolution and the emergence of VOC. The gathered viral sequences are regularly provided to public databases, such as GISAID^23,24. These genetic data readily hold information about the pandemic trajectory. In this work, we take advantage of the fact that the speed at which SARS-CoV-2 evolves on the population level contains information about the number of individuals who are actively infected.

In the vast majority of cases, SARS-CoV-2 is transmitted within a very short period, only days after infection^25,26. The consequence is a well-defined duration of intra-patient evolutionary time before transmission. Thus, the number of actively infected individuals is correlated with the rate of divergence of the viral population, which constitutes an evolutionary signal.

In this article, we introduce the computational pipeline GInPipe, which uses time-stamped sequencing data alone, extracts the evolutionary signal and reconstructs SARS-CoV-2 incidence histories. The approach is inspired by recent work by Khatri and Burt²⁷, who derived a simple function for the estimation of the current effective population size. Herein, due to the short window of transmission, we anticipate that the effective population size may strongly correlate with the incidence of SARS-CoV-2. We adapt the function derived in ref. ²⁷ and embed it into an automatic computational pipeline (GInPipe) that reconstructs the time course of an incidence correlate ϕ merely from SARS-CoV-2 genetic data. GInPipe is validated threefold and performs robustly: (i) against in silico generated outbreak data, (ii) against phylodynamic analyses and (iii) in comparison with case reporting data. We applied the method to SARS-CoV-2 sequencing data from Denmark, Scotland, Switzerland, and the Australian state Victoria to reconstruct their respective incidence histories. Lastly, we utilize the inferred epidemic trajectories to compute changes in the probability that an infected individual is reported and highlight how this probability is affected by changes in testing policies.

Results

Incidence reconstruction

An outline of GInPipe for SARS-CoV-2 incidence reconstruction is shown in Fig. 1a–c. After compiling a set of time-stamped, full-length SARS-CoV-2 genomes, the sequences are assigned to consecutive subsets according to their sampling dates (temporal bins) (Fig. 1a). For each temporal bin b, we compute the number of sequences different from a reference (mutant sequences m_b), as well as the number of unique sequences (haplotypes h_b). These two inputs are used to infer the incidence correlate ϕ_b (Fig. 1b). The ϕ_b point estimates are smoothed to derive a reconstructed incidence history along the time axis (Fig. 1c). The reconstructed incidence correlates can then be used as a basis to estimate the effective reproduction number R_e, as well as the relative case detection rate as outlined below.
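The binning and counting step described above can be illustrated with a minimal sketch. It assumes fixed-width calendar bins, pre-aligned sequences of equal length, and that the reference sequence itself counts as a haplotype when present; the function name `bin_counts` and the toy data are hypothetical. The sketch stops at m_b and h_b and does not reproduce the function that maps them to ϕ_b, which is given in the "Methods" section.

```python
from collections import defaultdict
from datetime import date


def bin_counts(records, reference, bin_days=7):
    """Group (sampling_date, sequence) pairs into consecutive temporal bins.

    Returns {bin_index: (n_sequences, m_b, h_b)} where m_b is the number of
    sequences that differ from the reference and h_b is the number of unique
    sequences (haplotypes) in the bin.
    """
    records = list(records)
    start = min(d for d, _ in records)
    bins = defaultdict(list)
    for d, seq in records:
        bins[(d - start).days // bin_days].append(seq)

    out = {}
    for b, seqs in sorted(bins.items()):
        m_b = sum(1 for s in seqs if s != reference)  # mutant sequences
        h_b = len(set(seqs))                          # haplotypes
        out[b] = (len(seqs), m_b, h_b)
    return out


# Toy data: five short "genomes" sampled over two weeks.
reference = "ACGTACGT"
records = [
    (date(2020, 3, 1), "ACGTACGT"),
    (date(2020, 3, 2), "ACGTACGA"),
    (date(2020, 3, 3), "ACGTACGA"),
    (date(2020, 3, 9), "ACGAACGA"),
    (date(2020, 3, 10), "ACGTACGT"),
]
counts = bin_counts(records, reference)
for b, (n, m_b, h_b) in counts.items():
    print(f"bin {b}: n={n}, m_b={m_b}, h_b={h_b}")
```

Keeping the per-bin sequence count alongside m_b and h_b makes it easy to spot sparsely sampled bins before any smoothing is applied.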

**Fig. 1: Reconstruction of incidence histories using the proposed method.**

Method validation: in silico experiment

To confirm that GInPipe is able to reconstruct incidence histories, we performed an in silico experiment. We considered a population of N(t) infected individuals at time t that stochastically generate N(t + 1) infected individuals in the next time step t + 1. Each individual is associated with a virus sequence, which can mutate randomly. Individuals can be removed (the associated sequence is removed), or they transmit their virus (the associated virus is copied over). We record the number of infected individuals per generation, as well as all sequences of the currently circulating viruses. We then use the simulated viral sequences to infer ϕ(t) and reconstruct the incidence history, as presented in Fig. 1d, e.
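A toy version of such a simulation might look as follows. This is only a sketch of the general idea, not the simulation used in the study: the offspring model (two transmission attempts per individual, each succeeding with probability r/2, so r must not exceed 2), the per-site mutation probability `mu`, the sequence length, and all names are assumptions made for illustration.

```python
import random


def mutate(seq, mu, rng):
    """Copy a sequence, replacing each site with a different base with probability mu."""
    return "".join(
        rng.choice([b for b in "ACGT" if b != base]) if rng.random() < mu else base
        for base in seq
    )


def simulate_outbreak(n0, r_per_gen, seq_len=100, mu=0.01, seed=1):
    """Simulate N(t) -> N(t+1) with one viral sequence per infected individual.

    r_per_gen[t] is the expected number of offspring per individual in
    generation t (0 <= r <= 2 in this sketch). Returns the population size
    and the circulating sequences recorded at the start of each generation.
    """
    rng = random.Random(seed)
    population = ["A" * seq_len] * n0  # everyone starts with the reference
    sizes, snapshots = [], []
    for r in r_per_gen:
        sizes.append(len(population))
        snapshots.append(list(population))
        next_gen = []
        for seq in population:
            # Zero successful attempts means the individual is removed
            # without passing on its virus.
            for _ in range(2):
                if rng.random() < r / 2:
                    next_gen.append(mutate(seq, mu, rng))
        population = next_gen
    return sizes, snapshots


# A growth phase followed by a decline phase.
sizes, snapshots = simulate_outbreak(50, [1.3] * 5 + [0.8] * 5, seed=7)
print(sizes)
```

The recorded `snapshots` play the role of the time-stamped sequence sets from which the incidence correlate would then be inferred.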

In Fig. 1d, we compare one trajectory of simulated population sizes with the reconstructed incidence histories. The simulated outbreak (red line, right axis) consists of two waves of increasing magnitude. GInPipe robustly reconstructs these dynamics (blue lines and dots, left axis), although the incidence correlate ϕ(t) is on a different scale, implying a linear correlation to the number of infected individuals. To assess this correlation, we performed 10 stochastic simulations and compared the ϕ(t) point estimates with the corresponding number of infected individuals (Fig. 1e). We observed a strong (Pearson correlation coefficient of r = 0.98) and highly significant (p < 10⁻¹⁶) linear relationship between the number of infected individuals N(t) and the method's incidence correlate ϕ(t).

GInPipe also allows the effective reproduction number R_e to be inferred from the incidence correlates ϕ(t) (details in the "Methods" section). To further assess the accuracy of GInPipe, we compare the \(R_e^{\phi}\) values inferred from the smoothed ϕ estimates with the \(R_e^{\mathrm{true}}\) values calculated from the simulated pandemic N_true. Figure 1f shows the identity plot for \(\log(R_e^{\mathrm{true}})\) vs. \(\log(R_e^{\phi})\), with the respective proportion of qualitatively agreeing or disagreeing predictions in the four quadrants: the top right and bottom left quadrants represent the true positive (TP) and true negative (TN) estimates, and the top left and bottom right quadrants show the false positive (FP) and false negative (FN) estimates, respectively. The qualitative accuracy of GInPipe based on the R_e values was calculated as \(\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}\), yielding a value of 0.92. In terms of quantitative agreement of the R_e estimates, the coefficient of determination was R² = 0.77.
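The quadrant-based accuracy can be computed along these lines. In this sketch a "positive" is taken to mean growth (log R_e > 0), values of exactly R_e = 1 are counted as non-growth, and the function name and example numbers are made up for illustration.

```python
import math


def qualitative_accuracy(r_true, r_est):
    """Fraction of time points where true and estimated R_e agree on growth vs. decline."""
    tp = tn = fp = fn = 0
    for rt, re_ in zip(r_true, r_est):
        grow_true = math.log(rt) > 0
        grow_est = math.log(re_) > 0
        if grow_true and grow_est:
            tp += 1  # both indicate growth
        elif not grow_true and not grow_est:
            tn += 1  # both indicate decline
        elif grow_est:
            fp += 1  # estimate says growth, truth says decline
        else:
            fn += 1  # estimate says decline, truth says growth
    return (tp + tn) / (tp + tn + fp + fn)


r_true = [1.2, 0.8, 1.5, 0.9, 1.1]
r_est = [1.1, 0.7, 0.9, 1.2, 1.3]
accuracy = qualitative_accuracy(r_true, r_est)
print(accuracy)
```

Here three of the five pairs fall on the same side of R_e = 1, so the qualitative accuracy of the toy example is 3/5.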

While these simulations represent idealized scenarios, in Supplementary Note 1 we thoroughly evaluated the robustness of GInPipe to incomplete and sparse data sets, to unbalanced and temporally changing sampling rates, and to the introduction of unrelated sequence variants, as well as its ability to reconstruct non-smooth pandemic dynamics and its sensitivity to changes in the pathogen mutation rate and selective pressure.

Our analyses showed that the method can still reliably reconstruct incidence histories over time when data are missing, or when the sampling rate changes over time. In scenarios of extreme under-sampling, the ϕ point estimates tend to yield lower values. However, through the smoothing step the reconstructed incidence trajectories still follow the overall population dynamics (Supplementary Note 1, section SN.1.7). If the sampling changes the evolutionary signal, for example by sampling sequences based on their similarity (and hence lowering the signal), the incidence correlates tend to decrease (Supplementary Note 1, section SN.1.9). If the sampling strategy does not change over the course of the pandemic, GInPipe can still reconstruct the overall population dynamic. However, with changing sampling strategies that perturb the evolutionary signal, difficulties with incidence reconstruction may arise. Therefore, as with, e.g., phylodynamic methods, a consistent strategy for drawing representative samples is expected to safeguard GInPipe's performance. We found that selective pressure has no effect on the incidence reconstruction with GInPipe (Supplementary Note 1, section SN.1.14).

If mutation rates become too low, which may be the case for other respiratory infections, and hence not enough signal is given in the data, GInPipe becomes less accurate, but the incidence can still be reconstructed at the cost of time-resolution (Supplementary Note 1, section SN.1.15).

Finally, we evaluated whether introductions of foreign sequences affect the reconstruction of incidence histories. Even for extreme and unrealistic cases, a stable reconstruction of the underlying dynamic is possible. Yet, a tendency of overestimation can be observed if the introduced sequences constitute more than 10% of the data set and if they do not continue to contribute to the pandemic after their introduction (Supplementary Note 1, section SN.1.12).

Method validation: phylodynamics

Phylodynamic methods combine phylogeny reconstruction with epidemic models. For example, the piecewise constant birth-death sampling process (BDSKY)²⁸, implemented in BEAST2²⁹, allows the reconstruction of the effective reproduction numbers R_e(τ) for given time periods τ.

We conducted phylodynamic analyses of SARS-CoV-2 sequence data from Denmark, Scotland, Switzerland, and the Australian state Victoria. In analyzing the data we assumed that \(R_e^{\mathrm{BEAST}}(\tau)\) was piecewise constant in between major changes in SARS-CoV-2 non-pharmaceutical interventions (intervals stated in Supplementary Note 2). We then used BEAST2 to estimate \(R_e^{\mathrm{BEAST}}(\tau)\) alongside the tree reconstructions.

In parallel, we estimated corresponding effective reproduction numbers \(R_e^{\phi}(t)\) by applying the Wallinga–Teunis method³⁰ to incidence correlates ϕ derived by GInPipe. For both methods, we used publicly available full-length SARS-CoV-2 sequencing data from GISAID^23,24 (Supplementary Note 4).
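For orientation, a compact sketch of a Wallinga–Teunis-type estimator on a daily incidence series is given below. It uses the common aggregated form in which each case on day s is attributed to earlier days in proportion to the incidence on those days weighted by the generation-interval distribution. The weights `w` are purely hypothetical, and the generation-time distribution and implementation details actually used in the study are those described in the "Methods" section, not the ones shown here.

```python
def wallinga_teunis(incidence, w):
    """Case reproduction number per day from a daily incidence series.

    incidence[t]: (proportional) number of new infections on day t.
    w[k]: probability that the generation interval is k + 1 days.
    """
    T = len(incidence)
    # denom[s]: total infection pressure on day s exerted by all earlier days.
    denom = [
        sum(incidence[s - k - 1] * w[k] for k in range(min(s, len(w))))
        for s in range(T)
    ]
    r = []
    for t in range(T):
        total = 0.0
        for k in range(len(w)):
            s = t + k + 1
            if s < T and denom[s] > 0:
                # Expected number of day-s cases infected by one day-t case.
                total += incidence[s] * w[k] / denom[s]
        r.append(total)
    return r


w = [0.2, 0.5, 0.3]                       # hypothetical generation-interval weights
phi_flat = [10.0] * 30                    # constant incidence correlate
phi_grow = [1.1 ** t for t in range(30)]  # exponentially growing correlate
r_flat = wallinga_teunis(phi_flat, w)
r_grow = wallinga_teunis(phi_grow, w)
print(round(r_flat[10], 3), round(r_grow[10], 3))
```

Because the estimator only involves ratios of incidence values, multiplying the whole series by a constant leaves the result unchanged, which is why a quantity that is merely proportional to incidence is sufficient as input. Estimates for the last few days are biased downwards in this simple form, since infections caused after the end of the series are not yet observed.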

Results of both methods are presented in Fig. 2. Overall, both methods show congruent trends for the analyzed countries, when comparing the piecewise constant \(R_e^{\mathrm{BEAST}}(\tau)\) from phylodynamic analysis with the median daily \(R_e^{\phi}(t)\) for the same interval. Notably, GInPipe allows for a much finer time resolution (daily R_e estimates) than the piecewise constant R_e estimates on pre-defined intervals obtained from the phylodynamic analysis.

**Fig. 2: Effective reproduction number R_e estimates using the proposed method (ϕ) and phylodynamics (BEAST2).**

For Denmark, the first interval spans the decline in the number of infections after the first wave (end of April to mid June). Consequently, we observe R_e(τ) < 1 using both methods. For the next intervals, the median or piecewise constant R_e(τ) is predicted to be around, or slightly larger than, one. However, GInPipe reconstructs a number of peaks in the daily \(R_e^{\phi}(t)\) estimates, most pronounced in August, coinciding with the summer holidays in Europe. In the interval from November to mid December the estimates deviate slightly, with a larger median estimate from BEAST2; however, both interval estimates are predicted to be R_e(t) > 1 and the confidence intervals overlap entirely.

The GInPipe R_e(τ) estimates for Scotland are within 20% of the corresponding BEAST2 estimates, where GInPipe again allows for a much finer time resolution. Once again, we see a peak in the summer (August–September 2020), coinciding with the summer holidays in Europe. For the last interval (from December 2020) both methods show a median R_e(t) > 1, again with a slightly higher median BEAST2 estimate, coinciding with the second wave of infections.

For Switzerland, the estimates disagree slightly, particularly in the first interval (mid March to mid May), which spans both sides of the peak number of infections during the first wave. Although both methods predict a median R_e(τ) < 1, the absolute value differs in magnitude between the two methods, with BEAST2 estimating a much lower value. The lower estimate from the BEAST2 analysis in the first interval may be explained by the approximation of transmission clusters, which results in the reconstruction of a relatively high number of transmission events, many of which may have occurred outside Switzerland (Supplementary Note 2, Fig. SN.29 therein, tree B.1). In the daily estimates, we see a transition from \(R_e^{\phi}(t) > 1\) to \(R_e^{\phi}(t) < 1\), which may explain why the median prediction with GInPipe is close to one for the entire interval. The estimates are qualitatively different for the second interval (mid May–mid June), where GInPipe estimates \(R_e^{\phi}(\tau) < 1\), while BEAST2 estimates \(R_e^{\mathrm{BEAST}}(\tau) \approx 1\). Again, GInPipe estimates a peak in summer (mid June to mid August, \(R_e^{\phi}(\tau) > 1\)). While BEAST2 predicts the onset of transmission in the second wave to already start in mid August (R_e(τ) > 1), GInPipe estimates the first major rise in infections at the end of September.

For Victoria we observe an \(R_e^{\phi}(t) > 1\) until mid March in the daily estimates. Overall, R_e is < 1 for the first interval between mid March and May, versus R_e > 1 between June and August. Again, we see various peaks around June and July in the daily R_e estimates with the proposed method. For the final interval, both methods slightly disagree, with \(R_e^{\mathrm{BEAST}} < 1\) and \(R_e^{\phi}(\tau) > 1\), though the daily \(R_e^{\phi}(t)\) are decreasing towards the end of the final interval.

In addition to the phylodynamic inference of \(R_e^{\mathrm{BEAST}}\), we also implemented phylodynamic incidence reconstruction using EpiInf for Scotland³¹. Incidence trajectories from EpiInf, GInPipe and reported cases are shown in Supplementary Fig. 1. GInPipe estimates the timing of the first (April 2020) and second wave (November 2020) in congruence with the reported cases, while EpiInf estimates the first wave to occur mid May and may underestimate the magnitude of the second wave. In addition, EpiInf estimates a peak in August that is represented neither in the reporting data nor in GInPipe's estimates. With regard to the third wave (January 2021), both EpiInf and GInPipe disagree with the rapid decline seen in the reported cases from January 2021.

In terms of computational time, the entire GInPipe analysis pipeline runs in 25 min on the full Denmark data set (n = 40,575 sequences) and in 7 min on the Victoria data set (n = 10,710 sequences) on a single notebook (2.3 GHz, 2 cores). Furthermore, GInPipe does not require pre-assigning any intervals, excluding particular strains, constructing a phylogenetic tree, or clustering sequences based on their phylogenetic relationship. The BEAST2 analysis alone required about 15 h on an Intel Xeon E5-2687W (3.1 GHz, 2 × 12 cores) on a sub-sampled data set (n ≈ 2500 sequences), with additional computation time needed to construct a multiple sequence alignment and approximate transmission clusters. Despite recent advances to improve the application of phylogenetic methods to large genomic data sets³² (https://beast.community/thorney_beast), these methods remain computationally expensive and advanced knowledge is required to apply them properly to bigger data sets.

Reconstructed incidence histories

We used GInPipe to reconstruct complete incidence histories for Denmark, Scotland, Switzerland, and Victoria (Australia) from publicly available full-length SARS-CoV-2 sequencing data provided through GISAID^23,24 (Supplementary Note 4). In Fig. 3, we compare the reconstructed incidence histories (blue lines and dots, left axis) to the 7-day rolling average of officially reported new cases (red line, right axis). Overall, the reconstructed incidence estimates reflect the different pandemic waves deduced from the reporting data, although there are quantitative differences between the reconstructed and reported incidence trajectories over time. In particular, during the first wave in Scotland and Victoria (Fig. 3b, d) our method estimates higher incidences than reported, whereas the curves align at later points for the second and third waves. It is worth mentioning that testing capacities were particularly low in Scotland in April (during the first wave), suggesting extensive under-reporting in the initial phase of the pandemic. This is also supported by test positivity rates of almost 40% during April 2020 in Scotland (Supplementary Fig. 2). In Victoria, sufficient testing capacities were not available until May, but test positivity rates were already declining from April to May (Supplementary Fig. 2). This indicates that the first wave may have been under-reported in magnitude, but had vanished by May.

**Fig. 3: Incidence reconstruction based on sequencing data.**

Interestingly, the proposed incidence reconstruction method predicts small summer waves in August in the three European countries (Fig. 3a–c) that are not visible in the reporting data. In the reconstruction, these summer waves are immediately followed by the second SARS-CoV-2 wave. For the second wave, the profiles of the reconstructed incidence histories match the profiles of the reported cases, particularly in Denmark, Scotland, and Victoria (Fig. 3a, b, d). For Scotland, our method predicts a more long-lasting third wave with rising incidence rates until February 2021 and a moderate decline with several smaller peaks until May, whereas the reporting data indicate a peak in January 2021 with a subsequent fast regression. The argument that ongoing vaccination in Great Britain could explain the immediate decline of reported cases is countered by the fact that by March 2021 only about 2% of the Scottish population were fully vaccinated. Moreover, phylodynamic incidence reconstruction using EpiInf³¹ (Supplementary Fig. 1) also suggests a more long-lasting third wave in Scotland.

For Switzerland, we predict a larger wave around January–February 2021 (third wave) that is not reflected in the reporting data. Towards the end of the prediction horizon, from March 2021 onwards, the reported cases and the incidence estimation both indicate a rise in numbers (fourth wave).

In addition to the countries analyzed above, we further reconstruct incidence trajectories for Japan, Chile, India, and South Africa for the entire time span from the onset of the pandemic until mid 2021 (see Supplementary Fig. 3). They demonstrate GInPipe's ability to reconstruct incidence histories with very limited sequencing data. Particularly for Chile, India, and South Africa, the amount of accessible data is considerably smaller than for the countries analyzed in Fig. 3. All four pandemic waves for Japan and the two major waves for South Africa were reconstructed. For India, and to some extent Chile, the reconstructions indicate sustained high-level spread from early 2020 until February 2021, when the pandemic started to expand massively.

Relative case detection rate

We investigated whether the proposed incidence reconstruction method may be used to learn about the proportion of infected cases that are actually tested, detected and reported, P_t(tested∣infected).

The proportion of SARS-CoV-2 infected individuals who are actually reported can be calculated using Bayes' formula (see the "Methods" section). In order to perform the calculation, the proportion of actively infected individuals in the population P_t(infected) needs to be known. We have shown that the incidence correlates ϕ from our method are proportional to the number of infected individuals, c ⋅ ϕ_t = N_eff (Figs. 1d–e, 3), and hence to the probability of being infected P_t(infected). Consequently, we may use the reconstructed incidence profiles, together with the test sensitivity and specificity, the respective information about the proportion of positive tests, as well as the testing capacities for each country or region to calculate changes in the case detection rate, scaled by an unknown factor c.

In Fig. 4, we show the \(\log_2\)-scaled detection probabilities for Denmark, Scotland, Switzerland, and Victoria (Australia). The log scaling allows us to easily gauge the relative change in (under-)detection of the infected population over time (e.g., twofold, fourfold increase or decrease in case detection rate). The dashed vertical lines in the graphics indicate major changes in testing policies in the respective countries. Individual parameters used in the inference procedure, P(tested), P(inf∣tested), and c ⋅ P(infected) are shown in Supplementary Fig. 2.
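As a rough illustration of this calculation, Bayes' formula gives P_t(tested∣infected) = P_t(infected∣tested) ⋅ P_t(tested) / P_t(infected). Replacing P_t(infected) by the incidence correlate ϕ_t leaves the result known only up to a constant factor, which drops out when changes are expressed relative to a baseline time point. The sketch below assumes that P_t(infected∣tested) has already been derived from the positivity rate; the adjustment for test sensitivity and specificity described in the "Methods" section is not reproduced, and the function name and all numbers are invented for the example.

```python
import math


def log2_relative_detection(p_tested, p_inf_given_tested, phi):
    """log2 fold change of the case detection rate relative to the first time point.

    p_tested[t]: fraction of the population tested at time t.
    p_inf_given_tested[t]: probability of being infected given a test was taken.
    phi[t]: incidence correlate, assumed proportional to P_t(infected).
    """
    # Bayes' formula up to the unknown constant c.
    unscaled = [
        pt * pit / ph
        for pt, pit, ph in zip(p_tested, p_inf_given_tested, phi)
    ]
    return [math.log2(x / unscaled[0]) for x in unscaled]


p_tested = [0.001, 0.002, 0.002]          # testing doubles at the second time point
p_inf_given_tested = [0.10, 0.10, 0.05]   # positivity halves at the third
phi = [100.0, 100.0, 200.0]               # incidence correlate doubles at the third
fold = log2_relative_detection(p_tested, p_inf_given_tested, phi)
print([round(x, 3) for x in fold])
```

On the log2 scale a value of +1 corresponds to a twofold increase and −1 to a twofold decrease in case detection relative to the baseline, which is how the fold changes quoted for Fig. 4 can be read.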

**Fig. 4: Relative case detection rate.**

For Denmark, we observe an initial period of massive SARS-CoV-2 under-detection in the beginning of March 2020 (Fig. 4a, upper panel), which coincides with very low testing capacities at the beginning of the pandemic (Fig. 4a, lower panel). From mid March, case detection stabilizes at a sixfold higher level, compared to the first week of March. The second interval begins around mid May with an important policy change, allowing every citizen to get tested without medical referral. Interestingly, compared to the fairly stable case detection levels from mid March to mid May, this policy change leads to a 2–3 fold drop in case detection in the summer months from July to September. Of note, while everybody was granted the possibility to test for SARS-CoV-2, testing capacities remained fairly unchanged (Fig. 4a, lower panel). According to our calculations, the largest proportion of infections remained undetected in July. From the end of August, testing capacities were steadily increased in Denmark (Fig. 4a, lower panel), particularly in Copenhagen and at the airports, followed by prioritized testing. From September on, this leads to a nearly eightfold increase of the case detection rate, with a peak in December. From the end of December the detection rate drops more than fourfold, despite continuous testing.

For Scotland (Fig. 4b), the earliest test data are available only from the end of March. Therefore, the data capture only the second part of the first wave (compare Fig. 3b). In the beginning of May, testing capacities were more than doubled (Fig. 3b, lower panel) and outbreak investigation intensified. This led to a doubling of the relative case detection rate from May, compared to the first phase. On 18 May, SARS-CoV-2 testing was opened for everyone with symptoms. However, testing capacities were only increased in July. This may have led to a drop in case detection from mid May to July, after which case detection increased and remained during August at roughly the levels achieved in May. After 25 August, testing capacities and accessibility of testing steadily increased. Accordingly, case detection increased about sixfold until winter 2020/21. From 25 November, testing capacities were further expanded, especially in the health sector, including hospital patients and health and social care staff, with fairly stable case detection rates. A further increase of testing capacities at the end of December allowed the probability of detecting infected individuals to double. From the beginning of 2021, the Scottish government pushed community testing in areas with high SARS-CoV-2 prevalence. At the same time, the proportion of positive tests starts to decline (Supplementary Fig. 2), and consequently the case detection rate collapses ninefold by April.

Similar to Denmark, Switzerland shows an initial period of massive SARS-CoV-2 under-detection in the beginning of March 2020 (Fig. 4c, upper panel), which coincides with very low testing capacities at the beginning of the pandemic (Fig. 4c, lower panel). When testing capacities increase by mid March, case detection rates grow 8-fold. However, from the beginning of April, we observe a drop in the probability to detect infections that lasts until mid May (overall 10-fold drop). This trend coincides with a drop of positivity rates (Supplementary Fig. 2), as well as the extension of testing criteria on 22 April: from this date, anybody with symptoms was allowed to get tested, despite the fact that the availability of tests was not increased (Fig. 4c, lower panel). From 18 May, tests were partly prioritized for hospitalized and vulnerable individuals. At the same time, testing capacities steadily increased and incidences dropped. As a net effect, the probability of detecting infected people increases steadily to a maximum at the end of October with a relative difference of nearly 20-fold compared to the low point in mid May. On 2 November, Switzerland begins to supply antigen-based RDT for self-testing as part of its COVID containment strategy. Interestingly, our model predicts that this led to a sharp decline in case detection, again corresponding with the decline in positivity rates (Supplementary Fig. 2). From 21 February 2021, further precautionary actions were taken, and the government recommended repeated testing. This is associated with a stable, but relatively low detection rate for infected people until the end of April 2021.

For the Australian state Victoria, the earliest data were available from the end of March 2020 (Fig. 4d), capturing the second part of the first SARS-CoV-2 wave. Detection probabilities in the first interval, until 14 April, changed proportionally to the testing capacities during that interval (Fig. 4d, upper and lower panel). On 14 April 2020, the testing criteria were expanded, allowing anyone with COVID-like symptoms to be tested. Unlike the situation in Switzerland, where we observed a downward trend in case detection after expanding the testing criteria (Fig. 4c), the detection probability in Victoria remains stable until the end of April. In contrast to Switzerland, testing capacities were increased when testing criteria were expanded. On 30 April, the government initiated a two-week testing blitz, a large, coordinated testing campaign to locate viral spread. The testing blitz was accompanied by mass sewerage testing and matched with a massive increase of testing capacities, which led, according to our simulations, to a fourfold increase in the probability to detect infected individuals. At the end of the testing blitz, testing capacities steadily decreased and the proportion of detected infections decreased drastically (by roughly ninefold). At the beginning of June, testing capacities rose again, matched by a rise in the proportion of detected cases. From 1 July onwards, several testing blitzes were conducted in outbreak regions, which seem to have stabilized case detection rates during the second wave of infections. After the second wave (end of August–September, Fig. 3d), case detection rates drop. From October 2020 onwards, our predictions become highly unreliable, as the credibility interval of the incidence estimates includes zero (compare Fig. 3d), so that the case detection rate can no longer be determined.

In general, we make two striking observations: Firstly, and quite intuitively, whenever more tests were conducted, the proportion of detected SARS-CoV-2 cases increased. Secondly, and unexpectedly, whenever testing criteria were relaxed, this led to a drop in the probability of case detection. We see this drop in mid May in Denmark and Scotland and in mid April in Switzerland. Importantly, the expansions of testing criteria were not, or only insufficiently, matched by increased testing capacities. Quite surprisingly, our simulations for Switzerland suggested a drop in case detection when antigen-based RDT self-testing became part of the national diagnostic strategy.

Discussion

SARS-CoV-2 continues to spread around the world, making epidemiological and molecular surveillance indispensable for the evaluation and guidance of public health interventions.

National and international sequencing efforts are underway that closely monitor the dynamics and evolution of the virus. In the global fight against SARS-CoV-2, many reconstructed sequences have been made broadly available through public databases, such as GISAID^23,24 and the COVID data portal. In this work, we introduce GInPipe, a pipeline that utilizes this data to reconstruct SARS-CoV-2 incidence histories.

Viral infections are often characterized by a transmission bottleneck³³, where only a very small number of viruses initiate the infection and subsequently replicate within the host. A sufficient number of viruses (viral load) is required for further transmission. Hence, the temporal window of infectiousness begins with the intra-host viral population reaching a sufficiently large abundance and ends with the virus becoming eliminated by the immune system (or drugs). In SARS-CoV-2, this window only spans a few days and consequently the virus is almost always transmitted within days after infection, in contrast to HIV, HBV or HCV^25,26. If neutral or favourable mutations occur during this time, they may become abundant enough to be passed on to other hosts³³. The consequence is a well-defined duration of intra-patient evolutionary time in which the virus can randomly mutate and become transmitted subsequently. In SARS-CoV-2, this intra-patient evolutionary time appears to be short and the analysis of outbreak clusters indicates that the virus genomes from linked cases were separated by either none, or very few mutations^34,35,36. The brevity of evolutionary time before transmission may thus result in relatively homogeneous evolutionary changes between consecutive cases, which would imply strong correlations between evolutionary changes and the number of infections. This evolutionary signal allows GInPipe to reconstruct SARS-CoV-2 incidence histories solely from time-stamped viral genomes.

This presumption may also hold for other respiratory viruses, depending on the rate at which they evolve. In Supplementary Note 1, section SN.1.15, we analyzed whether GInPipe is sensitive to changes in the evolutionary rate. We found that, as long as the evolutionary rate is sufficiently high to produce a measurable evolutionary signal, GInPipe can reliably reconstruct incidence histories.

However, we would expect the method to work less well for sexually transmitted or blood-borne diseases caused by HIV, HBV, or HCV, where the virus can continuously evolve for weeks or years before being transmitted, causing a very heterogeneous signal that may fail to link viral evolution to the number of infections. For example, for a chronic infection like HIV, the consensus sequence in an individual changes over time, even without any onward infections taking place. Therefore, particularly for chronic viral infections, population-level viral evolution is likely affected by both the number of infected individuals and the generation time of the infection (the average time to pass on the infection)³⁷.

In the past, numerous approaches have been published that aim to estimate the effective population size from genetic properties (reviewed in refs. ^38,39). A variety of methods utilize the information of temporal changes in allele frequency (reviewed in ref. ³⁸), while others build on population genetic theory and phylodynamic reconstruction^40,41,42. GInPipe is inspired by the recent work of Khatri and Burt²⁷, which has foundations in population genetic theory. Khatri and Burt derived a method to infer the current effective population size with soft selective sweeps from fixed mutations of different origins. They derived a simple function of the mean number of origins and the current allele frequency.

In contrast to ref. ²⁷, we are interested in the history of the effective population size. Therefore, we seek to assess the effective population size per time instance, using time-stamped sequences that are assigned to bins of temporally adjacent sequences. For each bin, we investigate the current population. We utilize the number of haplotypes as an approximation for the mutational input. Akin to the equation in Khatri and Burt²⁷, the raw evolutionary signal is put in relation to the number of muta...
