Snapshot Semantics for Temporal Multiset Relations (Extended Version)

Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations and significantly outperforms such implementations on queries that involve aggregation.


INTRODUCTION
Recently, there is renewed interest in temporal databases fueled by the fact that abundant storage has made long term archival of historical data feasible.This has led to the incorporation of temporal features into the SQL:2011 standard [27] which defines an encoding of temporal data associating each tuple with a validity period.We refer to such relations as SQL period relations.Note that SQL period relations use multiset semantics.Period relations are supported by many DBMSs, e.g., PostgreSQL [34], Teradata [44], Oracle [30], IBM DB2 [35], and MS SQLServer [29].However, none of these systems, with the partial exception of Teradata, supports snapshot semantics, an important class of temporal queries.Given a temporal database, a non-temporal query Q interpreted under snapshot semantics returns a temporal relation that assigns to each point in time the result of evaluating Q over the snapshot of the database at this point in time.This fundamental property of snapshot semantics is known as snapshot-reducibility [28,42].A specific type of snapshot semantics is the so-called sequenced semantics [7] which in addition to snapshot-reducibility enforces an-  other property called change preservation that determines how time points are grouped into intervals in a snapshot query result.
Example 1.1 (Snapshot Aggregation).Consider the SQL period relation works in Figure 1a that records factory workers, their skills, and when they are on duty.The validity period of each tuple is stored in the temporal attribute period.
To simplify examples, we restrict the time domain to the hours of 2018-01-01 represented as integers 00 to 23.The company requires that at least one SP worker is in the factory at any given time.This can be checked by evaluating the following query under snapshot semantics.
Q onduty : SELECT count( * ) AS cnt FROM works WHERE skill = 'SP' Evaluated under snapshot semantics, a query returns a snapshot (time-varying) result that records when the result is valid, i.e., Q onduty returns the number of SP workers that are on duty at any given point of time.The result is shown in Figure 1b.For instance, at 08:00am two SP workers (Ann and Joe) are on duty.The query exposes several safety violations, e.g., no SP worker is on duty between 00 and 03.
In the example above, safety violations correspond to gaps, i.e., periods of time where the aggregation's input is empty.As we will demonstrate, all approaches for snapshot semantics that we are aware of do not return results for gaps (tuples marked in red) and, therefore, violate snapshot-reducibility.Teradata [44, p.149] for instance, realized the importance of reporting results for gaps, but in contrast to snapshot-reducibility provides gaps in the presence of grouping, while omitting them otherwise.As a consequence, in 1 arXiv:1902.04938v1[cs.DB] 13 Feb 2019 our example these approaches fail to identify safety violations.We refer to this type of error as the aggregation gap bug (AG bug).
Similar to the case of aggregation, we also identify a common error related to snapshot bag difference (EXCEPT ALL).
Example 1.2 (Snapshot Bag Difference).Consider again Figure 1.Relation assign records machines (mach) that need to be assigned to workers with a specific skill over a specific period of time.For instance, the third tuple records that machine M3 requires a non-specialized (NS) worker for the time period [03, 16).To determine which skill sets are missing during which time period, we evaluate the following query under snapshot semantics: The result in Figure 1c indicates that one more SP worker is required during the periods [06, 08) and [10,12).
Many approaches treat bag difference as a NOT EXISTS subquery, and therefore do not return a tuple t from the left input if this tuple exists in the right input (independent of their multiplicity).For instance, the two tuples for the SP workers (highlighted in red) are not returned, since there exists an SP worker at each snapshot in the works relation.This violates snapshot-reducibility.We refer to this type of error as the bag difference bug (BD bug).
The interval-based representation of temporal relations creates an additional problem: the encoding of a temporal query result is typically not unique.For instance, tuple (Ann, SP , [03, 10)) from the works relation in Figure 1 can equivalently be represented as two tuples (Ann, SP , [03, 08)) and (Ann, SP , [08, 10)).We refer to a method that determines how temporal data and snapshot query results are grouped into intervals as an interval-based representation system.A unique and predictable representation of temporal data is a desirable property, because equivalent relational algebra expressions should not lead to syntactically different result relations.This problem can be addressed by using a representation system that associates a unique encoding with each temporal database.Furthermore, overlap between multiple periods associated with a tuple and unnecessary splits of periods complicate the interpretation of data and, thus, should be avoided if possible.Given these limitations and the lack of implementations for snapshot semantics queries over bag relations, users currently resort to manually implementing such queries in SQL which is time-consuming and error-prone [39].We address the above limitations of previous approaches for snapshot semantics and develop a framework based on the following desiderata: (i) support for set and multiset relations, (ii) snapshot-reducibility for all operations, and (iii) a unique interval-based encoding of temporal relations.Note that while previous work on sequenced semantics (e.g., [16,18]) also aims to support snapshot-reducibility, we emphasize a unique encoding instead of trying to preserve intervals from the input of a query.We address these desiderata using a three-level approach.Note that we focus on data with a single time dimension, but are oblivious to whether this is transaction time or valid time.First, we introduce an abstract model that supports both sets and multisets, and by definition is snapshot-reducible.This model, however, uses a verbose encoding of temporal data and, thus, is not practical.Afterwards, we develop a more compact logical model as a representation system, where the complete temporal history of all equivalent tuples from the abstract model is stored in an annotation attached to one tuple.The abstract and the logical models leverage the theory of Krelations, which are a general class of annotated relations that cover both set and multiset relations.For our implementation, we use SQL over period relations to ensure compatibility with SQL:2011 and existing DBMSs.We prove the equivalence between the three layers (i.e., the abstract model, the logical model and the implementation) and show that the logical model determines a unique interval-encoding for the implementation and a correct rewriting scheme for queries over this encoding.
Our main technical contributions are: • Abstract model: We introduce snapshot K-relations as a generalization of snapshot set and multiset relations.These relations are by definition snapshot-reducible.• Logical model: We define an interval-based representation, termed period K-relations, and prove that these relations are a compact and unique representation system for snapshot semantics over snapshot K-relations.We show this for the full relational algebra plus aggregation (RA agg ).• We achieve a unique encoding of temporal data as period Krelations by generalizing set-based coalescing [10].
• We demonstrate that the multiset version of period Krelations can be encoded as SQL period relations, a common interval-based model in DBMSs, and how to translate queries with snapshot semantics over period K-relations into SQL.• We implement our approach as a database middleware and present optimizations that eliminate redundant coalescing steps.We demonstrate experimentally that we do not need to sacrifice performance to achieve correctness.

RELATED WORK
Temporal Query Languages.There is a long history of research on temporal query languages [6,22].Many temporal query languages including TSQL2 [38,40], ATSQL2 (Applied TSQL2) [8], IXSQL [28], ATSQL [9], and SQL/TP [46] support sequenced semantics, i.e., these languages support a specific type of snapshot semantics.In this paper, we provide a general framework that can be used to correctly implement snapshot semantics over period set and multiset relations for any language.
Interval-based Approaches for Sequenced Semantics.In the following, we discuss interval-based approaches for sequenced semantics.Table 1 shows for each approach whether it supports multisets, whether it is free of the aggregation gap and bag difference bugs, and whether its interval-based encoding of a sequenced query result is unique.An N/A indicates that the approach does not support the operation for which this type of bug can occur or the semantics of this operation is not defined precisely enough to judge its correctness.Note that while temporal query languages may be defined to apply sequenced semantics and, thus, by definition are snapshot-reducible, (the specification of) their implementation might fail to be snapshot-reducible.In the following discussion of the temporal query languages in Table 1, we refer to their semantics as provided in the referenced publication(s).
Interval preservation (ATSQL) [9, Def.2.10] is a representation system for SQL period relations (multisets) that tries to preserve the intervals associated with input tuples, i.e., fragments of all intervals (including duplicates) associated with the input tuples "survive" in the output.Interval preservation is snapshot-reducible for multiset semantics for positive relational algebra [36] (selection, projection, join, and union), but exhibits the aggregation gap and bag difference bug.Moreover, the period encoding of a query result is not unique as it depends both on the query and the input representation.Teradata [44] is a commercial DBMS that supports sequenced operators using ATSQL's statement modifiers.The implementation is based on query rewriting [2] and does not support difference.Teradata's implementation exhibits the aggregation gap Interval preservation [9] (ATSQL) × × × Teradata [44] × N/A ×1 Change preservation [16,18] × × N/A × TSQL2 [38,40,42] × N/A N/A ATSQL2 [8] N/A × × TimeDB [43] (ATSQL2) N/A × × SQL/Temporal [41] × × × SQL/TP [46] 2 × Our approach bug.Since the application of coalescing is optional, the encoding of snapshot relations as period relations is not unique.Change preservation [18,Def. 3.4] determines the interval boundaries of a query result tuple t based on the maximal interval for which there is no change in the input.To track changes, it employs the lineage provenance model in [16] and the PI-CS model in [18].The approach uses timestamp adjustment in combination with traditional database operators, but does not provide a unique encoding, exhibits the AG bug, and only supports set semantics.Our work addresses these issues and significantly generalizes this approach, in particular by supporting bag semantics.TSQL2 [38,40,42] implicitly applies coalescing [10] to produce a unique representation.Thus, it only supports set semantics, and it does not support aggregation.Snodgrass et al. [41] present a validtime extension of SQL/Temporal and an algebra with sequenced semantics.The algebra supports multisets, but exhibits both the aggregation gap and bag difference bug.Since intervals from the input are preserved where possible, the interval representation of a snapshot relation is not unique.TimeDB [43] is an implementation of ATSQL2 [8].It uses a semantics for bag difference and intersection that is not snapshot-reducible (see [43, pp. 63]).Our approach is the first that supports set and multiset relations, is resilient against the two bugs, and specifies a unique interval-encoding.
Non-sequenced Temporal Queries.Non-sequenced temporal query languages, such as IXSQL [28] and SQL/TP [46], do not explicitly support sequenced semantics.Nevertheless, we review these languages here since they allow to express queries with sequenced semantics.SQL/TP [46] introduces a point-wise semantics for temporal queries [12,45], where time is handled as a regular attribute.Intervals are used as an efficient encoding of time points, and a normalization operation is used to split intervals.The language supports multisets and a mechanism to manually produce sequenced semantics.However, sequenced semantics queries are specified as the union of non-temporal queries over snapshots.Even if such subqueries are grouped together for adjacent time points where the non-temporal query's result is constant this still results in a large number of subqueries to be executed.Even worse, the number of subqueries that is required is data dependent.Also, the interval-based encoding is not unique, since time points are grouped into intervals depending on query syntax and encoding of the input.While this has no effect on the semantics since SQL/TP queries cannot distinguish between different interval-based encodings of a temporal database, it might be confusing to users that observe different query results for equivalent queries/inputs.
Implementations of Temporal Operators.A large body of work has focused on the implementation of individual temporal algebra operators such as joins [11,17,32] and aggregation [5,31,33].Some exceptions supporting multiple operators are [13,18,25].These approaches introduce efficient evaluation algorithms for a particular semantics of a temporal algebra operator.Our approach can utilize efficient operator implementations as long as (i) their semantics is compatible with our interval-based encoding of snapshot query results and (ii) they are snapshot-reducible.
Coalescing.Coalescing produces a unique representation of a set semantics temporal database.Böhlen et al. [10] study optimizations for coalescing that eliminate unnecessary coalescing operations.Zhou et al. [47] and [1] use analytical functions to efficiently implement coalescing in SQL.We generalize coalescing to K-relations to define a unique encoding of interval-based temporal relations, including multiset relations.Similar to [10], we remove unnecessary K-coalescing steps and, similar to [47], we use OLAP functions for efficient implementation.
Temporality in Annotated Databases.Kostiley et al. [26] is to the best of our knowledge the only previous approach that uses semiring annotations to express temporality.The authors define a semiring whose elements are sets of time points.This approach is limited to set semantics, and no interval-based encoding was presented.The LIVE system [15] combines provenance and uncertainty annotations with versioning.The system uses interval timestamps, and query semantics is based on snapshot-reducibility [15,Def. 2].However, computing the intervals associated with a query result requires provenance to be maintained for every query result.

SOLUTION OVERVIEW
In this section, we give an overview of our three-level framework, which is illustrated in Figure 2.
Abstract model -Snapshot K-relations.As an abstract model we use snapshot relations which map time points to snapshots.Queries over such relations are evaluated over each snapshot, which trivially satisfies snapshot-reducibility.To support both sets and multisets, we introduce snapshot K-relations [20], which are snapshot relations where each snapshot is a K-relation.In a K-relation, each tuple is annotated with an element from a domain K.For example, relations annotated with elements from the semiring N (natural numbers) correspond to multiset semantics.The result of a snapshot query Q over a snapshot K-relation is the result of evaluating Q over the K-relation at each time point.
Example 3.1 (Abstract Model).Figure 2 (bottom) shows the snapshots at times 00, 08, and 18 of an encoding of the running example as snapshot N-relations.Each snapshot is an N-relation where tuples are annotated with their multiplicity (shown with shaded background).For instance, the snapshot at time 08 has three tuples, each with multiplicity 1.The result of query Q onduty is shown on the bottom right.Every snapshot in the result is computed by running Q onduty over the corresponding snapshot in the input.For instance, at time 08 there are two SP workers, i.e., cnt = 2.
Logical Model -Period K-relations.We introduce period Krelations as a logical model, which merges equivalent tuples over all snapshots from the abstract model into one tuple.In a period K-relation, every tuple is annotated with a temporal K-element that is a unique interval-based representation for all time points of the merged tuples from the abstract model.We define a class Figure 2: Overview of our approach.Our abstract model is snapshot K-relations and nontemporal queries over snapshots (snapshot semantics).Our logical model is period K-relations and queries corresponding to the abstract model's snapshot queries.Our implementation uses SQL period relations and rewritten non-temporal queries implementing the other model's snapshot queries.Each model is associated with transformations to the other models which commute with queries (modulo the rewriting REWR when mapping to the implementation). of semirings called period semirings whose elements are temporal K-elements.Specifically, for any semiring K we can construct a period semiring KT whose annotations are temporal Kelements.For instance, NT is the period semiring corresponding to semiring N (multisets).We define necessary conditions for an interval-based model to correctly encode snapshot K-relations and prove that period K-relations fullfil these conditions.Specifically, we call an interval-based model a representation system iff the encoding of every snapshot K-relation R is (i) unique and (ii) snapshot-equivalent to R. Furthermore, (iii) queries over encodings are snapshot-reducible.
Example 3.2 (Logical Model).Figure 2 (middle) shows an encoding of the running example as period K-relations.For instance, all tuples (Ann, SP ) from the abstract model are merged into one tuple in the logical model with annotation {[03, 10) → 1, [18,20) → 1}, because at each time point during [03, 10) and [18,20) a tuple (Ann, SP ) with multiplicity 1 exists.In Section 4.2, we will introduce a mapping ENC N from snapshot N to NT -relations and the time slice operator τT which restores an the snapshot at time T .
Implementation -SQL Period Relations.To ensure compatibility with the SQL standard, we use SQL period relations in our implementation and translate snapshot semantics queries into SQL queries over these period relations.For this we define an encoding of NT -relations as SQL period relations (PERIODENC) together with a rewriting scheme for queries (REWR).We present an implementation of our framework as a database middleware that exposes snapshot semantics as a new language feature in SQL and rewrites snapshot queries into SQL queries over SQL period relations.That is, we directly evaluate snapshot queries over data stored natively as period relations.

SNAPSHOT K-RELATIONS
We first review background on the semiring annotation framework (K-relations).Afterwards, we define snapshot K-relations as our abstract model and snapshot semantics for this model.Importantly, queries over snapshot K-relations are snapshot-reducible by construction.Finally, we state requirements for a logical model to be a representation system for this abstract model.

K-relations
In a K-relation [20], every tuple is annotated with an element from a domain K of a commutative semiring K.A structure (K, +K , •K , 0K , 1K ) over a set K with binary operations +K and •K is a commutative semiring iff (i) addition and multiplication are commutative, associative, and have a neutral element (0K and 1K , respectively); (ii) multiplication distributes over addition; and (iii) multiplication with zero returns zero.Abusing notation, we will use K to denote both a semiring structure as well as its domain.
Consider a universal countable domain U of values.An n-ary K-relation R over U is a (total) function that maps tuples (elements from U n ) to elements from K with the convention that tuples mapped to 0K are not in the relation.Furthermore, we require that R(t) = 0K only holds for finitely many t.Two semirings are of particular interest to us: The semiring (B, ∨, ∧, f alse, true) with elements true and false using ∨ as addition and ∧ as multiplication corresponds to set semantics.The semiring (N, +, •, 0, 1) of natural numbers with standard arithmetics corresponds to multisets.
The operators of the positive relational algebra [36] (RA + ) over K-relations are defined by applying the +K and •K operations of the semiring K to input annotations.Intuitively, the +K and •K operations of the semiring correspond to the alternative and conjunctive use of tuples, respectively.For instance, if an output tuple t is produced by joining two input tuples annotated with k and k , then the tuple t is annotated with k •K k .Below we provide the standard definition of RA + over K-relations [20].For a tuple t, we use t.A to denote the projection of t on a list of projection expressions A and t[R] to denote the projection of t on the attributes of relation R. For a condition θ and tuple t, θ(t) denotes a function that returns 1K if t |= θ and 0K otherwise.Definition 4.1 (RA + over K-relations).Let K be a semiring, R, S denote K-relations, t, u denote tuples of appropriate arity, and k ∈ K. RA + on K-relations is defined as: We will make use of homomorphisms, functions from the domain of a semiring K1 to the domain of a semiring K2 that commute with the semiring operations.Since RA + over K-relations is defined in terms of these operations, it follows that semiring homomorphisms commute with queries, as was proven in [20].
Consider the N-relations shown below which are non-temporal versions of our running example.Query Q = Π mach ( works assign) returns machines for which there are workers with the right skill to operate the machine.Under multiset semantics we expect M1 to occur in the result of Q with multiplicity 8 since (M 1, SP ) joins with (P ete, SP ) and with (Bob, SP ).Evaluating the query in N yields the expected result by multiplying the annotations of these join partners.Given the N result of the query, we can compute the result of the query under set semantics by applying a homomorphism h which maps all non-zero annotations to true and 0 to false.For example, for result (M 1) we get h(8) = true, i.e., this tuple is in the result under set semantics.

Snapshot K-relations
We now formally define snapshot K-relations, snapshot semantics over such relations, and then define representation systems.We assume a totally ordered and finite domain T of time points and use ≤ T to denote its order.Tmin and Tmax denote the minimal and maximal (exclusive) time point in T according to ≤ T , respectively.We use T + 1 to denote the successor of T ∈ T according to ≤ T .
A snapshot K-relation over a relation schema R is a function T → R K,R , where R K,R is the set of all K-relations with schema R. Snapshot K-databases are defined analog.We use DB T,K to denote the set of all snapshot K-databases for time domain T.
For instance, a snapshot N-relation is shown in Figure 2 (bottom).Given a snapshot K-relation, we use the timeslice operator [23] to access its state (snapshot) at a time point T : The evaluation of a query Q over a snapshot database (set of snapshot relations) D under snapshot semantics returns a snapshot relation Q(D) that is constructed as follows: for each time point T ∈ T we have Q(D)(T ) = Q(D(T )).Thus, snapshot temporal queries over snapshot K-relations behave like queries over K-relations for each snapshot, i.e., their semantics is uniquely determined by the semantics of queries over K-relations.Definition 4.4 (Snapshot Semantics).Let D be a snapshot Kdatabase and Q be a query.The result Q(D) of Q over D is a snapshot K-relation that is defined point-wise as follows: For example, consider the snapshot N-relation shown at the bottom of Figure 2 and the evaluation of Q onduty under snapshot semantics as also shown in this figure.Observe how the query result is computed by evaluating Q onduty over each snapshot individually using multiset (N) query semantics.Furthermore, since τT (Q(R)) = Q(R)(T ), per the above definition, the timeslice operator commutes with queries: τT (Q(R)) = Q(τT (R)).This property is snapshot-reducibility.

Representation Systems
To compactly encode snapshot K-relations, we study representation systems that consist of a set of representations E, a function ENC : E → DB T,K which associates an encoding in E with the snapshot K-database it represents, and a timeslice operator τT which extracts the snapshot at time T from an encoding.If ENC is injective, then we use ENC −1 (D) to denote the unique encoding associated with D. We use τ to denote the timeslice over both snapshot databases and representations.It will be clear from the input which operator τ refers to.For such a representation system, we consider two encodings D1 and D2 from E to be snapshotequivalent [21] (written as D1 ∼ D2) if they encode the same snapshot K-database.Note that this is the case if they encode the same snapshots, i.e., iff for all T ∈ T we have τT (D1) = τT (D2).For a representation system to behave correctly, the following conditions have to be met: 1) uniqueness: for each snapshot K-database D there exists a unique element from E representing D; 2) snapshotreducibility: the timeslice operator commutes with queries; and 3) snapshot-preservation: the encoding function ENC preserves the snapshots of the input.Definition 4.5 (Representation System).We call a triple (E, ENC, τ) a representation system for snapshot K-databases with regard to a class of queries C iff for every snapshot database D, encodings E, E ∈ E, time point T , and query Q ∈ C we have

TEMPORAL K-ELEMENTS
We now introduce temporal K-elements that are the annotations we use to define our logical model (representation system).Temporal K-elements record, using an interval-based encoding, how the K-annotation of a tuple in a snapshot K-relation changes over time.We introduce a unique normal form for temporal K-elements based on a generalization of coalescing [10].

Defining Temporal K-elements
To define temporal K-elements, we need to introduce some background on intervals.Given the time domain T and its associated total order ≤ T , an interval I = [T b , Te) is a pair of time points from T, where T b < T Te.Interval I represents the set of contiguous time points {T | T ∈ T ∧ T b ≤ T T < T Te}.For an interval I = [T b , Te) we use I + to denote T b and I − to denote Te.We use I, I , I1, . . . to represent intervals.We define a relation adj(I1, I2) that contains all interval pairs that are adjacent: adj(I1, I2) ⇔ (I1 − = I2 + ) ∨ (I2 − = I1 + ).We will implicitly understand set operations, such as t ∈ I or I1 ⊆ I2, to be interpreted over the set of points represented by an interval.Furthermore, I ∩ I denotes the interval that covers precisely the intersection of the sets of time points defined by I and I and I ∪ I denotes their union (only well-defined if I ∩ I = ∅ or adj(I, I )).
For convenience, we define We use I to denote the set of all intervals over T. Definition 5.1 (Temporal K-elements).Given a semiring K, a temporal K-element T is a function I → K.We use TEK to denote the set of all such temporal elements for K.
We represent temporal K-elements as sets of input-output pairs.Intervals that are not explicitly mentioned are mapped to 0K .
Example 5.1.Reconsider our running example with T = {00, . . ., 23}.The history of the annotation of tuple t = (Ann,SP) from the works relation is as shown in Figure 2 (middle).For sake of the example, we change the multiplicity of this tuple to 3 during [03, 09) and 2 during [18,20).This information is encoded as the temporal N-element Note that a temporal K-element T may map overlapping intervals to non-zero elements of K. We assign the following semantics to overlap: the annotation at a time point T recorded by T is the sum of the annotations assigned to intervals containing T .For instance, the annotation at time 04 for the N-element T = {[00, 05) → 2, [04, 05) → 1} would be 2 + 1 = 3.To extract the annotation valid at time T from a temporal K-element T , we define a timeslice operator for temporal K-elements as follows: Given two temporal K-elements T1 and T2, we would like to know if they represent the same history of annotations.For that, we define snapshot-equivalence (∼) for temporal K-elements:

A Normal Form Based on K-Coalescing
The encoding of the annotation history of a tuple as a temporal K-element is typically not unique.
To be able to build a representation system based on temporal K-elements we need a unique way to encode the annotation history of a tuple as a temporal K-element (condition 1 of Definition 4.5).That is, we need to define a normal form that is unique [11, 13) for snapshot-equivalent temporal K-elements.To this end, we generalize coalescing, which was defined for temporal databases with set semantics in [10,37].The generalized form, which we call Kcoalescing, coincides with standard coalescing for semiring B (set semantics) and, for any semiring K, yields a unique encoding.K-coalescing creates maximal intervals of contiguous time points with the same annotation.The output is a temporal Kelement such that (a) no two intervals mapped to a non-zero element overlap and (b) adjacent intervals assigned to non-zero elements are guaranteed to be mapped to different annotations.To determine such intervals, we define annotation changepoints, time points T where the annotation of a temporal K-element differs from the annotation at T − 1, i.e., τT (T ) = τT −1(T )).It will be convenient to also consider Tmin as an annotation changepoint.Definition 5.2 (Annotation Changepoint).Given a temporal Kelement T , a time point T is called a changepoint in T if one of the following conditions holds: We use CP (T ) to denote the set of all annotation changepoints for T .Furthermore, we define CP I(T ) to be the set of all intervals that consist of consecutive change points: In Definition 5.2, CP I(T ) computes maximal intervals such that the annotation assigned by T to each point in such an interval is constant.In the coalesced representation of T only such intervals are mapped to non-zero annotations.

Definition 5.3 (K-Coalesce).
Let T be a temporal K-element.We define K-coalescing CK as a function TEK → TEK : We use TECK to denote all normalized temporal K-elements, i.e., elements T for which CK (T ) = T for some T .
Example 5.3.Consider the SQL period relation shown in Figure 3.The temporal N-elements encode the history of tuples (30k), (40k) and (50k).Note that T 30k is not coalesced since the two non-zero intervals of this N-element overlap.Applying N-coalesce we get: That is, this tuple occurs twice within the time interval [3,10) and once in [10,13), i.e., it has annotation changepoints 3, 10, and 14.Interpreting the same relation under set semantics (semiring B), the history of (30k) can be encoded as a temporal B-element That is, this tuple occurs (is annotated with true) within the time interval [3,13) and its annotation changepoints are 3 and 14.
We now prove several important properties of the K-coalesce operator establishing that TECK (coalesced temporal K-elements) is a good choice for a normal form of temporal K-elements.Lemma 5.1.Let K be a semiring and T , T1 and T2 temporal Kelements.We have: Proof.All proofs are shown in Appendix A.

PERIOD SEMIRINGS
Having established a unique normal form of temporal Kelements, we now proceed to define period semirings as our logical model.The elements of a period semiring are temporal K-elements in normal form.We prove that these structures are semirings and ultimately that relations annotated with period semirings form a representation system for snapshot K-relations for RA + .In Section 7, we then prove them to also be a representation system for RA agg , i.e., queries involving difference and aggregation.
When defining the addition and multiplication operations and their neutral elements in the semiring structure of temporal Kelements, we have to ensure that these definitions are compatible with semiring K on snapshots.Furthermore, we need to ensure that the output of these operations is guaranteed to be K-coalesced.The latter can be ensured by applying K-coalesce to the output of the operation.For addition, snapshot reducibility is achieved by pointwise addition (denoted as +K P ) of the two functions that constitute the two input temporal K-elements.That is, for each interval I, the function that is the result of the addition of temporal K-elements T1 and T2 assigns to I the value T1(I) +K T2(I).For multiplication, the multiplication of two K-elements assigned to an overlapping pair of intervals I1 and I2 is valid during the intersection of I1 and I2.Since both input temporal K-elements may assign non-zero values to multiple intervals that have the same overlap, the resulting K-value at a point T would be the sum over all pairs of overlapping intervals.We denote this operation as •K P .Since +K P and •K P may return a temporal K-element that is not coalesced, we define the operations of our structures to apply CK to the result of +K P and •K P .The zero element of the temporal extension of K is the temporal K-element that maps all intervals to 0 and the 1 element is the temporal element that maps every interval to 0K except for [Tmin, Tmax) which is mapped to 1K .Definition 6.1 (Period Semiring).For a time domain T with minimum Tmin and maximum Tmax and a semiring K, the period semiring KT is defined as: where for k, k ∈ TECK and : Thus, as expected, the result records that, e.g., there are two skilled workers (SP) on duty during time interval [08, 10).
Having defined the family of period semirings, it remains to be shown that KT with standard K-relational query semantics is a representation system for snapshot K-relations.

KT is a Semiring
As a first step, we prove that for any semiring K, the structure KT is also a semiring.The following lemma shows that Kcoalesce can be redundantly pushed into +K P and •K P operations.Lemma 6.1.Let K be a semiring and k, k ∈ TECK .Then, Using this lemma, we now prove that for any semiring K, the structure KT is also a semiring.Theorem 6.2.For any semiring K, structure KT is a semiring.

Timeslice Operator
We define a timeslice operator for KT -relations based on the timeslice operator for temporal K-elements.We annotate each tuple in the output of this operator with the result of τT applied to the temporal K-element the tuple is annotated with.Definition 6.2 (Timeslice for KT -relations).Let R be a KTrelation and T ∈ T. The timeslice operator τT (R) is defined as: We now prove that the τT is a homomorphism KT → K. Since semiring homomorphisms commute with queries [20], KT equipped with this timeslice operator does fulfill the snapshotreducibility condition of representation systems (Definition 4.5).Theorem 6.3.For any T ∈ T, the timeslice operator τT is a semiring homomorphism from KT to K.
As an example of the application of this homomorphism, consider the period N-relation works from our running example as shown on the left of Figure 2. Applying τ08 to this relation yields the snapshot shown on the bottom of this figure (three employees work between 8am and 9am out of whom two are specialized).If we evaluate query Q onduty over this snapshot we get the snapshot shown on the right of this figure (the count is 2).By Theorem 6.3 we get the same result if we evaluate Q onduty over the input period N-relation and then apply τ08 to the result.

Encoding of Snapshot K-relations
We now define a bijective mapping ENCK from snapshot Krelations to KT -relations.We then prove that the set of KTrelations together with the timeslice operator for such relations and the mapping ENCK −1 (the inverse of ENCK ) form a representation system for snapshot K-relations.Intuitively, ENCK (R) is constructed by assigning each tuple t a temporal K-element where the annotation of the tuple at time T (i.e., R(T )(t)) is assigned to a singleton interval [T, T + 1).This temporal K-element TR,t is then coalesced to create a TECK element.Definition 6.3.Let K be a semiring and R a snapshot K-relation, ENCK is a mapping from snapshot K-relations to KT -relations defined as follows.
We first prove that this mapping is bijective, i.e., it is invertible, which guarantees that ENCK −1 is well-defined and also implies uniqueness (condition 1 of Definition 4.5).Lemma 6.4.For any semiring K, ENCK is bijective.
Next, we have to show that ENCK preserves snapshots, i.e., the instance at a time point T represented by R can be extracted from ENCK (R) using the timeslice operator.Based on these properties of ENCK and the fact that the timeslice operator over KT -relations is a homomorphism KT → K, our main technical result follows immediately.That is, the set of KT -relations equipped with the timeslice operator and ENCK −1 is a representation system for positive relational algebra queries (RA + ) over snapshot K-relations.Theorem 6.6 (Representation System).Given a semiring K, let DBK T be the set of all KT -relations.
The triple (DBK T , ENCK −1 , τ) is a representation system for RA + queries over snapshot K-relations.

COMPLEX QUERIES
Having proven that KT -relations form a representation system for RA + , we now study extensions for difference and aggregation.

Difference
Extensions of K-relations for difference have been studied in [3,19].For instance, the difference operator on N relations corresponds to bag difference (SQL's EXCEPT ALL).Geerts et al. [19] apply an extension of semirings with a monus operation that is defined based on the natural order of a semiring and demonstrated how to define a difference operation for K-relations based on the monus operation for semirings where this operations is welldefined.Following the terminology introduced in this work, we refer to semirings with a monus operation as m-semirings.We now prove that if a semiring K has a well-defined monus, then so does KT .From this follows, that for any such K, the difference operation is well-defined for KT .We proceed to show that the timeslice operator is an m-semiring homomorphism, which implies that KTrelations for any m-semiring K form a representation system for RA (full relational algebra).The definition of a monus operator is based on the so-called natural order K .For two elements k and For instance, N is naturally ordered ( N corresponds to the order of natural numbers) while Z is not (for any k, k ∈ Z we have k Z k ).For the monus to be well-defined on K, K has to be naturally ordered and for any k, k ∈ K, the set {k | k K k +K k } has to have a smallest member.For any semiring fulfilling these two conditions, the monus operation −K is defined as k −K k = k where k is the smallest element such that k K k + k .For instance, the monus for N is the truncating minus: Theorem 7.1.For any m-semiring K, semiring KT has a welldefined monus, i.e., is an m-semiring.
Let k −K P k denote an operation that returns a temporal Kelement which assigns to each singleton interval [T, T + 1) the result of the monus for K: τT (k) −K τT (k ) (this is kpmin as defined in the proof of Theorem 7.1, see Appendix A).In the proof of Theorem 7.1, we demonstrate that k Obviously, computing k −K P k using singleton intervals is not effective.In our implementation, we use a more efficient way to compute the monus for KT that is based on normalizing the input temporal K-elements k and k such that annotations are attached to larger time intervals where k −K P k is guaranteed to be constant.Importantly, τT is a homomorphism for monus-semiring KT .Theorem 7.2.Mapping τT is an m-semiring homomorphism.
For example, consider Q skillreq from Example 1.2 which can be expressed in relational algebra as Π skill (assign) − Π skill (worker).The NT -relation corresponding to the period relation assign shown in this example annotates each tuple with a singleton temporal N-element mapping the period of this tuple to 1, e.g., (M1, SP) is annotated with {[03, 12) → 1}.The annotation of result tuple (SP) is computed as As expected, the result is the same as the one from Example 1.2.

Aggregation
The K-relational framework has previously been extended to support aggregation [4].This required the introduction of attribute domains which are symbolic expressions that pair values with semiring elements to represent aggregated values.Since the construction used in this work to derive the mathematical structures representing these symbolic expressions is applicable to all semirings, it is also applicable to our period semirings.It was shown that semiring homomorphisms can be lifted to these more complex annotation structures and attribute domains.Thus, the timeslice operator, being a semiring homomorphism, commutes with queries including aggregation, and it follows that using the approach from [4], we can define a representation system for snapshot K-relations under RA with aggregation, i.e., RA agg .
One drawback of this definition of aggregation over K-relations with respect to our use case is that there are multiple ways of encoding the same snapshot K-relation in this model.That is, we would loose uniqueness of our representation system.Recall that one of our major goals is to implement snapshot query semantics on-top of DBMS using a period multiset encoding of NT -relations.The symbolic expressions representing aggregation function results are a compact representation which, in case of our interval-temporal semirings, encode how the aggregation function results change over time.However, it is not clear how to effectively encode the symbolic attribute values and comparisons of symbolic expression as multiset semantics relations, and how to efficiently implement our snapshot semantics over this encoding.Nonetheless, for N, we can apply a simpler definition of aggregation that returns a KT relation and is also a representation system.For simplicity, we define aggregation Gγf (A) (R) grouping on G to compute a single aggregation function f over the values of an attribute A. For convenience, aggregation without group-by, i.e., γ f (A) (R) is expressed using an empty group-by list.Definition 7.1 (Aggregation).Let R be a NT relation.Operator Gγf (A) (R) groups the input on a (possibly empty) list of attributes G = (g1, . . ., gn) and computes aggregation function f over the values of attribute A. This operator is defined as follows: In the output of the aggregation operator, each tuple t is annotated with a N-coalesced temporal N-element which is constructed from singleton intervals.A singleton interval I = [T, T + 1) is mapped to 1 if evaluating the aggregation over the multiset relation corresponding to the snapshot at T returns tuple t.We now demonstrate that NT using this definition of aggregation is a representation system for snapshot N-relations.
Theorem 7.3.NT -relations form a representation system for snapshot N-relations and RA agg queries using aggregation according to Definition 7.1.

SQL PERIOD RELATION ENCODING
While provably correct, the annotation structure that we have defined is quite complex in nature raising concerns on how to efficiently implement it.We now demonstrate that NT -relations (multisets) can be encoded as SQL period relations (as shown on the top of Figure 2).Recall that SQL period relations are multiset relations where the validity time interval (period) of a tuple is stored in an interval-valued attribute (or as two attributes storing interval end points).Queries over NT are then translated into non-temporal multiset queries over this encoding.In addition to employing a proven and simple representation of time this enables our approach to run snapshot queries over such relations without requiring any preprocessing and to implement our ideas on top of a classical DBMS.For convenience we represent SQL period relations using non-temporal N-relations in the definitions.SQL period relations can be obtained based on the well-known correspondence between multiset relations and N-relations: we duplicate each tuple based on the multiplicity recorded in its annotation.To encode NT -relations as N-relations we introduce an invertible mapping PERIODENC.We rewrite queries with NT -semantics into non-temporal queries with N-semantics over this encoding using a rewriting function REWR.This is illustrated in the commutative diagram below.
Our encoding represents a tuple t annotated with a temporal element T as a set of tuples, one for each interval I which is assigned a non-zero value by T .For each such interval, the interval's end points are stored in two attributes A begin and A end , which are appended to the schema of t.Again, we use t → k to denote that tuple t is annotated with k and U to denote a universal domain of values.We use SCH(R) to denote the schema of relation R and arity(R) to denote its arity (the number of attributes in the schema).Definition 8.1 (Encoding as SQL Period Relations).PERIODENC is a function from NT -relations to N-relations.Let R be a NT relation with schema SCH(R) = {A1, . . ., An}.The schema of PERIODENC(R) is {A1, . . ., An, A begin , A end }.Let R be PERIODENC(R) for some NT -relation.PERIODENC and its inverse are defined as follows: Before we define the rewriting REWR that reduces a query Q with NT semantics to a query with N semantics, we introduce two operators that we will make use of in the reduction.The N-coalesce operator applies C N to the annotation of each tuple in its input.Definition 8.2 (Coalesce Operator).Let R be PERIODENC(R ) for some NT -relation R .The coalesce operator C(R) is defined as: The split operator NG(R, S) splits the intervals in the temporal elements annotating a tuple t based on the union of all interval end points from annotations of tuples t which agree with t on attributes G. Inputs R and S have to be union compatible.The effect of this operator is that all pairs of intervals mapped to non-zero elements are either the same or are disjoint.This operator has been applied in [16,18] and in [12,46].We use it to implement snapshotreducible aggregation and difference over intervals instead of single snapshots as in Section 7. Recall that in Section 7, the monus (difference) and aggregation were defined in a point-wise manner.The split operator allows us to evaluate these operations over intervals directly by generating tuples with intervals for which the result of these operations is guaranteed to be constant.Definition 8.3 (Split Operator).The split operator NG(R1, R2) takes as input two N-relations R1 and R2 that are encodings of NT -relations.For a tuple t in such an encoding let I(t) = [t.A begin , t.A end ).The split operator is defined as: Note that the PERIODENC and PERIODENC −1 mappings are only used in the definitions of the coalesce and split algebra operators for ease of presentation.These operators can be implemented as SQL queries executed over an PERIODENC-encoded relation.Example 8.1.Reconsider query Q onduty from Example 1.1 and its results for the logical model and period relations (Figure 2).In relational algebra, the input query is written as ). Applying REWR we get: Subquery REWR(Q1) filters out the second tuple from the input (see Figure 2).The split operator is then applied to the union of the result of REWR(Q1) and a tuple with the neutral element null for the aggregation function and period [Tmin, Tmax), where Tmin = 0 and Tmax = 24 for this example.After the split N , the aggregation is evaluated grouping the input on A begin , A end .The count aggregation function then either counts a sequence of 1s and a single null value producing the number of facts that overlap over the corresponding period [A begin , A end ), or counts a single null value over a "gap" producing 0. For instance, for [08, 10) there are two facts whose intervals cover this period (Ann and Sam) and, thus, (2, [08, 10)) is returned by REWR(Q onduty ).While for for [20,24) there are no facts and thus we get (0, [20,24)).

IMPLEMENTATION
We have implemented the encoding and rewriting introduced in the previous section in a middleware which supports snapshot multiset semantics through an extension of SQL.To instruct the system to interpret a subquery using snapshot semantics, the user encloses the subquery in a SEQ VT (...) block.We assume that the inputs to a snapshot query are encoded as period multiset relations, i.e., each relation has two temporal attributes that store the begin and end timestamp of their validity interval.For each relation access within a SEQ VT block, the user has to specify which attributes store the period of a tuple.
Our coalescing and split operators can be expressed in SQL.Thus, a straightforward way of incorporating these operators into the compilation process is to devise additional rewrites that produce the relational algebra code for these operators where necessary.However, preliminary experiments demonstrated that a naive implementation of these operators is prohibitively expensive.
We address this problem in two ways.First, we observe that it is sufficient to apply coalesce as a last step in a query instead of applying it as part of every operator rewrite.Applying this optimization, the rewritten version of a query will only contain one coalesce operator.Recall from Lemma 6.1 that coalescing can be redundantly pushed into the addition and multiplication operations of period semirings, e.g., CK (k +K P k ) = CK (CK (k) +K P k ).We prove that this Lemma also holds for monus in Appendix E. Interpreting this equivalence from right to left and applying it repeatedly to a semiring expression e, e can be rewritten into an equivalent expression of the form CK (e ), where e is an expression that only uses operations +K P , •K P , −K P .Since relational algebra over Krelations is defined by applying multiplication, addition, and monus to input annotations, this implies that it is sufficient to apply coalescing only as a final operation in a query.For an example and additional discussion see Appendix E We developed an optimized implementation of multiset coalescing using SQL analytical window functions, similar to set-based coalescing in [47], that counts for value-equivalent attributes the number of open intervals per time point, determines change points based on differences between these counts, and then only output maximal intervals using a filter step.This implementation uses sorting in its window declarations and has time complexity O(n log n) for n tuples.A native implementation would require only one sorting step.The number of sorting steps required by our SQL implementation depends on whether the DBMS is capable of sharing window declaration (we observe 2 and 7 sorting steps for the systems used in our experimental evaluation).
For aggregation we integrate the split operator into the aggregation.It turned out to be most effective to pre-aggregate the input before splitting and then compute the final aggregation results during the split step by further aggregating the results of the preaggregation step.We apply a similar optimization for difference.

EXPERIMENTS
In our experimental evaluation we focus on two aspects.First, we evaluate the cost of our SQL implementation of N-coalescing (multiset coalescing).Then, we evaluate the performance of snapshot queries with our approach over three DBMSs and compare it against native implementations of snapshot semantics that are available in two of these systems (using our implementation of coalescing to produce a coalesced result).

Workloads and Experimental Setup
Datasets.We use three datasets in our experiments.The MySQL Employees dataset (https://github.com/datacharmer/test_db) which contains ≈4 million records and consists of the following six period tables: table employee stores basic information about employees; table departments stores department information; table titles stores the job titles for employees; table salaries stores employee salaries; table dept manager stores which employee manages which department; and table dept emp stores which employee is working for which department.TPC-BiH is the bi-temporal version of the TPC-H benchmark dataset as described in [24].Since our approach supports only one time dimension we only generated the valid time dimension for this dataset.In this configuration a scale factor 1 (SF1) database corresponds to roughly 1GB of data.
The Tourism dataset (835k records) consists of a single table storing hotel reservations in South Tyrol.Each record corresponds to one reservation.The validity end points of the time period associated with a record is the arrival and departure time.

Workloads.
We have created a workload consisting of 10 queries to evaluate the efficiency of snapshot queries.Queries join-1 to join-4 are join queries, agg-1 to agg-3 are aggregation-heavy queries, agg-join is a join with an aggregation value, and diff-1 and diff-2 use difference.Furthermore, we use one query template varying the selectivity to evaluate the performance of coalescing.C-Sn denotes the variant of this query that returns approximately nK rows, e.g., C-S1 returns 1,000 rows.For the Tourism dataset we use the following queries.join: tourist from same country to same destination using a self join of tourismdata table.agg-0: number of tourists per destination together with the average number of tourists for all other destinations.This query first computes the number of tourists per destination and do a self unequal join on it.agg-1: number of enquiries and the number of tourists per destination with more than 1000 enquiries using two aggregations on tourismdata table.agg-2: maximum number of tourists per destination using an aggregation on tourismdata table.tou-agg-x: the destination with the most number of tourists.This query has no join but two aggregations, one to compute the number of tourists per destination and a second one to compute the maximum one.More detailed descriptions of these queries are provided in Appendix B. For the TPC-BiH dataset we took 9 of the 22 standard queries [14] from this benchmark that do not contain nested subqueries or LIMIT (which are not supported by our or any other approach for snapshot queries we are aware of) and evaluated these queries under snapshot semantics.Note that some of these queries use the ORDER BY clause that we do not support for snapshot queries.However, we can evaluate such a query without ORDER BY under snapshot semantics and then sort the result without affecting what rows are returned.The number of rows returned by these queries over the dataset are shown in Table 2. Systems.We ran experiments on three different database management systems: a version of Postgres (PG) with native support for temporal operators as described in [16,18]; a commercial DBMS, DBX, with native support for snapshot semantics (only available as a virtual machine); and a commercial DBMS, DBY, without native support for snapshot semantics.We used our approach to translate snapshot queries into standard SQL queries and ran the translated queries on all three systems (denoted as PG-Seq, DBX-Seq, and DBY-Seq).For PG and DBX, we ran the queries also with the native solution for snapshot semantics paired with our implementation of coalescing to produce a coalesced result (referred to as PG-Nat and DBX-Nat).As explained in Section 2, no system correctly implements snapshot multiset semantics for difference and aggregation, and many systems do not support snapshot semantics for these operators at all.DBX-Nat and PG-Nat both support snapshot aggregation, however, their implementations are not snapshot-reducible.DBX-Nat does not support snapshot difference, whereas PG-Nat implements temporal difference with set semantics.Despite such differences, the experimental comparison allows us to understand the performance impact of our provably correct approach.All experiments were executed on a machine with 2 AMD Opteron 4238 CPUs, 128GB RAM, and a hardware RAID with 4 × 1TB 72.K HDs in RAID 5.For Postgres we set the buffer pool size to 8GB.For the other systems we use values recommended by the automated configuration tools of these systems.We execute queries with warm cache.For short-running queries we show the median runtime across 100 consecutive runs.For long running queries we computed the median over 10 runs.In general we observed low variation in runtimes (a few percent).

Multiset Coalescing
To evaluate the performance of coalescing, we use a selection query that returns employees that earn more than a specific salary and materialize the result as a table.The selectivity varies from 1K to 3M rows.We then evaluate the query SELECT * FROM ... over the materialized tables under snapshot semantics in order to measure the cost of coalescing in isolation.Figure 5 shows the results of this experiment.The runtime of coalescing is linear in the input size for all three systems.Even though the theoretical worstcase complexity of the sorting step, which is applied by all systems to evaluate the analytics functions that we exploit in our SQLbased implementation of multiset coalescing, is O(n • log(n)), an inspection of the execution plans revealed that the sorting step only amounts to 5%-10% of the execution time (for all selectivities) and, hence, is not a dominating factor.

Snapshot Semantics -Employee
Table 3 provides an overview of the performance results for our snapshot query workloads.For every query we indicate in the rightmost column whether native approaches are subject to the aggregation gap (AG) or bag difference (BD) bugs.
Join Queries.The performance of our approach for join queries is comparable with the native implementation in PG-Nat.For join queries with larger intermediate results (join-2), the native implementation outperforms our approach by ≈73%.Running the queries produced by our approach in DBY is slightly faster than both.DBX-Nat uses merge joins for temporal joins, while both PG and DBY use a hash-join on the non-temporal part of the join condition.The result is that DBX-Nat significantly outperforms the other methods for temporal join operations.However, the larger cost for the SQL-based coalescing implementation in this system often outweighs this effect.This demonstrates the potential for improving our approach by making use of native implementations of temporal operators in our rewrites for operators that are compatible with our semantics (note that joins are compatible).
Aggregation Queries.Our approach outperforms the native implementations of snapshot semantics on all systems by several orders of magnitude for aggregation queries as long as the aggregation input exceeds a certain size (agg-1 and agg-2).Our approach as well as the native approaches split the aggregation input which requires sorting and then apply a standard aggregation operator to compute the temporal aggregation result.The main reason for the large performance difference is that the SQL code we generate for a snapshot aggregation includes several levels of preaggregation that are intertwined with the split operator.Thus, for our approach the sorting step for split is applied to a typically much smaller pre-aggregated dataset.This turned out to be quite effective.The only exception is if the aggregation input is very small (agg-3) in which case an efficient implementation of split (as in PG-Nat) outweighs the benefits of pre-aggregation.Query agg-1 did not finish on DBX-Nat as it exceeded the 2GB temporary space restriction (memory allocated for intermediate results) of the freely available version of this DBMS.
Mixed Aggregation and Join.Query agg-join applies an aggregation over the result of several joins.Our approach is more effective, in particular for the aggregation part of this query, compared to PG-Nat.This query did not finish on DBX due to the 2GB temporary space restriction per query imposed by the DBMS.
Difference Queries.For difference queries we could only compare our approach against PG-Nat, since DBX-Nat does not support difference in snapshot queries.Note that, PG-Nat applies set dif-ference while our approach supports multiset difference.While our approach is less effective for diff-1 which contains a single difference operator, we outperform PG-Nat on diff-2.

Snapshot Semantics -TPC-BiH
The runtimes for TPC-H queries interpreted under snapshot semantics (9 queries are currently supported by the approaches) over the 1GB and 10GB valid time versions of TPC-BiH is also shown in Table 3.For this experiment we skip DBX since the limitation to 2GB of temporary space of the free version we were using made it impossible to run most of these queries.Overall we observe that our approach scales roughly linearly from 1GB to 10GB for these queries.We significantly outperform PG-Nat because all of these queries use aggregation.Additionally, some of these queries use up to 7 joins.For these queries the fact that PG-Nat aligns both inputs with respect to each other [16] introduces unnecessary overhead and limits join reordering.The combined effect of these two drawbacks is quite severe.Our approach is 1 to 3 orders of magnitude faster than PG-Nat.For some queries this is a lower bound on the overhead of PG-Nat since the system timed out for these queries (we stopped queries that did not finish within 2 hours).

Snapshot Semantics -Tourism
The results for the queries over the Tourism database are shown in the middle of Table 3.We only report our approach for Postgres and DBY, and the native implementation in Postgres.With the exception of query tou-agg-2 our approach outperforms PG-Nat quite significantly since all these queries contain aggregation.Since query tou-agg-2 does use max we do not apply our sweeping technique (see Appendix E.3).PG-Nat's native implementation of the split operator results in 30% better performance for this query.Query tou-join-agg applies an inequality self-join over an aggregation result (≈ 100k rows under snapshot semantics) and then applies a final aggregation to the join.The large size of this join result is the main reason

Summary
Our experiments demonstrate that an SQL-based implementation of multiset coalescing is feasible -exhibiting runtimes linear in the size of the input, albeit with a relatively large constant factor.We expect that it would be possible to significantly reduce this factor by introducing a native implementation of this operator.Using pre-aggregation during splitting, our approach significantly outperforms native implementations for aggregation queries.DBX uses merge joins for temporal joins (interval overlap joins) which is significantly more efficient than hash joins which are employed by Postgres and DBY.This shows the potential of integrating such specialized operators with our approach in the future.For example, we could compile snapshot queries into SQL queries that selectively employ the temporal extensions of a system like DBX.

CONCLUSIONS AND FUTURE WORK
We present the first provably correct interval-based representation system for snapshot semantics over multiset relations and its implementation in a database middleware.We achieve this goal by addressing a more general problem: snapshot-reducibility for temporal K-relations.Our solution is a uniform framework for evaluation of queries under snapshot semantics over an intervalbased encoding of temporal K-relations for any semiring K.That is, in addition to sets and multisets, the framework supports snapshot temporal extensions of probabilistic databases, databases annotated with provenance, and many more.In future work, we will Push Through Multiplication: Analog to the proof for addition, we prove this part by showing that snapshot equivalence of inputs implies snapshot equivalence of outputs for multiplication.
Based on the fact that timeslice is a homomorphism KT → K which we will prove in Theorem 6.3, time slice commutes with multiplication and addition: Proof of Theorem 6.2.We have to show that the structure we have defined obeys the laws of commutative semirings.Since the elements of KT are functions, it suffices to show k(I) = k (I) for every I ∈ I to prove that k = k .For all k, k ∈ KT and I ∈ I: Addition is commutative: Addition is associative: Zero is neutral element of addition: Multiplication is commutative: Multiplication is associative: One is neutral element of multiplication: Commutes with multiplication: Let n1, . . ., n l denote the elements k(I) for all intervals from the set of intervals with T ∈ I and k(I) = 0. Analog, let m1, . . ., mo bet the set of elements with the same property for k .Then the sum can be rewritten as: replacing this again with the interval notation we get: Proof of Lemma 6.4.injective: We have to show that for any two snapshot K-relations R and R , ENCK (R) = ENCK (R ) ⇒ R = R .Since, CK preserves snapshot equivalence and is a unique representation of any temporal K-element T , it is sufficient to show that for all t, we have TR,t = T R ,t instead.For sake of contradiction, assume that there exists a tuple t such that TR,t = T R ,t .Then there has to exist T ∈ T such that TR,t([T, T + 1)) = T R ,t ([T, T + 1)).However, based on the definition of TR,t this implies that R(T )(t) = R (T )(t) which contradicts the assumption.surjective: Given a KT -relation R, we construct a snapshot Proof of Lemma 6.5.By virtue of snapshot equivalence between CK (T ) and T and based on the singleton interval definition of TR,t in ENCK , we have for any tuple t: Proof of Theorem 6.6.We have to show that (DBK T , ENCK −1 , τ) fulfills conditions (1), (2), and (3) of Definition 4.5 to prove that this triple is a representation system for K-relations.Conditions (1) and ( 2) have been proven in Lemmas 6.4 and 6.5, respectively.Condition (3) follows from the fact that τT is a homomorphism (Theorem 6.3) and that semiring homomorphisms commute with RA + -queries ( [20], Proposition 3.5).
Proof of Theorem 7.1.To prove that KT has a well-defined monus, we have to show KT is naturally ordered and that for any k and k , the set {k | k K T k + k } has a unique smallest element according to K T .A semiring is a naturally ordered if K T is a partial order (reflexive, antisymmetric, and transitive).
Note that we only have to prove that K T is antisymmetric, since reflexivity and transitivity of the natural order follows from the semiring axioms and, thus, holds for all semirings.Antisymmetric: We have to show that ∀k, k ∈ KT : for all T ∈ T which can only be the case if k ∼ k .Since k and k are coalesced it follows that k = k .Unique Smallest Element Exists: It remains to be shown that {k | k K T k + k } has a smallest member for all k, k ∈ KT .We give a constructive proof by constructing the smallest such element kmin.kmin is defined by coalescing an element kpmin that consists of singleton intervals ([T, T + 1)) as follows: Substituting the definition of kmin and using the fact that τT commutes with addition, for every time point T we distinguish two cases.Either τT (k ) K τT (k) in which case τT (k) −K τT (k ) = 0K and we have: Otherwise for τT (k ) It remains to be shown that kmin is minimal.For contradiction assume that there exists a smaller such member k alt .Then there has to exist at least one time point T such that τT (k alt ) ≺K τT (kmin).We have to distinguish two cases.If τT (k) K τT (k ), then τT (kmin) = 0K .However, since 0K ≤ k for any k ∈ K this leads to a contradiction.Otherwise τT Proof of Theorem 7.2.We have to prove that τT Proof of Theorem 7.3.By construction, the result of aggregation is a NT relation (it is coalesced).Also by construction, we have τT Proof of Theorem 8.1.To prove the relationships in the commutative diagram of Equation (1), we have to prove that PERIODENC −1 (PERIODENC(R)) = R and that queries commute with PERIODENC if rewritten using REWR, i.e., PERIODENC( Selection: op = σ θ : A selection is rewritten as Q = C(σ θ (REWR(Q))).Consider an input tuple t from R. The temporal K-element T annotating tuple t is represented as a set of tuples of the form (t, I + , I − ) for some interval I.If t fulfills the selection, then t is annotated with T in the result.In R , all of these tuples are in the result of Q if t |= θ and applying PERIODENC −1 we get T as the annotation of t.If t does not fulfill the condition then t is annotated with 0 in both encodings.Projection: op = ΠA: A projection is rewritten by adding the attributes encoding the interval associated to a tuple to the projection expressions.There will be one tuple (t, I + , I − ) in the result for each interval I assigned a non-zero annotation in R(u) for any tuple u projected on tuple t.Function PERIODENC −1 creates the annotation of an output as a temporal element that maps each interval mapping to a non-zero annotation in PERIODENC(R) to that annotation.This corresponds to addition of singleton temporal elements and based on the fact that addition is associative this implies that the annotation of t in the output will be the sum of temporal elements R(u) for each u projected onto t.Thus, the claim holds.Aggregation: op = γ f (A) : The rewrite for aggregation without group-by utilizes the split operation N we have defined.Note that N ∅ returns a NT -relation S where for any pair of tuples t and t and any pair of intervals I1 and I2 we have That is, all intervals with nonzero annotations from any pair of temporal elements do not overlap or are the same.From that follows that for any two time points T1, T2 ∈ I for an interval I that is mapped to n = 0 in the annotation of at least one tuple S, the value of the result of aggregation is the same for the snapshots at T1 and T2.Thus, grouping by the interval boundaries yields the expected result with the exception of an empty snapshot.However, since a tuple (0 f , Tmin, Tmax) is added to the input, the aggregation will produce 0 (count) or NULL (other aggregation functions) for intervals containing only empty snapshots.This does not effect the result of the aggregation for non-empty snapshots, because 0 f is the neutral element of the aggregation function f .Aggregation: op = Gγf (A) : For aggregation with group-by, split is applied grouping on G and no additional tuple (0 f , Tmin, Tmax) is added to the input.Since the tuples within one group are split, the argument we have used above for aggregation without group-by applies also to aggregation with group-by.
For binary operators WLOG let Q = op(Q l , Qr) where the total number of operators in Q l and Qr is n.Join: op = Q l θ Qr: Consider a tuple t that is the result of joining tuples u and v. Let Tu and Tv be the temporal elements annotating u and v in the input, respectively.Based on the definition of the rewriting, in the result of the rewritten join there will be a tuple t, I + , I − annotated with Iu,Iv Q l (u) • Qr(v) for all intervals Iu and Iv such that I = Iu ∩ Ic.This corresponds to the definition of multiplication (join) in NT .Union: op = Q l ∪ Qr: Union is rewritten as a union of the rewritten inputs.For any tuple t, let T l = Q l (t) and Tr = Qr(t).In the result of the union applied by Q a tuple (t, I + , I − ) for each interval I will be annotated with T l (I) + Tr(I).The result of the union is then coalesced.Applying PERIODENC −1 the annotation computed for t is equivalent to C N (T l +K P Tr).Difference: op = Q l − Qr: A difference is rewritten by applying difference to the pairwise normalized inputs.Recall that the monus operator of NT associates the result of the monus for N to each snapshot of a temporal N-element.Since the split operator adjusts intervals such that there is no overlap, the claim holds.

C. PULLING-UP COALESCING
The main overhead of our approach for snapshot temporal queries compared to non-temporal query processing is the extensive use of coalescing, which can be expensive if naively implemented in SQL.Furthermore, the application of coalescing after each operation may prevent the database optimizer from applying standard optimizations such as join reordering.To address this issue, we now investigate how to reduce the number of coalescing steps.In fact, we demonstrate that it is sufficient to apply coalescing as a last step in query processing instead of applying it to intermediate results.Similar optimizations have been proposed by Bowman et al. [12] for their multiset temporal normalization operator and by Böhlen et al. [10] for set-coalescing.
Consider how a RA + query Q is evaluated over an KTdatabase.RA + over K-relations computes the annotation of a tuple in the result of a query using the addition and multiplication operations of the semiring.That is, the annotation of any result tuple is computed using an arithmetic expression over the annotations of tuples from the input of the query.In the case of a semiring KT , addition and multiplication are defined as coalescing a temporal element that is computed based on point-wise application of the addition (multiplication) operations of semiring K (denoted as +K P and •K P ).Recall from Lemma 6.1 that coalescing can be redundantly pushed into the addition and multiplication operations of interval-temporal semirings, e.g., CK (k +K P k ) = CK (CK (k) +K P k ).Interpreting this equivalence from right to left and applying it repeatedly to an arithmetic expression e using +K T and •K T , the expression can be rewritten into an equivalent expression of the form CK (e ), where e is an expression that only uses operations +K P and •K P .Now consider expressions that also include applications of the monus operator −K T .This operator is defined as CK (k −K P k ).The −K P operator computes the timeslice of the inputs at every point in time and then applies −K to each timeslice.According to Lemma 5.1, τT (k) ∼ τT (CK (k )).Thus, the result of −K P is independent of whether the input is coalesced or not.
Lemma C.1.Any arithmetic expression e using operations and elements from an period m-semiring KT is equivalent to an expression of the form CK (e ), where e only contains operations +K P , •K P , and −K P .
Lemma C.1 implies that it is sufficient to apply coalescing as a last step in a rewritten query REWR(Q) instead of after each operator.
Corollary C.2 (Coalesce Pullup).For any RA query Q, REWR(Q) is equivalent to a query Q which is derived from REWR(Q) by removing all but the outermost coalescing operator.
Proof.Operations +K T , •K T , and −K T are defined as applying CK to the result of operations +K P , •K P , and −K P , respectively.Thus, expression e is equivalent to an expression that interleaves the CK as well as +K P , −K P , and •K P operations.To prove this, we first prove that the following equivalence holds: Consider the definition of −K P .Every interval I = [T, T + 1) is assigned the annotation τT (k) −K τT (k ).Applying Lemma 5.1 we get τT (k) = τT (CK (k)) and τT (k ) = τT (CK (k )).Thus, the equivalence holds.By repeatedly applying this equivalence and the equivalences proven in Lemma 6.1, all except the outermost K-coalesce operations can be removed resulting in an expression of the form CK (e ) where e does not contain any coalesce operations.
Example C.1.Consider the following query Q = S − Π sal (σ sal<sal (S × ρ sal ←sal (S)) that returns the largest salary from relation S as shown in Figure 3 (consider the corresponding NT -relation using the annotation shown on the right in this figure coalesced as shown in Example 5.3).Consider how the annotation of tuple r = (50k) in the result of Q is computed.Applying the definitions of difference, projection, and join over K-relations and denoting the database instance of S as D, we obtain: Pulling up coalesce we get:

D. INTERACTION OF OUR APPROACH WITH QUERY OPTIMIZATION
In this section we briefly discuss the impact of our rewrite-based approach for implementing snapshot semantics on query optimization.Importantly, the combination of uniqueness and snapshot reducibility guarantees that queries are equivalent wrt.our logical model precisely when they are equivalent under regular Krelational semantics.As a special case of this result, queries over NT relations are equivalent iff they are equivalent under bag semantics (N-relations).That is, any query equivalence that is applied by classical database optimizers, e.g., join reordering, can be applied to optimize snapshot queries.
That being said, we pass a rewritten query to the DBMS optimizer which is not aware of the fact that this query implements snapshot semantics.The preservation of bag semantics query equivalences does not necessary imply that these rewritten queries can be successfully optimized by a general purpose query optimizer.However, as we will explain in the following, our approach is designed to aid the database optimizer in finding a successful plan.First off, note that our rewrites essentially keep the structure of the input query intact with the exception of the introduction of split before aggregation and difference, and coalescing which is applied as a final step for every snapshot query.Every other operator is preserved in the rewritten query, e.g., joins, are rewritten into joins.This query returns for each person the city(ies) they live in.Applying REWR we get the query shown in Figure 6.Note how the structure of the input query was preserved.The exception are the coalescing operator at the end and the introduction of new projections.However, typically database optimizers will at least consider a transformation called subquery pull-up (called view merging in Oracle) which would pull-up and merge these projections.Thus, these projections do not hinder join reordering.

E. SQL IMPLEMENTATIONS OF BAG CO-ALESCING AND SPLIT
In the following, we explain our implementation of bag coalescing in SQL using a step by step example.Afterwards, we present the implementation of the split operator integrated with aggregation and (bag) difference.

E.1 Bag Coalesce
Figure 8 shows the SQL code for computing bag coalescing for a table recording the activity of production machines.Figure 9 shows an example instance of this table and the intermediate and final results produced by the query for this instance.Here we assume that periods are stored as two timestamp attributes tstart and t end recording the start and the end of the period.The input table active with the schema (mach, tstart, t end ) is shown on the topleft of Figure 9.Each row in the table records a time interval (from tstart to t end ) during which a machine (mach) is running.For convenience we show a timeline with the intervals encoded by this table.
Before explaining the steps of the SQL implementation, we review bag coalescing.To coalesce an input we have to determine for each tuple t its annotation change points, i.e., the end points of maximum intervals during which the multiplicity of the tuple does not change.Then for each adjacent pair of change points we output a number of duplicates of tuple t that is equal to the number of duplicates of tuple t in the input whose intervals cover the two change points.This could be implemented as a native operator which splits each tuples associated with a period into two tuples with the intervals end points where each generated tuple is marked to indicate whether it represents an interval start or end point.Then any aggregation algorithm can be applied to calculate the number of intervals associated with a tuple that open and close at a particular time point.The output of this step is then sorted on the non-temporal attributes and secondary on the timestamp attribute.The final result is produced by scanning through the sorted output once outputting for each tuple and adjacent pair of change points a number of duplicates determined based on the number of intervals covering these change points which is determined based on the counts of opening and closing intervals.We leave a native implementation and further optimizations (e.g., we could partition the input on the non-temporal attributes and then process multiple such partitions in parallel) to future work and now explain how our SQL implementation realizes the computational steps outlined above.
Determine the Number of Opening and Closing Intervals Per Change Point.In lines 3 -24 of Figure 9 we compute the annotation change points for each tuple and the number of intervals that are opening and closing for each such change point.This is done by counting for each tuple and one of its change points the number of opening and closing intervals separately and then for each such pair merge the number of opening and closing counts into a single output tuple.Note that strictly speaking not all of the time points returned by this query are guaranteed to be annotation change points.The actual change points are computed in one of the following steps as explained below.For the example instance there is only one pair (M 1, 5) where time point T (attribute t) is both the start and end point of an interval associated with tuple (M 1).As another example consider time point 6 which is the end point of two intervals associated with tuple (M 2) corresponding to the last tuple in the result of subquery change points.The pre-aggregation before the union is merely a performance tweak.It turns a single aggregation over 2 • |active| tuples into two aggregations over |active| tuples.
Counting Open Intervals.Line 26 -35 of the query create the common table expression num intervals which returns the number of open intervals for a tuple per time point T .This is achieved by subtracting the number of intervals for this tuple with an end point that is less than or equal to T (attribute t) from the number of intervals with a start point that is less than or equal to T .Intuitively, the number of open intervals for a tuple is the number of duplicates of the tuple that exist in the time interval between T and the adjacent following change point.We compute these running sums using SQL's window functions partitioning the input on the non-temporal attributes (mach in this example) and within each partition order the tuple based on the timestamp t computing the aggregate over a window including all tuples with a timestamp less than or equal to t.For example, consider the first tuples in the instance of num intervals as shown in Figure 9.This tuple records that there two duplicates of tuple (M 1) exist at time point 1.
Removing Spurious Change Points.Recall that bag coalescing determines maximal intervals during which the annotation (multiplicity in the case of bag semantics) of a tuple is constant.As shown in the example, num intervals may contain adjacent time points with the same number of open intervals which, according to the definition of K-coalescing, are not annotation change points.Subquery diff previous (Lines 38 -47) computes the difference between the number of open intervals at a time point and the previous time point.Subquery changed intervals (Lines 49-53) removes tuples where this difference is zero (the number of duplicates has not changed).
Reconstructing Intervals.At this point in the computation we have calculated the set of annotation changepoints for each tuple and the number of duplicates of the tuple that exist during the time interval between each two adjacent annotation change points.Subquery pair points (Lines 55-65) computes pairs of adjacent annotation change points.For instance, the first tuple in the result for the example records that there exists two duplicates of tuple (M1) during time interval [1,7].The subquery of pair points also returns tuples with tstart equal to the last change point of teach tuple.These tuples are filtered in the WHERE clause of the outer query.For example, the tuples marked in red in the result of the subquery as shown in Figure 9 are such tuples.
Generating Duplicates.In the last step, we generate duplicates of tuples based on the counts stored in attribute #open.One way to realize this would be to use a set-returning function that takes as input a tuple t and a count c, and returns c duplicates of t.While perfectly viable, to avoid the overhead of calling a userdefined function for every distinct output tuple, we use subquery max seq (Lines 67-72) to generate a table storing a sequence of numbers {1, . . ., m} where m is the maximum number of duplicates of any tuple in the query result.We then join this table with pair points (lines 74-76).For the example database the maximum number of duplicates for any tuple and time point is 2. Hence, subquery max seq returns {(1), (2)}.The final join with pair points then returns the appropriate number of duplicates for each tuple using the counts stored in #open, e.g., there are 2 duplicates of tuple (M 2) during time interval [3,6].

E.2 Split Operator Implementation
We first introduce our implementation of the split operator and then afterwards discuss the optimized versions of the aggregation and difference which incorporate split.We show the SQL code for N mach (active, active), i.e., splitting the intervals of relation active based on its own interval boundaries for attribute mach (the only attribute of this relation).Figure 10 shows the SQL code generated by our system and Figure 11 shows an example database the intermediate results of produced by the SQL implementation for this example.In lines 2-9 we assign aliases to the left and right input.For this particular example, both inputs are table active.
Computing Interval End Points.The first of the computation (lines 10-23) generates the set of all interval end points for both inputs.Note that for this example where a relation is split wrt.itself the four-way union is not necessary and can be replaced with a two-way union.In our implementation we apply this optimization, but for sake for the example we show the four-way union to illustrate how the approach would work when a relation is split wrt. to another relation.
Creating Unique Identifiers for Intervals.Next we assign a unique identifier to each tuple from the left input (lines 25-32).For instance, there are three such tuples in the example shown in Figure 11.
Pair Intervals with End Points.We now join the left input with all endpoints we have computed beforehand (lines 34-46) such that each interval from the left input is paired with all end points it contains with the exception of the maximum point in the interval.Intuitively the purpose of this step is to creating sufficiently many duplicates of each input tuples to be able to generate the split versions of the interval for this tuple.Furthermore, the end points we have paired with an interval will be the starting points of the split intervals.In Figure 11 we highlight tuples with colors to indicate which tuples correspond to the same input interval.For example, the first two tuples in the result of split points correspond to the tuple with id 1 and the starting points of the two intervals this interval will be split into (end point 4 is contained in the interval [1,7]).
Generating Split Intervals.Finally, we adjust the start (tstart) and end points (t end ) of each interval produced in the previous step (lines 48-52).The start point is set to the time point t (the time point from the set of interval end points we have paired with the interval) and the end point is the next larger time point associated with the same interval identifier (or the end point of the interval is no such time point exists).For example, for the first tuple from the result of subquery split points we output tuple (M1,1,4).As can be seen in the timeline representation of the result shown on the bottom right of Figure 11 in the result of split any two intervals associated with the same values of the non-temporal attributes are either equal or disjoint.

E.3 Combining Split with Temporal Aggregation
There is synergy in combining the split operator with temporal aggregation.The resulting implementation is similar to temporal aggregation algorithms which utilize end point indexes (e.g., aggregation over a timeline index [25]).These approaches calculate the result of an aggregation function over time using "sweeping" by sorting the endpoints of intervals on time and then scan over the data in sort order adding the values of tuples whose intervals start at the current point in time to the current aggregation result and subtract the values of tuples whose intervals end at this point in time.Note that this only can be applied to aggregation functions like sum and count where it is possible to retract a value (the underlying function, e.g., addition in the case of sum, has an inverse).For aggregation functions min and max it is necessary to maintain a list of previously seen values (although it is not necessary to keep all previous values [31]).We do not use the sweeping technique for min and max, but still apply the pre-aggregation optimization described below.We explain how to combine split with aggregation using the example query shown in Figure 13 which computes the average consumption (consum) of machines.
Pre-aggregation.For aggregation functions like sum and count that are commutative, associative, and where the underlying operation has an inverse, we can compute pre-aggregate the input data before computing split points.For that we group on the input query's group-by attribute plus the attributes tstart and t end which store the end points of a tuple's period.The pre-aggregation step return partial aggregation results for each list of group-by attribute values and period that occurs with this group.During split these periods may be further subdivided and the final aggregation results will be computed by accumulating results for these subdivisions.For aggregation functions like average that do not fulfill the conditions required for pre-aggregation, but which can be computed by evaluating an arithmetic expression over the result of other aggregation functions that do, we can still apply this trick to calculate the other aggregation functions and delay the computation of the aggregation we are actually interested in until the end.For example, the query shown in Figure 13 computes an average that can be computed as sum/count.Thus, as shown in lines 2-12 of Figure 12 we compute two aggregation functions grouping on mach, tstart, and t end .The example instance of table active contains two tuples belonging to the same group which also have the same period: (M1,10,1,5) and (M1,20,1,5).Based on these two tuples we compute the pre-aggregated result (M1,30,1,5).Note that no matter what aggregation function we are computing, we always will also compute count since it is needed later in the implementation to determine intervals without results for aggregation with group-by.
Calculate Increase and Decrease of Aggregation Values.Our approach for computing aggregation functions sum and count uses a sweeping technique which scans over the set of all interval end points paired with in time order.We keep a partial aggregation result and for each time point adds the values of the aggregation input attribute for tuples with intervals that open at this time point and "retracts" the values of aggregation input attributes for tuples with intervals that close at this time point.For this purpose, we aggregate to total increase (opening intervals) and retraction (closing intervals) for each time point and group.Consider lines 14-38 in Figure 12.Since we are computing aggregation functions sum and count, we store for each time point the increase/decrease for both functions.For that, we use attributes add c and dec c (count) and add s and dec s (sum).For interval start points we set attributes recording decrease to 0 while for end points we points we set the add * attributes to 0. Afterwards, we compute the total increase and decrease per time point using aggregation.For instance, consider time point 5 in the example shown in Figure 14.Two intervals with a total consumption of 30 close at this time point and one new interval opens with a consumption of 40.This is encoded in the third tuple (M1,1,40,2,30,5) in the result of subquery increase decrease.
Compute Accumulative Totals.We then calculate the aggregation function result for each group and each point in time where at least one interval for this group starts or ends as the sum of the increases up to and including this point in time and subtract from that the sum of decreases.For example, the third tuple in the result of subquery accumulation shows that at time 5 there are 2 open intervals with their consum values summing up to 80. Generate Output Intervals.Finally, we pair each split point and its count and sum with the following split point to produce output intervals and compute the average as the sum divided by the count.This is realized by the inner query of the subquery shown in lines 56-69 in Figure 12.Note that it may be the case that no periods start at a given split point.In this case the count would be 0 (no intervals open during between this time point and the next split point).This is dealt with by the WHERE clause of the outer query which filters out tuples where the count is 0.
Aggregation Without Group-by.Recall that for aggregation without group-by we have to return results for time periods where the relation is empty.This is easily achieved in our implementation by adding a dummy interval [Tmin, Tmax] associated with the neutral value of the aggregation function to the result of subquery pre agg (0 for count and null otherwise).For time periods where the input relation is empty the split operator creates an interval covering the "gap" and will return the value we did associate with the dummy interval which is chosen to correspond to the result of an aggregation over an input relation as defined in the SQL standard.For periods where the input is non-empty the result is not affected since the dummy interval is associated with the neutral value of an aggregation function.An additional change that is required is that the final WHERE clause (Figure 12, line 68) has to be changed to t end IS NOT NULL to (i) return results for gaps (where the count is 0) and not return a tuple where tstart is the last split point (equal to Tmax for the case of aggregation without group-by).

E.4 Combining Split with Difference
To explain the combined implementation of split with bag difference we evaluate the example query shown in Figure 16 under snapshot semantics.The query returns all machines and their consumption removing consumptions of machines which have been incorrectly recorded (table faulty).The SQL implementation for the snapshot version of this query which uses combined split and difference is shown in Figure 15.We show an example instance and intermediate results for the query in Figure 17.We combine the split operator with bag difference by reducing bag difference to the problem of count aggregation.Consider a snapshot at time T and tuple t and assume that t appears in the left input with multiplicity n and in the right input with multiplicity m at T .Then we have to return max(0, n − m) duplicates of tuple t for this snapshot.This can be achieved by computing counts for each interval end point in the left and in the right input and then subtracting the counts of the right hand side from the counts of the left hand side.The combination of counting and split essentially uses the approach described in Appendix E.3.
Computing Changes in Multiplicities.Subquery end point counts (Figure 15, lines 15-36) computes the number of opening and closing intervals for both inputs.We count the end points from the right input negatively.For instance, in the example the second tuple in the result of this subquery records that there are two opening intervals for tuple (M1, 40) at time 1.
Aggregate Multiplicities.Next, we use subquery acc counts aggregate the multiplicities to get a single count of opening and closing intervals per time point (lines 49-63).Note that in the result of this subquery both #open and # close may be negative.This has to be interpreted as that there is a larger number of opening/closing intervals from the right input than the left input.
Generating Intervals.We now pair adjacent time points (lines 49-63) to create intervals and compute the final multiplicity for each tuple.
Final Result.To compute the final result of the difference operator we have to create the right amount of duplicates for each tuple.The method we apply here is exactly the same as the one applied for aggregation: we join the result of subquery intervals with a table contain numbers 1 to n where n is the maximum multiplicity across all tuples and time points.

E.5 Coalesce after Split
We can also apply coalesce (introduced in Section E.1) after split (introduced in Section E.2), for example, we apply split the table active with the schema (mach, tstart, t end ) and apply coalesce afterwards, Figure 7

Figure 1 :
Figure 1: Snapshot semantics query evaluation -highlighted tuples are erroneously omitted by approaches that exhibit the aggregation gap (AG) and bag difference (BD) bugs.

Example 3 . 3 (
Implementation).Consider the SQL period relations shown on the top of Figure2.Each interval-annotation pair of a temporal N-element in the logical model is encoded as a separate tuple in the implementation.For instance, the annotation of tuple (Ann, SP ) from the logical model is encoded as two tuples, each of which records one of the two intervals from this annotation

Example 5 . 2 .
Reconsider the temporal N-element T1 from Example 5.1.Recall that intervals not shown are mapped to 0. The N-elements shown below are snapshot-equivalent to T1.

Figure 3 :
Figure 3: Example period multiset relation S and temporal Nelements encoding the history of tuples.

Figure 4 :
Figure 4: Rewriting REWR that reduces queries over NT to queries over a multiset encoding produced by PERIODENC.
Using k ∼ k and substituting the definition of ∼, i.e., T ∈I k(I) = T ∈I k (I), we get: I) +K k (I))

Proof of Theorem 6 . 3 .
Proven by substitution of definitions: Preserves neutral elements: CK (τT (0K T )) = I∈I:T ∈I 0K T (I) = I∈I:T ∈I 0K = 0K τT (1K T ) = I∈I:T ∈I 1K T (T ) Since T ∈ [Tmin, Tmax) for any T ∈ T and 1K T (I) = 0K for any interval I except for [Tmin, Tmax) where 1K T ([Tmin, Tmax)) = 1K we get I∈I:T ∈I 1K T (T ) = 1K Commutes with addition: be a NT -relation and R denote PERIODENC(R).Consider an arbitrary tuple t and let T denote the temporal element associated with t, i.e., R(t) = T .Consider any interval I ∈ I and let nI = T (I) (the multiplicity assigned by T to I).According to Definition 8.1, this implies that tuple tI = (t, I + , I − ) is annotated with nI .Let Tt denote the temporal element assigned by PERIODENC −1 to t.By construction Tt(I) = nI = T (I).PERIODENC(Q(R)) = REWR(Q)(PERIODENC(R)): We prove this part by induction over the structure of a query.Let R = PERIODENC(R).Base case: Assume that Q = R for some relation R. The claim follows immediately from REWR(R) = R. Induction Step: Assume the claim holds for queries with up to n operators.We have to prove the claim for any query Q with n + 1 operators.For unary operators, WLOG let Q = op(Qn) for an operator op and query Qn with n operators and let Q = REWR(Q).

Table 1 :
Interval-based approaches for snapshot semantics.

Table 2 :
Number of query result rows

C
Πname,city,A begin ,A end Π name,age,pN ame,address,max(person.A begin ,livesIn.A begin ),min(person.A end ,livesIn.A end ) name=pN ame∧person.A begin <livesIn.A end ∧livesIn.A begin <person.A end Π name,age,pN ame,address,aId,city,zip,street address=aId∧person.A begin <livesIn.A end ∧livesIn.A begin <person.A end shows this workflow.Count the open intervals per tuple and time point 26 num_intervals (mach, #open, t) AS Compute the difference between the number of open 37 --intervals at t and at the previous change point 38 diff_previous (mach, #open, diffPrevious, t) AS Pair each change point with the following change point 55 pair_points (mach, #open, tstart, t end ) AS Create the right number of duplicates for each tuple 74 SELECT mach, tstart, t end 75 FROM pair_points p, max_seq s 76 WHERE p.#open >= s.nFigure 8: Applying the SQL implementation of bag coalescing to the example table active.Figure 9: Example database and intermediate results of the query implementing bag coalescing for table active.23Figure17: Example instance of table active and intermediate results of the query implementing split + bag difference for the query from Figure 16