How much storage can the cluster support?
We have successfully tested the operation of a cluster with more than 10TB of storage, including a single table larger than 1TB (over 10,000,000,000 rows). This configuration requires more than 64GB of heap memory to launch one node.
Is it an in-memory database or not?
No. The concept of the cluster is to deliver performance over as much storage as possible within a limited amount of heap memory; the cluster makes the most efficient use of all the heap memory provided to improve performance where possible.
Are DDL operations supported?
No, and none are planned. Conceptually, the internal data structure relies on the JPA-annotated fields of the source classes.
How can I add a field to a table or delete it?
Once an @Entity class has been registered, the set and order of its fields marked with the @Column annotation cannot be changed. We do not currently support any automatic procedure that scans for changes in annotated fields and brings the persistent data structure in line with them. A workaround is to register a new table under a different name with the additional fields and copy the data over from the old table, as sketched below.
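For illustration, a minimal sketch of that workaround, under the assumption that interference reuses the standard javax.persistence annotations; the class and field names here are hypothetical.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

// Originally registered class: its @Column set and order are now fixed.
@Entity
public class SensorEvent {
    @Id
    @Column
    private long id;

    @Column
    private double value;
    // getters/setters omitted
}

// In a separate source file: a new class under a different name with the extra field;
// data is then copied over from SensorEvent through the session API.
@Entity
public class SensorEventV2 {
    @Id
    @Column
    private long id;

    @Column
    private double value;

    @Column
    private long createdAt; // the newly added field
    // getters/setters omitted
}
```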
Are unique constraints supported?
Yes, the @Id annotation automatically guarantees the uniqueness of the inserted value. In cases where uniqueness can be neglected and insertion speed matters more, you can mark the field with the @NoCheck annotation instead: no checks are performed on insertion, which in particular allows you to avoid indexes and get maximum insert performance (see the CEP questions further below). A hedged illustration follows.
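A sketch of the trade-off between @Id and @NoCheck. The javax.persistence imports are an assumption, and the package of @NoCheck is not given in this FAQ, so the annotation is only indicated in a comment.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;
// import for @NoCheck omitted: its package is not specified in this FAQ

@Entity
public class Metric {
    // With plain @Id, uniqueness is checked on every insert (an index is recommended, see below).
    // Adding @NoCheck here would skip that check and the index lookup,
    // trading the uniqueness guarantee for maximum insert throughput.
    @Id
    @Column
    // @NoCheck
    private long id;

    @Column
    private double value;
}
```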
Are indexes supported?
Yes, indexes are supported. Moreover, it is highly recommended to create an index on the field with the @Id annotation (otherwise insertion suffers an extremely severe performance drop due to full table scans) and on the fields used to join tables in queries. Keep in mind, however, that a large number of indexes on a table decreases insert performance.
Are foreign keys supported?
Not currently supported.
Are transactions supported?
Yes, transactions are supported for all DML operations. A transaction starts automatically; its completion is controlled by the commit and rollback commands of the session API (a hedged sketch follows). Queries always read data in READ COMMITTED mode.
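A minimal sketch of transaction boundaries through the session API. commit and rollback are the commands named above; the Session import, the way the session is passed in, the persist() call, and the Account entity are assumptions made for illustration.

```java
// Import of the interference Session class omitted: its package is not given in this FAQ.

public class TransactionSketch {
    // 'from' and 'to' are instances of a hypothetical registered @Entity class.
    static void transfer(Session s, Account from, Account to, long amount) {
        try {
            from.setBalance(from.getBalance() - amount);
            to.setBalance(to.getBalance() + amount);
            s.persist(from);   // DML method name assumed
            s.persist(to);
            s.commit();        // the transaction started implicitly with the first operation
        } catch (Exception e) {
            s.rollback();      // discard the uncommitted changes
        }
    }
}
```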
What are JPA annotations used for? Is it ORM or am I not understanding something?
JPA annotations are used to tell the interference machinery about the persistent structure of a class. If you think of interference as a kind of ORM, then it is a self-sufficient, direct-to-disk ORM.
Are table join queries supported?
Yes, joins are supported. Depending on the sizes of the tables being joined and the availability of indexes on the join fields, the join conditions can be applied with various methods: nested loops, right index scan, right hash join, or merge join. A hedged example follows.
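For illustration, a join query over two hypothetical tables; the table names, the query syntax, and the execute() call are assumptions and should be checked against the documentation.

```java
// With an index on Customer.id, the planner can pick e.g. a right index scan
// instead of nested loops over full scans.
String sql = "select o.orderId, c.name"
           + " from org.example.entity.CustomerOrder o, org.example.entity.Customer c"
           + " where o.customerId = c.id";
Object result = s.execute(sql);   // return type depends on the session API
```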
What are the pros and limitations of using a cluster configuration?
Conceptually, the interference cluster was originally designed around two requirements: the ability to insert data from any node and the ability to run a distributed query from any node. This gave rise both to certain advantages of the solution and to limitations, which we try to cover below.
How is data replicated in a cluster?
Inserted data is immediately replicated to all available nodes. If a node is temporarily unavailable, a persistent queue for it is kept on the originating node and is replicated as soon as the node becomes available again.
How do distributed queries work?
By default, all queries are distributed, i.e. executed in parallel on all available nodes. For a number of queries, however (for example, select * from table where id = value on a medium-sized table with an index), this path may not be optimal, because the cost of inter-node communication can exceed the cost of executing the query on a single node. It is therefore recommended to choose the distributed method only for queries that scan large tables or indexes, or that join such scans.
Does a remote node crash during distributed query execution affect its result?
No, the query will be executed in any case, even if every node except the current one (the local query node) fails; for heavy queries the execution time will simply increase accordingly.
How do transactions in a cluster work?
They are completely transparent. The node that initiates a transaction must confirm or cancel it. For other nodes, the data becomes visible only after the transaction is confirmed.
How do data changes and deletions in a cluster work?
There are some limitations: the persistent interference model is constructed so that the node on which a portion of data was inserted subsequently owns that data, in the sense that any change or deletion of it can be made only from that node. This is done to avoid distributed locks, which would reduce performance, but it in turn restricts changes and deletions of data to a single node.
How is CEP implemented?
Conceptually, the implementation of CEP is simple and rests on two ideas: any table is also a persistent queue, and a select stream query can be run to read data from this queue (including conditions and grouping), as sketched below.
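A hedged sketch of a CEP query: the select stream form with a condition and grouping follows the description above, while the entity name, exact syntax, and the execute() call are assumptions.

```java
// Every table is also a persistent queue; a stream query keeps returning
// newly inserted rows (in insertion order) that match the condition.
String cep = "select stream sensorId, avg(value)"
           + " from org.example.entity.SensorEvent"
           + " where value > 100"
           + " group by sensorId";
Object stream = s.execute(cep);   // how results are consumed depends on the session API
```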
Is data transfer order guaranteed?
Yes. As said above, any table is a queue, and a stream query reads its data in the order in which it was inserted. This is guaranteed by the persistent interference mechanism and does not depend on the ordering of values in the primary key (@Id) field.
How do I connect to a running instance?
There are two ways to use interference. The first is the session API, described in detail in the documentation and covered by the test examples; it lets you do everything necessary: register data classes, perform DML operations, execute queries, and manage transactions (a hedged sketch follows). The second is direct access to persistent data. The difference between the two approaches is also discussed in the documentation.
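A minimal end-to-end sketch of the session API path. All method names here (getSession, registerTable, persist, execute) are illustrative assumptions; the documentation and test examples show the actual calls.

```java
// Import of the interference Session class omitted: its package is not given in this FAQ.

Session s = Session.getSession();                     // obtain a session (factory name assumed)
s.registerTable("org.example.entity.SensorEvent");    // register a data class (name/signature assumed)

SensorEvent e = new SensorEvent(1L, 42.0);            // hypothetical entity
s.persist(e);                                         // DML through the session
s.commit();                                           // transaction control

Object rs = s.execute(                                // query execution
    "select id, value from org.example.entity.SensorEvent where value > 10");
```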
Is JDBC supported?
No, and not planned. Interference is not a standalone database.
Is it possible to connect any third-party consumer to read events?
In the near future, we plan to release a light version of interference capable of working in consumer-only mode, without creating a physical database, intended exclusively for CEP event processing.
Can I send events to a specific node?
No, because this contradicts the basic concept - data should be available to each node of the cluster.
What are the differences between reading from a queue and reading from a table?
There is no difference at the persistent-storage level. The difference is that SELECT STREAM supports DIRTY READS, i.e. it reads physically added records regardless of whether session.commit() has been called.
This is done to avoid creating a huge number of transactions where they are not needed. However, keep in mind that a periodic commit() on the inserting node is still necessary for frame recycling; a hedged sketch follows.
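An illustration of the periodic commit on the inserting node: the batch size, the event source, and the persist() call are assumptions; the point is only that commit() must be called from time to time so frames can be recycled.

```java
// Runs on the node that inserts the events.
int batch = 0;
for (SensorEvent e : incomingEvents) {       // hypothetical event source
    s.persist(e);                            // method name assumed
    if (++batch % 1000 == 0) {
        s.commit();                          // periodic commit enables frame recycling
    }
}
s.commit();                                  // commit the tail of the batch
```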
How is the queue freed from unnecessary data?
There are two options, both sketched below. The first is the @Threshold annotation at the table level, whose value is the number of frames to keep current; once it is exceeded, old frames are deleted starting from the first. This option suits a queue that is periodically processed with group queries and never has records deleted explicitly (it works at the cluster level). The second is session.purge(), which explicitly and physically removes a record from the queue. This option suits mass processing of events, to prevent records from being read again; it works only on the local node where the data was inserted, and attempting to execute purge() on any other node will throw an exception.
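Two hedged sketches of the cleanup options. The @Threshold element form and package are not given in this FAQ, the javax.persistence annotations are assumed, and purge() is shown with an assumed signature; only the behavior described above is intended.

```java
// Option 1: table-level retention. The value is the number of frames to keep;
// once exceeded, the oldest frames are deleted (works at the cluster level).
// Imports for @Entity/@Id/@Column (javax.persistence assumed) and @Threshold omitted.
@Threshold(100)   // element form assumed
@Entity
public class ClickEvent {
    @Id @Column private long id;
    @Column private String page;
}

// Option 2: explicit physical removal after processing; valid only on the inserting node,
// purge() on any other node throws an exception.
// for (ClickEvent processed : handledEvents) {
//     s.purge(processed);   // argument/signature assumed
// }
```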
How to understand that a record has already been read from the queue?
Conceptually, a stream query is designed to read each record from the queue once. When restarted, it again sees all entries in the queue except those that were explicitly deleted. So, if a record has been processed in the current session and is no longer needed, delete it with purge(); remember that purge() works only on the node where the insert was performed.
What CEP semantics are supported?
From the above it follows that, depending on the combination of insert and read options, at-least-once and exactly-once semantics are supported.
Is it possible to execute a query on a specific remote node?
Currently, there are two options for query execution: strictly local and distributed across all nodes of the cluster. If you can justify the need to execute a query on a particular remote node, we may implement it.