Post by Jake Bluford
On Mon, 30 Nov 2015 22:25:23 +0100, Marcel Mueller
Post by Marcel Mueller
What are the requirements that need a relational data model?
Well, just to think up an example that applies to many business
applications, imagine something like an account with an order that has
a contact.
Order #1234 was placed in the year 2010 by CompanyXYZ in Omaha,
Nebraska; the principal contact is John Smith, phone number 555-6666.
That order was for five WidgetTypeA products.
If you order more than one item at once, the model is no longer
relational: now it contains a list of order items. Creating a new entity
for order items and repeating the order ID over and over for them is the
workaround, not part of the requirement.
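To make this concrete, here is a minimal sketch of such a nested order.
The class and field names are made up for illustration; they are not our
real model.

  class Order
  {
      public int OrderId;             // key
      public int CompanyId;           // still a foreign key to the Company entity
      public string State;            // e.g. "Nebraska" at the time of the order
      public List<OrderItem> Items;   // part of the order itself, not a separate entity
  }

  class OrderItem
  {
      public int Article;             // foreign key to the Article entity, e.g. WidgetTypeA
      public int Quantity;            // e.g. 5
  }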
Post by Jake Bluford
Many other customers ordered WidgetTypeA. But unfortunately, it is
realized that WidgetTypeA is somehow defective, and needs to be
recalled as a matter of life or death, or perhaps it is deemed to
violate some state law that applies in California only.
How do we identify all orders for WidgetTypeA in California?
Place a query against the orders with an appropriate .Where condition,
e.g.
  Query<Orders>().Where(o => o.Articles.Contains(WidgetTypeA)
                          && o.State == "California")
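Against the little sketch above, that query could be spelled out like
this (widgetTypeA standing for the article's ID, purely for
illustration):

  var recalls = Query<Order>()
      .Where(o => o.Items.Any(i => i.Article == widgetTypeA)
               && o.State == "California");

The .Articles.Contains form is the same thing once Articles is defined
as a computed property over Items, which is what the index discussion
below is about.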
Post by Jake Bluford
Without referential integrity that provides a foreign key reference to
the NEW company as it exists in 2015, we have no real link to it; they
used to be in Nebraska.
Of course, companies, articles and orders are /different/ entities, and
foreign keys are still allowed and required. But orders, order items and
maybe multiple schedules are parts of the same object.
Post by Jake Bluford
If the order detail information is buried in some
XML, JSON or similar document hierarchy, drilling down into each
document to run queries to find this kind of information (something
that customer support has to be able to do in real time on a minute by
minute basis) is extremely computationally painful with large amounts
of data, near impossible to keep indexed for query execution speed.
Even indexes are still allowed. In this case an index on the articles in
the Items array is good advice. In fact a computed property Articles
might refer to Items.Select(i => i.Article), and the index might apply
to .Articles.
Of course, the index types are no longer that simple. You may have
comparable indexes, which can access ranges but require the underlying
type to be comparable; this is basically the same as database indexes.
But you may also have hash indexes that only support equality; they can
be amazingly fast. And you may have array indexes that can speed up
.Contains or .Overlaps. In the case of articles a hash index probably
fits best, since it usually does not make much sense to apply relational
operators to object IDs.
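Just to illustrate the idea (not our actual implementation; allOrders
and widgetTypeA are placeholders): the computed property, and what a
hash index over it conceptually does.

  // Computed property on the Order class from the sketch above.
  public IEnumerable<int> Articles => Items.Select(i => i.Article);

  // A hash index over .Articles is conceptually a dictionary from
  // article ID to the orders containing it, so .Contains(widgetTypeA)
  // becomes a lookup instead of a scan over all orders.
  var articleIndex = new Dictionary<int, List<Order>>();
  foreach (var order in allOrders)
      foreach (var article in order.Articles.Distinct())
      {
          if (!articleIndex.TryGetValue(article, out var list))
              articleIndex[article] = list = new List<Order>();
          list.Add(order);
      }

  // Equality lookup only - exactly what a hash index supports.
  var candidates = articleIndex.TryGetValue(widgetTypeA, out var hits)
      ? hits
      : new List<Order>();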
Post by Jake Bluford
There were some other nightmare scenarios I was going to mention about
changing contacts and phone numbers, and what if the SKU# for
WidgetTypeA changes over time, etc. but I think you get the idea.
In our case the data core engine has the complete history of all changes
so far. This could be cleaned up later, but for now it is complete. Of
course, this is just the option we took. Each time an object is saved, a
new revision time stamp is created; it is part of the key. Objects
saved in the same transaction share the same time stamp; these are the
change sets. For each user request a snapshot of the entire database is
taken. This is quite simple: it is just DateTime.Now. The engine only
returns versions that were valid at this time stamp. All later changes
are invisible to this request, and all current changes are rejected if
they do not modify the most recent version (optimistic locking).
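A rough sketch of the read and write rules (heavily simplified, names
made up):

  class ObjectVersion
  {
      public int Id;                // object identity
      public DateTime ValidFrom;    // revision time stamp, part of the key
      public object Data;           // immutable payload once saved
  }

  // Snapshot read: the newest version that was already valid at the
  // time stamp taken when the request started (DateTime.Now back then).
  ObjectVersion Read(IEnumerable<ObjectVersion> history, int id, DateTime snapshot)
      => history.Where(v => v.Id == id && v.ValidFrom <= snapshot)
                .OrderByDescending(v => v.ValidFrom)
                .FirstOrDefault();

  // Optimistic locking: a save is rejected unless it is based on the
  // most recent version; otherwise a new version is appended.
  bool TrySave(List<ObjectVersion> history, ObjectVersion basedOn, object newData)
  {
      var latest = history.Where(v => v.Id == basedOn.Id)
                          .OrderByDescending(v => v.ValidFrom)
                          .First();
      if (latest.ValidFrom != basedOn.ValidFrom)
          return false;             // someone else saved in between
      history.Add(new ObjectVersion { Id = basedOn.Id,
                                      ValidFrom = DateTime.Now,
                                      Data = newData });
      return true;
  }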
Once saved, the object version becomes immutable. And since all requests
operate on a snapshot slightly in the past, they only access immutable
objects. Immutable objects are inherently thread safe, so all objects in
memory are shared between all parallel requests. This saves a lot of
memory when you have concurrent access to the same data from different
users or parallel tasks. It also improves CPU cache efficiency, and
read-only memory access is not sensitive to cache line bouncing on
multi-core hardware.
Furthermore, the data can be compressed at different layers. First of
all, in-memory objects and strings can be deduplicated. E.g. a foreign
key instance always takes only one machine word in .NET (32 or 64 bit)
regardless of how complex it is: all references point to the same
multiton. The same applies to many attribute values. E.g. document
types, often stored as strings, share the same string storage. This
shrinks the memory footprint by about a factor of 5 in our case.
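The multiton idea in a nutshell (again only a sketch, not our code):
every distinct key value maps to exactly one shared instance, so each
reference costs one machine word and the key data itself is stored once.

  class ForeignKey
  {
      static readonly ConcurrentDictionary<string, ForeignKey> Cache =
          new ConcurrentDictionary<string, ForeignKey>();

      // One shared instance per distinct value.
      public static ForeignKey Get(string value) =>
          Cache.GetOrAdd(value, v => new ForeignKey(v));

      public string Value { get; }
      private ForeignKey(string value) { Value = value; }
  }

  // Two orders referring to the same company share one instance:
  // ReferenceEquals(ForeignKey.Get("CompanyXYZ"),
  //                 ForeignKey.Get("CompanyXYZ"))   // true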
The history can also be compressed. Usually only a few properties change
on each save, so the persistent storage only needs to keep the most
recent version in full. The history entries need only contain delta
frames with backups of the overwritten properties. On deserialization
the most recent version has to be read first and then the delta frames
applied. Since access to an outdated version of an object that is not
currently in the memory cache (from the previous write) is very
unlikely, this has no significant impact. One could store "I frames"
from time to time to speed up access to old historical data. This saves
about a factor of 3 on disk, depending on the average number of changes
per object.
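As a sketch, reconstructing an old state then boils down to starting
from the full most recent version and walking backwards through the
deltas (each delta holding the property values it overwrote):

  Dictionary<string, object> Reconstruct(
      Dictionary<string, object> mostRecent,               // full current version
      IEnumerable<Dictionary<string, object>> deltasNewestFirst)
  {
      var state = new Dictionary<string, object>(mostRecent);
      foreach (var delta in deltasNewestFirst)   // back in time, revision by revision
          foreach (var kv in delta)
              state[kv.Key] = kv.Value;          // restore the overwritten value
      return state;
  }

An "I frame" is then just a full snapshot stored in between, so the walk
can start there instead of at the most recent version.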
In fact the data core turned out to be roughly 10 times faster than the
SAP R/3 system with an Oracle backend. And it is about 3 orders of
magnitude faster than a PHP application on a WAMP platform that has
almost the same flexible data model as ours but stores it in a
transposed table. By the way, the server is a dual-core VM with AFAIR
8GB memory and a CPU load almost always <5%; the PHP machine, in
contrast, has 16 cores and 16GB memory, and several times a day all 16
cores are at 100%. The number of users is the same, since most of them
use both applications at once.
Post by Jake Bluford
Bottom line is that RDBMS were designed to store data efficiently,
allow referential integrity (solving the problem above),
Referential integrity is not only a property of RDBMS.
Post by Jake Bluford
facilitate
efficient indexing/queries,
Well, this point is an issue. SQL optimizers are quite smart nowadays,
especially Oracle's (MSSQL's a bit less so).
With the power of LINQ you can easily run into expressions that are no
longer understood by the optimizer. This causes either bad runtime
behavior or, in the case of EF, an exception (I hate these exceptions).
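A typical case with classic EF (assuming a DbContext named context with
an Orders set): any local helper inside the predicate cannot be
translated to SQL and blows up at runtime.

  static bool IsCalifornian(Order o) => o.State == "California";

  var bad = context.Orders.Where(o => IsCalifornian(o)).ToList();
  // NotSupportedException: LINQ to Entities does not recognize the method...

  var ok = context.Orders.Where(o => o.State == "California").ToList();
  // translates to a plain SQL WHERE clause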
Post by Jake Bluford
reduce redundancy of storage (not just for
the storage space cost but the efficiency of querying non-redundant
data).
This is another point for document databases. Oracle and MSSQL have 8k
pages, so basically almost all disk I/O is performed with an 8k block
size. The SAN statistics show another peak at 64k (probably large
pages), but this one is at least one order of magnitude lower. Given
storage based on 15k HDDs, you get a throughput of about 200 IOPS at
1.6 MB/s per spindle - very bad; even my first 486 computer could read
at 6 MB/s from its HDD.
A document-based DB tends to read larger blocks, often including
unnecessary data. But the disks do not care as long as the average block
size stays below the critical size of a few MB. The number of IOPS is
almost unchanged, but you do not need as many of them for the same data.
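The arithmetic behind that is simple:

  200 IOPS *   8 KB = ~1.6 MB/s per spindle
  200 IOPS * 256 KB = ~50 MB/s per spindle (256 KB is just an example figure)

The same number of seeks moves far more data once the average block gets
larger.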
Only with fully flash-based storage does the situation change: SSDs can
deal much better with small blocks.
Post by Jake Bluford
Maybe my understanding of these other systems is wrong, but I think in
those cases all of the data is stored with the document. Fundamentally
it can never provide the benefits of storing a tiny key to the real
data, from query speed to being able to update in a single location
and have the update instantly applied everywhere (without having to
actually do an expensive update or write operation).
There is no major difference with respect to these topics, as mentioned
above.
Marcel