Discussion:
Do you use Entity Framework?
Jake Bluford
2015-11-30 02:00:31 UTC
Just a quick poll, how many of you are using Entity Framework for your
data access?

If not EF, what do you use (straight ADO.Net, nHibernate etc?)
Arne Vajhøj
2015-11-30 03:28:08 UTC
Post by Jake Bluford
Just a quick poll, how many of you are using Entity Framework for your
data access?
If not EF, what do you use (straight ADO.Net, nHibernate etc?)
For ORM I prefer NHibernate over EF but occasionally my preference does
not matter.

Not everything is suitable for ORM so also ADO.NET.

So I guess you can count me as using all 3.

:-)

Arne
Richard Maher
2015-11-30 12:47:43 UTC
Post by Jake Bluford
Just a quick poll, how many of you are using Entity Framework for your
data access?
If not EF, what do you use (straight ADO.Net, nHibernate etc?)
NO! And I am sick to death of Microsoft defaulting all of this
cargo-cult crap (like JfuckingQuery) into my project :-(
Marcel Mueller
2015-11-30 19:29:25 UTC
Post by Jake Bluford
Just a quick poll, how many of you are using Entity Framework for your
data access?
Sometimes. It is useful to access simple data models that map easily to
.NET POCOs.
I used to argue that code first is the best way. Now I find that whenever I
can use code first, the object model does not fit well into a relational DB
model anyway, so EF is not much help.
On the other hand, when I have to access legacy data models, DB first is
sometimes useful.
Post by Jake Bluford
If not EF, what do you use (straight ADO.Net, nHibernate etc?)
A custom LINQ provider with in-memory computation, deduplication and no
1:1 mapping to SQL types (too complicated, no benefit).
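To give a rough idea of the direction (only a sketch with invented names, not
our actual provider): when the working set lives in memory anyway, the query
entry point can simply hand out LINQ-to-Objects queryables instead of
translating expression trees to SQL.

// Minimal sketch: queries run in memory via LINQ-to-Objects, no SQL translation.
using System;
using System.Collections.Generic;
using System.Linq;

class InMemoryContext
{
    // One collection per entity type; a real engine would use a versioned cache instead.
    private readonly Dictionary<Type, IEnumerable<object>> store =
        new Dictionary<Type, IEnumerable<object>>();

    public void Register<T>(IEnumerable<T> items)
    {
        store[typeof(T)] = items.Cast<object>();
    }

    public IQueryable<T> Query<T>()
    {
        // The expression tree is executed by LINQ-to-Objects, not compiled to SQL.
        return store[typeof(T)].Cast<T>().AsQueryable();
    }
}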

From my point of view SQL is outdated. ORM is just a workaround to keep SQL
databases alive with a virtualization layer in between. And like all
workarounds it has serious drawbacks.

Seen holistically, it makes absolutely no sense to compile a LINQ expression
into SQL just to pass it to another compiler which transforms the SQL back
into an execution tree. As soon as there is a reasonable standard for
communicating execution trees, SQL is obsolete.

NoSQL databases will probably outperform SQL sooner or later and leave SQL
databases behind as legacy products. But for now there are still open
issues, e.g. regarding maintainability, backup etc.
So our provider still uses an SQL backend too. But except for some metadata
the data is just BLOBs or XML, so a transactional file system would do the
job as well.


Marcel
Jake Bluford
2015-11-30 20:29:34 UTC
On Mon, 30 Nov 2015 20:29:25 +0100, Marcel Mueller
Post by Marcel Mueller
[...]
So our provider still uses an SQL backend too. But except for some meta
infos the data is just BLOB or XML. So a transactional file system would
do the job as well.
Based on your last paragraph, I'm curious what types of applications
you typically work on (where the data is mostly just a BLOB or XML)?

So many of the applications I've been involved with rely heavily on
relational data, where it seems to me noSQL is basically like a
content repository. Admittedly my noSQL exposure has been very
limited, so maybe I'm wrong about that. But given what I know about
it, I have trouble envisioning it as a replacement for an RDBMS.

Now, for folks that don't work with line-of-business apps that depend
fundamentally on relational data models, I can understand the desire not to
deal with the complexity of SQL. But as of yet I don't see how noSQL can
ever really replace it.
Marcel Mueller
2015-11-30 21:25:23 UTC
Post by Jake Bluford
Post by Marcel Mueller
Probably noSQL databases will outperform SQL sooner or later and leave
the SQL databases as legacy products behind. But for now there are still
open issues e.g. regarding maintainability, backup etc.
So our provider still uses an SQL backend too. But except for some meta
infos the data is just BLOB or XML. So a transactional file system would
do the job as well.
Based on your last paragraph, I'm curious what types of applications
you typically work on (where the data is mostly just a BLOB or XML)?
Business applications: promotion data, plant data, purchase orders and so
on. Basically everything that would fit into a document database.
Post by Jake Bluford
So many of the applications I've been involved with rely heavily on
relational data, where it seems to me noSQL is basically like a
content repository.
Well, if an application was designed to work with relational data then a
relational database will fit perfectly, of course.

But if I start from the requirements and build a matching object model, it
will probably not turn out to be relational. The objects will more likely
correspond to business objects like a purchase order or (rather complex)
article data with many sub-entities. From the point of view of an OO
language it makes absolutely no sense to blow up the entire data model into
dozens of flat entities with foreign keys to their header and maybe other
sub-entities in between.
Post by Jake Bluford
Admittedly my noSQL exposure has been very
limited, so maybe I'm wrong about that.
Well, I built my first document-based database in the '80s on the Motorola
68k architecture, although I did not know anything about noSQL at the time,
and nothing about SQL either. It was just the natural solution to the
problem that came to mind. I was able to provide search-as-you-type without
any noticeable latency (thanks to trie structures).
Post by Jake Bluford
But given what I know about
it, I have trouble envisioning it as a replacement for an RDBMS.
It will not /replace/ an RDBMS. It is another solution for other tasks.
It makes as little sense to put object data into an RDBMS as the other
way around.
Post by Jake Bluford
Now, folks that don't work with line-of-business apps that depend
fundamentally on relational data models, I can understand the desire
to not deal with the complexity of SQL. But as of yet I don't see how
noSQL can ever really replace it.
What are the requirements that need a relational data model?

From my point of view a relational data model is just a small subset of
object-oriented data models, one where all object properties are restricted
to scalar types. An arbitrary restriction.


Marcel
Jake Bluford
2015-11-30 22:40:11 UTC
On Mon, 30 Nov 2015 22:25:23 +0100, Marcel Mueller
Post by Marcel Mueller
What are the requirements that need a relational data models?
Well, just to think up an example that applies to many business
applications, imagine something like an account on an order that has a
contact.

Order #1234 was placed in the year 2010 by CompanyXYZ in Omaha
Nebraska, principal contact is John Smith, phone number is 555-6666.
That order was for five WidgetTypeA products.

Over the course of 10 years, the company changes its name a few times, the
principal contact changes to Frank Jones, the phone number changes to
888-9999, and eventually they relocate all their offices to Los Angeles,
California in 2015. Although they have placed many orders since, they have
not ordered a WidgetTypeA since the 2010 order mentioned above.

Many other customers ordered WidgetTypeA. But unfortunately, it is
realized that WidgetTypeA is somehow defective, and needs to be
recalled as a matter of life or death, or perhaps it is deemed to
violate some state law that applies in California only.

How do we identify all orders for WidgetTypeA in California? Without
referential integrity providing a foreign key reference to the NEW company
as it exists in 2015, we have no real link to it; they used to be in
Nebraska. If the order detail information is buried in some XML, JSON or
similar document hierarchy, drilling down into each document to run queries
for this kind of information (something that customer support has to be able
to do in real time, on a minute-by-minute basis) is extremely computationally
painful with large amounts of data and nearly impossible to keep indexed for
query execution speed.

There were some other nightmare scenarios I was going to mention about
changing contacts and phone numbers, and what if the SKU# for
WidgetTypeA changes over time, etc. but I think you get the idea.

The bottom line is that RDBMSes were designed to store data efficiently,
allow referential integrity (solving the problem above), facilitate
efficient indexing/queries, and reduce redundancy of storage (not just for
the storage space cost but for the efficiency of querying non-redundant
data).

Maybe my understanding of these other systems is wrong, but I think in
those cases all of the data is stored with the document. Fundamentally
it can never provide the benefits of storing a tiny key to the real
data, from query speed to being able to update in a single location
and have the update instantly applied everywhere (without having to
actually do an expensive update or write operation).
Marcel Mueller
2015-12-01 01:40:33 UTC
Post by Jake Bluford
On Mon, 30 Nov 2015 22:25:23 +0100, Marcel Mueller
Post by Marcel Mueller
What are the requirements that need a relational data models?
Well just to think up a example that applies to many business
applications, imagine something like an account on an order that has a
contact.
Order #1234 was placed in the year 2010 by CompanyXYZ in Omaha
Nebraska, principal contact is John Smith, phone number is 555-6666.
That order was for five WidgetTypeA products.
If you order more than one item at once the model is no longer
relational. Now it contains a list of order items. Creating a new entity
for order items and repeating the order ID over and over for them is the
workaround, not part of the requirement.
Post by Jake Bluford
Many other customers ordered WidgetTypeA. But unfortunately, it is
realized that WidgetTypeA is somehow defective, and needs to be
recalled as a matter of life or death, or perhaps it is deemed to
violate some state law that applies in California only.
How do we identify all orders for WidgetTypeA in California?
Place a query against the orders with an appropriate .Where condition, e.g.
Query<Order>().Where(o => o.Articles.Contains(WidgetTypeA)
                       && o.State == "California")
Post by Jake Bluford
Without
referential integrity that provides foreign key reference to the NEW
company as it exists in 2015, we have no real link to it, they used to
be in Nebraska.
Of course, companies, articles and orders are /different/ entities and
foreign keys are still allowed and required. But orders and order items
and maybe multiple schedules are parts of the same object.
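As a sketch of what I mean (invented names, heavily simplified): the order is
one aggregate; references to other entities stay keys, but the items live
inside it.

using System;
using System.Collections.Generic;

// Sketch of an order aggregate: items are part of the order object itself,
// while companies and articles remain separate entities referenced by key.
class Order
{
    public Guid Id;
    public Guid CompanyId;                 // foreign key to the Company entity
    public string State;                   // e.g. "California"
    public List<OrderItem> Items = new List<OrderItem>();

    // Computed view used for querying/indexing: all articles contained in this order.
    public IEnumerable<Guid> Articles
    {
        get { foreach (var item in Items) yield return item.ArticleId; }
    }
}

class OrderItem
{
    public Guid ArticleId;                 // foreign key to the Article entity
    public int Quantity;
}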
Post by Jake Bluford
If the order detail information is buried in some
XML, JSON or similar document hierarchy, drilling down into each
document to run queries to find this kind of information (something
that customer support has to be able to do in real time on a minute by
minute basis) is extremely computationally painful with large amounts
of data, near impossible to keep indexed for query execution speed.
Even indexes are still allowed. In this case an index on the articles in the
Items array is good advice. In fact a computed property Articles might refer
to Items.Select(i => i.Article), and the index might apply to .Articles.
Of course, the index types are no longer that simple. You may have
comparable indexes which can handle ranges but require the underlying type
to be comparable; this is basically the same as database indexes. But you
may also have hash indexes that only support equality; they can be amazingly
fast. You may also have array indexes that can speed up .Contains or
.Overlaps. In the case of articles a hash index probably fits best, since it
usually makes little sense to apply relational operators to object IDs.
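For illustration only, building on the Order sketch above: a hash index over
the articles of all orders can be as simple as a dictionary from article ID
to the IDs of the orders containing it.

using System;
using System.Collections.Generic;
using System.Linq;

// Hash index: article id -> ids of the orders containing it (equality lookups only).
class ArticleIndex
{
    private readonly Dictionary<Guid, HashSet<Guid>> index =
        new Dictionary<Guid, HashSet<Guid>>();

    public void Add(Order order)
    {
        // Index the computed Articles property, i.e. Items.Select(i => i.ArticleId).
        foreach (var article in order.Articles)
        {
            HashSet<Guid> orders;
            if (!index.TryGetValue(article, out orders))
                index[article] = orders = new HashSet<Guid>();
            orders.Add(order.Id);
        }
    }

    public IEnumerable<Guid> OrdersContaining(Guid articleId)
    {
        HashSet<Guid> orders;
        return index.TryGetValue(articleId, out orders) ? orders : Enumerable.Empty<Guid>();
    }
}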
Post by Jake Bluford
There were some other nightmare scenarios I was going to mention about
changing contacts and phone numbers, and what if the SKU# for
WidgetTypeA changes over time, etc. but I think you get the idea.
In our case the data core engine keeps the complete history of all changes
so far. This could be cleaned up later, but for now it is complete. Of
course, this is just the option we took. Each time an object is saved a new
revision time stamp is created; it is part of the key. Objects saved in the
same transaction share the same time stamp; these are the change sets. For
each user request a snapshot of the entire database is taken. This is quite
simple: it is just DateTime.Now. The engine only returns versions that were
valid at this time stamp. All later changes are invisible to this request,
and all current changes are rejected if they do not modify the most recent
version (optimistic locking).
Once saved, an object version becomes immutable. And since all requests
operate on a snapshot slightly in the past, they only access immutable
objects. Immutable objects are inherently thread-safe, so all objects in
memory are shared between all parallel requests. This saves a lot of memory
when you have concurrent access to the same data from different users or
parallel tasks. It also improves CPU cache efficiency, and read-only memory
access is not sensitive to cache-line hopping on multi-core hardware.
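A stripped-down sketch of that idea (illustrative only, not the real
interfaces of our engine):

using System;
using System.Collections.Generic;
using System.Linq;

// Immutable versions keyed by (object key, revision time stamp); reads see a snapshot,
// writes use optimistic locking against the most recent revision.
class VersionedStore<T>
{
    private readonly Dictionary<Guid, SortedList<DateTime, T>> versions =
        new Dictionary<Guid, SortedList<DateTime, T>>();

    // A request works against a fixed snapshot, e.g. DateTime.UtcNow taken when it starts.
    public T Get(Guid key, DateTime snapshot)
    {
        SortedList<DateTime, T> history;
        if (!versions.TryGetValue(key, out history)) return default(T);
        // Latest revision that already existed at the snapshot; later changes stay invisible.
        return history.Where(v => v.Key <= snapshot).Select(v => v.Value).LastOrDefault();
    }

    public void Save(Guid key, T value, DateTime basedOnRevision, DateTime newRevision)
    {
        SortedList<DateTime, T> history;
        if (!versions.TryGetValue(key, out history))
            versions[key] = history = new SortedList<DateTime, T>();
        // Optimistic locking: reject the write if it is not based on the newest revision.
        if (history.Count > 0 && history.Keys[history.Count - 1] != basedOnRevision)
            throw new InvalidOperationException("Concurrent update detected.");
        history.Add(newRevision, value);   // the saved version is never modified again
    }
}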
Furthermore the data can be compressed at different layers. First of all,
in-memory objects and strings can be deduplicated. E.g. a foreign key
instance always takes only one machine-size word in .NET (32 or 64 bit)
regardless of how complex it is; all references point to the same multiton.
The same applies to many attribute values. E.g. document types, often stored
as strings, share the same string storage. This shrinks the memory footprint
by about a factor of 5 in our case.
The history can also be compressed. Usually only a few properties change on
each save, so the persistent storage only needs to keep the most recent
version completely. The history entries then contain only delta frames with
backups of the overwritten properties. On deserialization the most recent
version has to be read first and then the delta frames are applied. Since
access to an outdated version of an object that is not currently in the
memory cache (from the previous write) is very unlikely, this has no
significant impact. One could store "I-frames" from time to time to speed up
access to old historical data. This saves about a factor of 3 on disk,
depending on the average number of changes per object.
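Roughly like this (again just a sketch, using property bags instead of real
objects to keep it short): the newest version is stored completely, older
revisions keep only the values that a later save overwrote, and an old state
is reconstructed by walking the deltas backwards from the current one.

using System;
using System.Collections.Generic;

// Delta frame: the property values a revision had, for those properties that
// the next save overwrote.
class DeltaFrame
{
    public DateTime Revision;
    public Dictionary<string, object> OverwrittenProperties =
        new Dictionary<string, object>();
}

static class HistoryReader
{
    // current: the fully stored most recent state; deltasNewestFirst: frames ordered newest first.
    public static Dictionary<string, object> Reconstruct(
        Dictionary<string, object> current,
        IEnumerable<DeltaFrame> deltasNewestFirst,
        DateTime wantedRevision)
    {
        var state = new Dictionary<string, object>(current);
        foreach (var delta in deltasNewestFirst)
        {
            if (delta.Revision < wantedRevision) break;   // older frames are not needed
            foreach (var property in delta.OverwrittenProperties)
                state[property.Key] = property.Value;     // roll this property back
        }
        return state;
    }
}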

In fact the data core turned out to be roughly 10 times faster than the SAP
R/3 system with its Oracle backend. And it is about 3 orders of magnitude
faster than a PHP application on a WAMP platform that has almost the same
flexible data model as ours but stores it in a transposed table. By the way,
the server is a dual-core VM with AFAIR 8 GB memory and a CPU load almost
always <5%; the PHP machine in contrast has 16 cores and 16 GB memory, and
several times a day all 16 cores are at 100%. The number of users is the
same, since most of them use both applications at once.
Post by Jake Bluford
Bottom line is that RDBMS were designed to store data efficiently,
allow referential integrity (solving the problem above),
Referential integrity is not exclusive to RDBMSes.
Post by Jake Bluford
facilitate
efficient indexing/queries,
Well, this point is an issue. SQL optimizers are quite smart nowadays,
especially Oracle's (MSSQL's a bit less so).
With the power of LINQ you can easily end up with expressions that are no
longer understood by the optimizer. This causes either bad runtime behavior
or, in the case of EF, an exception (I hate these exceptions).
Post by Jake Bluford
reduce redundancy of storage (not just for
the storage space cost but the efficiency of querying non-redundant
data).
This is another point in favor of document databases. Oracle and MSSQL use
8k pages, so basically almost all disk I/O is performed with an 8k block
size. The SAN statistics show another peak at 64k (probably large pages),
but it is at least one order of magnitude lower. With storage based on 15k
HDDs you get a throughput of about 200 IOPS at 1.6 MB/s per spindle - very
bad; even my first 486 computer could read from its HDD at 6 MB/s.
A document-based DB tends to read larger blocks, often including unnecessary
data. But the disks do not care as long as the average block size stays
below the critical size of a few MB. The number of IOPS is almost unchanged,
but you do not need as many of them for the same data.
Only with fully flash-based storage does the situation change; SSDs deal
much better with small blocks.
Post by Jake Bluford
Maybe my understanding of these other systems is wrong, but I think in
those cases all of the data is stored with the document. Fundamentally
it can never provide the benefits of storing a tiny key to the real
data, from query speed to being able to update in a single location
and have the update instantly applied everywhere (without having to
actually do an expensive update or write operation).
There is no major difference with respect to these topics, as mentioned
above.


Marcel
Jake Bluford
2015-12-01 02:39:21 UTC
On Tue, 01 Dec 2015 02:40:33 +0100, Marcel Mueller
Post by Marcel Mueller
Post by Jake Bluford
On Mon, 30 Nov 2015 22:25:23 +0100, Marcel Mueller
Post by Marcel Mueller
What are the requirements that need a relational data models?
Well just to think up a example that applies to many business
applications, imagine something like an account on an order that has a
contact.
Order #1234 was placed in the year 2010 by CompanyXYZ in Omaha
Nebraska, principal contact is John Smith, phone number is 555-6666.
That order was for five WidgetTypeA products.
If you order more than one item at once the model is no longer
relational. Now it contains a list of order items. Creating a new entity
for order items and repeating the order ID over and over for them is the
work around not part of the requirement.
Many systems would only need the primary key of the item and its quantity
for the order; that would still be relational. I can't imagine storing all
the information about an item or part in the order record itself; most of
it, aside from the key, would typically be stored in another table. If
something about that part/item turns out to be wrong, it should be fixable
in a single place, not on every order where it was ever referenced. But
that's just describing possible implementations.
Post by Marcel Mueller
Post by Jake Bluford
Many other customers ordered WidgetTypeA. But unfortunately, it is
realized that WidgetTypeA is somehow defective, and needs to be
recalled as a matter of life or death, or perhaps it is deemed to
violate some state law that applies in California only.
How do we identify all orders for WidgetTypeA in California?
Place a query to the orders with an appropriate .Where condition. E.g.
Query<Orders>().Where(o => o.Articles.Contains(WidgetTypeA) && State ==
California)
But that requires you to pull all the orders from the database over the
network to wherever the LINQ query is executed. Imagine there are only two
records that meet that criteria, but there are millions of orders in the
database. With SQL, they are identified efficiently through indexing, and
only enough data to hold two orders is sent over the network to the
application. Anything else needs to retrieve a million records and then
process them - painfully, if the information is deep in some XML document
for example.
Post by Marcel Mueller
Post by Jake Bluford
Without
referential integrity that provides foreign key reference to the NEW
company as it exists in 2015, we have no real link to it, they used to
be in Nebraska.
Of course, companies, articles and orders are /different/ entities and
foreign keys are still allowed and required. But orders and order items
and maybe multiple schedules are parts of the same object.
Not IMO, at least in a good design. Now, someone might say that an INVOICE
(which is like a historical representation of an order at one point in time)
might have a lot more data added to it, because, for example, if WidgetTypeA
had its part name later changed to WidgetTypeB, then it is very possible
that the invoice would retain the actual full name of the part (listing what
it is for the customer's convenience), thereby preserving that historical
name. In that case, the invoice is indeed a document in the true sense, and
the kind of content that lends itself to archiving, like a .PDF copy. But it
is historical data that probably doesn't need all aspects of the document to
be queryable; it's maybe just a BLOB sitting on the customer's account
record or something. Customer service would probably search for the
customer's account number first, before retrieving the invoice.

But again, there's a difference between the kind of order I was envisioning
(one that queries/analytics etc. would be run against) and something like an
invoice document that describes the order.
Post by Marcel Mueller
[...]
Well, it's hard for me to gauge exactly what your engine is doing. I'm sure
a lot of the applications I've worked on could be re-imagined to work with a
non-relational/document-type data mechanism. You mentioned a VM, and I've
never seen any RDBMS on a VM that wasn't absolutely crippled by the
performance penalty inherent in virtualization itself, so if you were
comparing the speed of MSSQL/Oracle on a VM to anything other than an RDBMS,
I'm not surprised that you saw a vast difference in speed.

Also it's possible that the scenario I described would never happen with
your app. But imagine if version snapshotting were the only way that queries
could get those millions of orders, and the data changed regularly; now
every time that LINQ query runs it has to pull all the orders across the
network, plus all the version history, and then start processing it.

I guess it just depends on the app, the infrastructure setup, the number
of users, how they use it, etc. I have worked on apps where the
scenario I just described would probably be a non-issue, but I've
worked on others where I just can't imagine how long it would take
that sort of processing to execute.
Marcel Mueller
2015-12-01 09:00:51 UTC
Post by Jake Bluford
Post by Marcel Mueller
Post by Jake Bluford
How do we identify all orders for WidgetTypeA in California?
Place a query to the orders with an appropriate .Where condition. E.g.
Query<Orders>().Where(o => o.Articles.Contains(WidgetTypeA) && State ==
California)
But that requires you to pull all the orders from the database over
the network to wherever the LINQ query is executed.
No. It is up to the LINQ provider to do whatever it likes.
So if there is an index on Order.Articles, use it. The index will return the
primary keys or object IDs that match; only those have to be physically read
into memory. And if there is another index on Order.State, the provider can
further intersect the keys returned by the first index with the ones
returned by the second. Then no additional data has to be read at all. Note
that it is not necessary to have both properties in the /same/ index here.
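As a small illustration (the index shapes are invented): both indexes only
return key sets, and their intersection already answers the Where condition
before a single order object is materialized.

using System;
using System.Collections.Generic;
using System.Linq;

static class IndexIntersection
{
    // ordersByArticle: index on Order.Articles; ordersByState: index on Order.State.
    public static IEnumerable<Guid> Find(
        IDictionary<Guid, HashSet<Guid>> ordersByArticle,
        IDictionary<string, HashSet<Guid>> ordersByState,
        Guid articleId, string state)
    {
        HashSet<Guid> byArticle, byState;
        if (!ordersByArticle.TryGetValue(articleId, out byArticle) ||
            !ordersByState.TryGetValue(state, out byState))
            return Enumerable.Empty<Guid>();
        // Only the surviving keys ever need to be loaded from the store.
        return byArticle.Intersect(byState);
    }
}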
Post by Jake Bluford
Imagine there are
only two records that meet that criteria, but there are millions of
orders in the database.
No problem.
Post by Jake Bluford
With SQL, they are identified efficiently
through indexing and just enough data to hold two orders are sent over
the network to the application.
ACK, ...
Post by Jake Bluford
Anything else needs to retrieve a
million records,
... but to claim that /every/ other solution is worse is daring.
Post by Jake Bluford
Post by Marcel Mueller
Of course, companies, articles and orders are /different/ entities and
foreign keys are still allowed and required. But orders and order items
and maybe multiple schedules are parts of the same object.
Not IMO, at least in a good design.
You probably learned that normalizing the data model is always good advice.
Well, of course, if your DBMS cannot reasonably deal with anything else then
this is true. Otherwise it is not.
Post by Jake Bluford
Now, someone might say that an
INVOICE (which is like a historical representation of an order at one
point in time) might have a lot more data added to it, because for
example if WidgetTypeA had it's part name later changed to
WidgetTypeB, then it is very possible that the invoice would retain
the actual full name of the part (listing what it is for the
customer's convenience), and retaining that historical name.
Again, this is one possible solution. No more, no less.
Post by Jake Bluford
Well it's hard for me to gauge exactly what your engine is doing. I'm
sure a lot of the applications I've worked on could be re-imagined to
work with a non-relational/document type data mechanism.
I would not recommend making a change like that to an existing application.
But if you develop a new one you could think twice before you choose SQL.
There are other options. The era of noSQL has just started.
Post by Jake Bluford
You
mentioned a VM, and I've never seen any RDBMS on a VM that wasn't
absolutely crippled by the performance inherent to virtualization
itself, so if you were comparing the speed of MSSQL/Oracle on a VM to
anything other than an RDBMS, I'm not surprised that you saw a vast
difference in speed.
Those times are over now. Recent versions of MSSQL as well as Oracle are
explicitly supported in a VM, and there is no significant performance impact
if we are talking about the same physical hardware.
In fact our SAP/Oracle system still runs on physical hardware (one of the
last) while our application runs on VMware, as does MSSQL.

But today's I/O statistics for the app show that of about 20,000 LINQ
queries and 80,000 primary key lookups only roughly 2,000 ever reached the
DBMS. So the speed of the persistence layer is almost irrelevant.
With EF and an RDBMS this would have been 100,000 DBMS requests over the
network. Well, the programmers would probably not do primary key lookups
that carelessly in that case.
OK, the numbers are quite small. There have been only 10,000 user requests
so far today. The application is more heavily used on Mondays and Thursdays.
Post by Jake Bluford
Also it's possible that the scenario I described would never happen
with your app. But imagine if version snapshotting were the only way
that queries could get those millions of orders, and the data changed
regularly, now everytime that LINQ query is to run it has to pull all
the orders across the network, all the version history, then start
processing it.
You forgot that different queries always share the same objects in memory.
The objects are multitons: no more than one object per entity, primary key
(and revision) exists in memory at a time. And of course, like any other
system, there is a cache that does not immediately discard objects after
they have been used. Since the memory for the properties is deduplicated
there is not much benefit in discarding them.
The concept is similar to SAP's flavor of the month, HANA, although it was
developed without any knowledge of HANA.
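The multiton part is essentially this (a sketch, not our real cache):

using System;
using System.Collections.Concurrent;

// At most one materialized instance per (primary key, revision); since saved versions
// are immutable, all parallel requests can safely share that one instance.
class MultitonCache<T> where T : class
{
    private readonly ConcurrentDictionary<Tuple<Guid, DateTime>, T> instances =
        new ConcurrentDictionary<Tuple<Guid, DateTime>, T>();

    public T GetOrAdd(Guid key, DateTime revision, Func<T> load)
    {
        return instances.GetOrAdd(Tuple.Create(key, revision), _ => load());
    }
}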
Post by Jake Bluford
I guess it just depends on the app, the infrastructure set up, number
of users, how they use it, etc.
Indeed. No solution is optimal for everything.
Post by Jake Bluford
I have worked on apps where the
scenario I just described would probably be a non-issue, but I've
worked on others where I just can't imagine how long it would take
that sort of processing to execute.
In my experience there are only a few business requirements that need
solutions worse than O(N log N). The traveling salesman problem is one of
the most famous ones.


Marcel
Jake Bluford
2015-12-02 21:07:14 UTC
On Tue, 01 Dec 2015 10:00:51 +0100, Marcel Mueller
<***@spamgourmet.org> wrote:

Just to simplify things a bit (and partially because I believe I don't
have enough experience with document databases or NoSQL to properly
discuss some of the points here), I'm curious about something.

Let's say we have a simple relational model for a bug-tracking
database with only a few tables.

One table holds the bug itself (issue number, person who entered it
etc). It has a field called "priority" that determines how the issue
is handled (or at least how soon).

Another table holds priorities, for example the id, the displayed
name, like "VERY LOW", "LOW", "MEDIUM", "HIGH", "CRITICAL".

Each priority record also carries some other attributes that affect other
attributes on the main bug record. For example, if the bug is for
SoftwareProductA, these attributes say that a CRITICAL issue needs attention
within the next 5 hours, but if it's for SoftwareProductB the issue can wait
up to 48 hours for attention if needed.

The data entry screen allows the user to select a priority when entering a
bug into the Bug table, but basically just adds the foreign key of the
Priority entity record to the main Bug table - a typical one-to-many
relationship. The attributes on the Priority table that affect the service
window mentioned above are defined in that table, because they may need
occasional adjustment. The admin needs to be able to adjust these in one
place; for example, he might adjust the record that holds the definition for
CRITICAL to move from a value of 1.0 to .8, which says that SoftwareProductA
needs attention in 4 hours instead of 5, and likewise all of the other
software products that can have issues are affected by this one change.
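In EF code-first terms (just to sketch it, the names are made up), that
one-to-many looks something like this - the bug row stores only the key, and
the priority's attributes live in exactly one place:

using System.Collections.Generic;

public class Priority
{
    public int PriorityId { get; set; }
    public string DisplayName { get; set; }          // "VERY LOW" ... "CRITICAL"
    public double ServiceWindowFactor { get; set; }  // adjust 1.0 -> .8 here and every product follows
    public ICollection<Bug> Bugs { get; set; }
}

public class Bug
{
    public int BugId { get; set; }
    public string EnteredBy { get; set; }
    public int PriorityId { get; set; }              // foreign key only
    public Priority Priority { get; set; }           // navigation property
}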

/* side note - I haven't talked about things like the table that relates the
product to the bug, or the one that assigns the possible issues to the
products, because I'm trying to keep things simple. I just wanted to leave
them out of this scenario and focus on the Bug entity and the Priority
entity, but in reality there might be another half dozen or more tables just
supporting the relatively simple 1:M I have described so far. */

If more data like the priority display name is placed into the main Bug
record, we have potential problems. What if there was a typo or misspelling?
Say it was spelled CRITKAL during the first few weeks and later corrected.
All those original records have the misspelled version embedded in them,
making future queries difficult/inaccurate (loss of referential integrity).
With enough diligence this could be solved through inefficient updates, data
cleansing efforts, inefficient manual intervention, better training to
reduce human error, etc., but those problems all seem very prehistoric and
are the whole reason relational databases were invented.

The same scenario could happen with the value that was changed from 1.0 to
.8: what if this were embedded in each issue record? The algorithm that
calculates what happens for a given priority now potentially differs among
the bug records - maybe it's .8 on newer records but 1.0 on older records.
What a mess!

Now I know you did say that foreign keys are allowed in your system
but in my mind, once you allow foreign keys to solve the problems I've
mentioned above, it is an RDBMS by definition, even if it takes on
some of the characteristics of other database types.

The kind of thing I mentioned above has been a part of almost every
application (on the back end I mean) that I have been involved with
over the last 25 or 30 years.

I'm just trying to understand how these problems might be addressed
without SQL and relational data.
Marcel Mueller
2015-12-02 21:56:28 UTC
Post by Jake Bluford
Let's say we have a simple relational model for a bug-tracking
database with only a few tables.
If we have a relational model then, of course, an RDBMS fits best.
Post by Jake Bluford
One table holds the bug itself (issue number, person who entered it
etc). It has a field called "priority" that determines how the issue
is handled (or at least how soon).
Another table holds priorities, for example the id, the displayed
name, like "VERY LOW", "LOW", "MEDIUM", "HIGH", "CRITICAL".
With each priority record are some other attributes that have
importance to other attributes on the main bug record. For example,
if the bug is for SoftwareProductA, these attributes say that a
CRITICAL issue needs attention in the next 5 hours, but if it's for
SoftwareProductB the issue can wait up to 48 hours of attention if
needed.
The priority properties should then be part of the entity SoftwareProduct.
Probably the list of properties is even product-specific, so there is no
strong entity Priority; it is only a weak entity defined in the context of
SoftwareProduct.
Alternatively, the priorities might be part of another entity, SLA, which is
part of a maintenance contract.

[...]
Again, foreign keys are still required.

And in fact I would not model a bug tracker with a document DB, for several
reasons. First of all, the data volume will probably never exceed what fits
into memory during the product's life cycle, regardless of the data model.
Secondly, this requirement is perfectly well served by a relational model.
Post by Jake Bluford
Now I know you did say that foreign keys are allowed in your system
but in my mind, once you allow foreign keys to solve the problems I've
mentioned above, it is an RDBMS by definition, even if it takes on
some of the characteristics of other database types.
OK, maybe I mixed things up a bit when I treated SQL as equivalent to RDBMS.
The problem is not the relations but the normalization of the model into
flat entities to fit the constraints of SQL. (Well, MSSQL introduced some
XML support for complex data types, but the performance is like an ATARI
ST's, it does not even support XSLT 1.0, and the syntax drives you crazy.)
Post by Jake Bluford
I'm just trying to understand how these problems might be addressed
without SQL and relational data.
Without SQL: no problem, but you will need relations.

In our case we simply defined a generic struct Ref<T> to express foreign
keys to any entity T wherever you like.
A generic method T Get<T>(Ref<T>) of the Context class (basically the same
as an EF context) retrieves the matching object (usually from the cache).
One difference from EF is that there is no generated code. The Context class
does not know about the existing entities. It just provides entry points
like Get or IQueryable<T> Query<T>(). These methods forward the requests to
cache and persistence providers that have to be set up for each entity.
In fact a base class does the job, since all database entities share the
same (generic) providers. But the concept also allows accessing entities
from remote services like web services, SAP RFC, Active Directory or
whatever in the same way. So an entity can have a foreign key to, e.g., an
AD user although no such table exists in the application's data model. Even
a query to a BI system with sales history, or the statistics of the
application's index usage, is just an entity. Such entities might not carry
the tag IQueryableEntity, so the Query method is not allowed for them, since
it makes little sense to query queries.
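Boiled down to a skeleton it looks roughly like this (heavily simplified,
the provider interface and member names are invented for illustration):

using System;
using System.Linq;

// Sketch only: typed foreign keys plus a Context that knows nothing about the entity model.
public struct Ref<T>
{
    public readonly Guid Key;
    public Ref(Guid key) { Key = key; }
}

public interface IEntityProvider
{
    object Get(Guid key);      // usually answered from the cache
    IQueryable Query();
}

public class Context
{
    // Resolves the cache/persistence provider registered for an entity type.
    private readonly Func<Type, IEntityProvider> providers;

    public Context(Func<Type, IEntityProvider> providers) { this.providers = providers; }

    public T Get<T>(Ref<T> reference)
    {
        return (T)providers(typeof(T)).Get(reference.Key);
    }

    public IQueryable<T> Query<T>()
    {
        return providers(typeof(T)).Query().Cast<T>();
    }
}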


Marcel
Jake Bluford
2015-12-03 00:17:21 UTC
On Wed, 02 Dec 2015 22:56:28 +0100, Marcel Mueller
Post by Marcel Mueller
Post by Jake Bluford
Let's say we have a simple relational model for a bug-tracking
database with only a few tables.
If we have a relational model then, of course, a RDDMS fits best.
Post by Jake Bluford
One table holds the bug itself (issue number, person who entered it
etc). It has a field called "priority" that determines how the issue
is handled (or at least how soon).
Another table holds priorities, for example the id, the displayed
name, like "VERY LOW", "LOW", "MEDIUM", "HIGH", "CRITICAL".
With each priority record are some other attributes that have
importance to other attributes on the main bug record. For example,
if the bug is for SoftwareProductA, these attributes say that a
CRITICAL issue needs attention in the next 5 hours, but if it's for
SoftwareProductB the issue can wait up to 48 hours of attention if
needed.
The the priority properties should be part of the entity
SoftwareProduct.
Probably even the list of properties is product
specific. So there is no strong entity Priority. It is only a weak
entity defined in the context of SoftwareProduct.
Or alternatively the priorities might be part of another entity SLA
which is part of a maintenance contract.
Definitely not; this would violate the normalization rules of a relational
model. We could get away with violating normalization for the simple example
I've given because I only listed 5 possible priorities, but in most systems
you cannot guarantee there will never be more than 5 priorities, so if you
start adding those properties as fields to an entity like the product, you
have to start adding a new field to that entity every time a new possible
value for priority comes along! Raymond Boyce and Edgar Codd will roll over
in their graves at a design flaw like this ;)

Being a bit more serious for a moment, the fact that that is even a remotely
workable solution is an unfortunate side effect of my choosing priorities,
which typically have a fairly limited number of values in a bug system. No
system should have a hard limit of 5 possible values (Attribute1,
Attribute2, etc. in a table) when the truth is it needs to be flexible
enough to support N values without penalty.
Post by Marcel Mueller
[...]
Again foreign key are still required.
And in fact I would not model a bug tracker with a document DB for
several reasons. First of all the data volume will probably never exceed
anything that does not fit into memory during a product life cycle,
regardless of the data model. Secondly this requirement is perfectly
serviced by a relational model.
Most of the systems I've worked on would have been in bad shape if I had
made not-to-exceed assumptions. I once worked on a hosted issue-tracking
application (sort of CRM-like functionality, not software issues only but
company "bugs") that served a large customer base, with people both inside
and outside the organization putting data in. The number of concurrent user
connections actively using the system (that is, users entering data and
running reports) was in the thousands at busy times. I don't remember the
exact data volume but it was quite high.
Post by Marcel Mueller
Post by Jake Bluford
Now I know you did say that foreign keys are allowed in your system
but in my mind, once you allow foreign keys to solve the problems I've
mentioned above, it is an RDBMS by definition, even if it takes on
some of the characteristics of other database types.
OK, maybe I mixed up things a bit when I set SQL equivalent to RDBMS.
The problem are not the relations but the normalization of the model to
flat entities to fit the boundaries of SQL. (Well, MSSQL introduced some
XML support for complex data types, but the performance is like an ATARI
ST, it does not even support XSLT 1.0 and syntax drives you crazy.)
Yes, I agree that if the whole system is based on XML documents
(something that would be a bad choice for most of the systems I've
worked on... XML is usually a side-feature rather than a primary DB
choice), then SQL or RDBMS in general is not going to be the best
choice.

To me storing things in XML is a good choice when the data itself is
truly treated like a document throughout the system. For example
something like a FAQ document, where the question and answer values
would be very different for every record. I wouldn't want to have
just a large entity table, with a field for question and a field for
answer, because how many times would I ever need to query for a
particular question? And if I did, it's find to just search/traverse
the document itself and incur the performance hit.
Post by Marcel Mueller
Post by Jake Bluford
I'm just trying to understand how these problems might be addressed
without SQL and relational data.
Without SQL: no problem, but you will need relations.
In our case we simply defined a generic struct Ref<T> to express foreign
keys to any entity T where ever you like.
A generic method T Get<T>(Ref<T>) of the Context class (basically the
same than an EF context) retrieves the matching object (usually from the
cache).
So in your case you always have most of the data in memory? I can see this
working well if you don't have lots of users constantly updating the data
and needing to retrieve that info in real time; otherwise you would
constantly need to check for changes in the central store, update your model
in memory, etc.
Marcel Mueller
2015-12-05 09:57:13 UTC
Post by Jake Bluford
Post by Marcel Mueller
The the priority properties should be part of the entity
SoftwareProduct.
Probably even the list of properties is product
specific. So there is no strong entity Priority. It is only a weak
entity defined in the context of SoftwareProduct.
Or alternatively the priorities might be part of another entity SLA
which is part of a maintenance contract.
Definitely not, this would violate normalization rules of a relational
model. We could get away with violating normalization for the simple
example I've given because I only listed 5 possible priorities, but in
most systems you cannot guarantee there can ever be more than 5
priorities, so if you start adding those properties as fields to an
entity like the product, you have to start adding a new field to that
entity every time a new possible value for priority comes along!
Why should the value be restricted to 5 levels?

Ah, you mean one field per level? Of course not.
The field name is Priorities and its type is an array whose elements have a
key that is unique only within one array instance - basically a weak entity.
The foreign key is then SoftwareProduct (or whatever the parent is) plus
Priority. A generalized foreign key denotes a path in the object tree of the
other entity.
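In code the shape is roughly this (invented names, simplified):

using System;
using System.Collections.Generic;

// Priorities as a weak entity nested inside the product; the key is unique per product only.
class SoftwareProduct
{
    public Guid Id;
    public List<PriorityLevel> Priorities = new List<PriorityLevel>();
}

class PriorityLevel
{
    public string Key;           // e.g. "CRITICAL", unique only within one Priorities array
    public double ServiceHours;  // product-specific attention window
}

// Generalized foreign key: the parent entity plus the path into its object tree.
struct PriorityRef
{
    public Guid SoftwareProductId;
    public string PriorityKey;
}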

(Btw: I have seen an ugly model like this in SAP SCM for procurement cycles.
They created one field for each day of the week. Although the number of days
per week is not that likely to change, the concept failed badly when the
procurement cycle did not repeat within one week.)
Post by Jake Bluford
Most of the systems I've worked on would have been in bad shape if I
made not-to-exceed assumptions. I once worked on an app that offered
a hosted issue tracking application (sort of like a CRM-like
functionality, not software issues only but company "bugs") that
served a large customer base, it had people both inside the
organization putting data in. The number of concurrent user
connections actively using the system (that is users entering data and
running reports) was in the thousands at a busy time. I don't
remember the exact data volume but it was quite high.
High volume could be an indication for in-memory solutions, but a reasonably
large DB cache will probably do the job just as well in this case.
Post by Jake Bluford
Post by Marcel Mueller
Post by Jake Bluford
I'm just trying to understand how these problems might be addressed
without SQL and relational data.
Without SQL: no problem, but you will need relations.
In our case we simply defined a generic struct Ref<T> to express foreign
keys to any entity T where ever you like.
A generic method T Get<T>(Ref<T>) of the Context class (basically the
same than an EF context) retrieves the matching object (usually from the
cache).
So in your case you always have most of the data in memory?
Yes, most of the /often used/ data. This keeps the UI responsive.
And the idea is to put as much data as possible into a given amount of
memory, so sharing memory between user requests and even between different
objects is part of the solution. That's the concept. The typical compression
ratio for larger amounts of data is about 10:1, so you can keep about 20 GB
of data within 2 GB of memory; the effective cache size is 20 GB in this
case.
You can easily scale this up to a 200 GB effective cache size, which is more
than the typical working-set size of many applications.
I would not use this concept for larger applications, say with 100 GB of
physical memory, for now, because I doubt whether the memory management of
the .NET runtime is reliable at those dimensions. (Probably not.)
Post by Jake Bluford
I can see
this working well if you don't have lots of users constantly updating
the data and the need to retrieve that info real time, otherwise you
could constantly need to look for changes to the central store, update
your model in memory, etc.
No, there is no lookup to the central store in case of updates. If you want
to attach more core instances to one persistence layer you need
synchronization channels similar to the concept of cache coherence in
multi-core CPUs. But because of the optimistic locking in all layers there
are no write locks in the cache and no cache-hopping effects. The only valid
operation is to replace version X of object Y by version Z atomically. This
causes an invalidation broadcast to all instances carrying the key of Y and
the new time stamp Z. If the object exists in another cache too, it will be
invalidated there for requests after time stamp Z.
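Sketched, the broadcast handling on a peer instance amounts to something
like this (illustrative only, not our actual synchronization code):

using System;
using System.Collections.Concurrent;

// A peer cache reacting to an invalidation broadcast: (key of Y, new revision time stamp Z).
class PeerCache
{
    // key -> newest revision this instance has been told about
    private readonly ConcurrentDictionary<Guid, DateTime> latestKnown =
        new ConcurrentDictionary<Guid, DateTime>();

    public void OnInvalidation(Guid key, DateTime newRevision)
    {
        // Requests with snapshots before newRevision keep their version; newer snapshots
        // will treat the cached entry as stale and reload from the persistence layer.
        latestKnown.AddOrUpdate(key, newRevision,
            (k, known) => newRevision > known ? newRevision : known);
    }

    public bool IsCurrent(Guid key, DateTime cachedRevision)
    {
        DateTime latest;
        return !latestKnown.TryGetValue(key, out latest) || cachedRevision >= latest;
    }
}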

I tried to implement this feature with the SQL message broker first. But the
implementation of the message broker and the .NET subscriptions is so awful
and unreliable that it was useless. Probably the sales department requested
the feature with no reasonable time window, and the programmers stopped
implementing it once the presentation no longer crashed immediately.

However, because of the high efficiency there is no need for a second
engine for now. We only plan to set up a second engine for availability
reasons. The idea is to do background updates this way:
- Put one instance in maintenance mode and service all requests by the
other instance,
- then update the first instance,
- fill the cache with some commonly used data,
- switch maintenance mode to the other instance and
- repeat the procedure for the second instance.
Normally no user will notice this because the requests are sessionless.


Marcel
