• OLTP - online transaction processing systems. OLTP and OLAP technologies

    Among the tools currently offered by the information technology market for processing and visualizing data to support management decisions, OLTP and OLAP technologies are the most suitable. OLTP technology is focused on operational data processing, while the more modern OLAP technology is focused on interactive data analysis. Systems built on them make it possible to understand the processes occurring at a managed facility through prompt access to various data slices (views of database contents organized to reflect different aspects of the enterprise's activities). In particular, by providing a graphical representation of data, OLAP makes processing results easy to perceive.

    OLTP (Online Transaction Processing) - real-time transaction processing. A way of organizing a database in which the system handles a large flow of small transactions while the client requires the fastest possible response time from the system.

    In modern DBMSs, transaction serialization is organized through a locking mechanism: during the execution of a transaction, the DBMS locks the database, or the part of it accessed by the transaction, and the lock is held until the transaction is committed. If, during parallel processing, another transaction attempts to access the locked data, its processing is suspended and resumed only after the transaction holding the lock completes and the lock is released. The smaller the object being locked, the higher the efficiency of the database. A transaction that updates data on several network nodes is called DISTRIBUTED; if a transaction works with a database located on a single node, it is called LOCAL. From the user's point of view, local and distributed transactions should be processed in the same way, i.e. the DBMS must organize the execution of a distributed transaction so that all the local transactions it comprises are committed synchronously on all nodes of the distributed system affected by them. A distributed transaction should be committed only if all of its constituent local transactions are committed, and if at least one local transaction is aborted, the entire distributed transaction must be aborted. To implement these requirements in practice, the DBMS uses a two-phase commit mechanism.

    1. The database server committing a distributed transaction sends a "prepare to commit" command to all network nodes registered to execute parts of the transaction. If at least one of the servers does not report readiness, the server coordinating the distributed transaction rolls back the local transactions on all nodes.

    2. If all local DBMSs report readiness to commit, the server coordinating the distributed transaction completes the commit by sending a commit command to all local servers.
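    As an illustration, here is a minimal sketch of a two-phase commit coordinator in Python. The Participant class and its prepare/commit/rollback methods are hypothetical stand-ins for the local DBMS servers, not a real DBMS API; the sketch only mimics the voting logic described above.

# Minimal two-phase commit sketch (illustrative only, no real DBMS involved).

class Participant:
    """A hypothetical local server taking part in a distributed transaction."""
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare

    def prepare(self):
        # Phase 1: the node reports whether it is ready to commit.
        print(f"{self.name}: prepared" if self.will_prepare else f"{self.name}: not ready")
        return self.will_prepare

    def commit(self):
        print(f"{self.name}: committed")

    def rollback(self):
        print(f"{self.name}: rolled back")


def two_phase_commit(participants):
    # Phase 1: ask every node to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: every node is ready, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # At least one node is not ready: abort the whole distributed transaction.
    for p in participants:
        p.rollback()
    return False


two_phase_commit([Participant("node A"), Participant("node B")])           # commits
two_phase_commit([Participant("node A"), Participant("node C", False)])   # rolls back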

    OLAP (online analytical processing) - a technology for processing information that includes the compilation and dynamic publication of reports and documents. It is used by analysts to quickly process complex queries against databases and serves for preparing business reports on sales and marketing for management purposes, as well as for so-called data mining (a method of analyzing information in a database in order to find anomalies and trends without interpreting the semantic meaning of individual records).

    OLAP takes a snapshot of a relational database and structures it into a multidimensional model for queries. The stated processing time for queries in OLAP is about 0.1% of the time for similar queries against a relational database.

    An OLAP structure created from operational data is called an OLAP cube. A cube is created by joining tables using a star schema or a snowflake schema. At the center of the star schema is a fact table, which contains the key facts on which queries are made. Multiple dimension tables are joined to the fact table; they show how the aggregated relational data can be analyzed. The number of possible aggregations is determined by the number of ways in which the original data can be displayed hierarchically.
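    The sketch below, using pandas, illustrates this idea: a tiny fact table is joined to dimension tables (a star schema) and then aggregated along the chosen dimensions. The table and column names, as well as the data, are invented for the example.

import pandas as pd

# Hypothetical star schema: one fact table plus two dimension tables.
fact_sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "product_id":  [10, 20, 10, 20],
    "amount":      [100.0, 250.0, 80.0, 40.0],
})
dim_customer = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city":        ["Moscow", "Kazan", "Moscow"],
    "region":      ["Center", "Volga", "Center"],
})
dim_product = pd.DataFrame({
    "product_id": [10, 20],
    "category":   ["Books", "Toys"],
})

# Join the fact table to its dimensions (the "star"), then aggregate.
cube = (fact_sales
        .merge(dim_customer, on="customer_id")
        .merge(dim_product, on="product_id"))

# One possible aggregation: total sales by region and product category.
print(cube.groupby(["region", "category"])["amount"].sum())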

    For example, all customers can be grouped by city or by region of the country (West, East, North, etc.), so 50 cities, 8 regions and 2 countries make up 3 levels of hierarchy with 60 members. Customers can also be grouped in relation to products; if there are 250 products in 2 categories, 3 product groups and 3 production divisions, then the number of aggregates will be 16,560. As dimensions are added to the schema, the number of possible aggregations quickly reaches tens of millions or more.

    An OLAP cube contains the base data and information about the dimensions (aggregates). The cube potentially contains all the information that might be needed to answer any query. Because of the huge number of aggregates, a full calculation is often performed only for some dimensions, while for the rest it is done "on demand".

    The challenge in using OLAP lies in creating queries, selecting the reference data, and developing the schema; as a result, most modern OLAP products come with a huge number of pre-configured queries. Another problem lies in the underlying data: it must be complete and consistent.

    The first product to perform OLAP queries was Express (IRI). However, the term OLAP itself was coined by Edgar Codd, "the father of relational databases." Codd's work was funded by Arbor, a company that a year earlier had released its own OLAP product, Essbase (later acquired by Hyperion, which in turn was acquired by Oracle in 2007).

    Other well-known OLAP products include Microsoft Analysis Services (formerly called OLAP Services, part of SQL Server), Oracle OLAP Option, IBM's DB2 OLAP Server (essentially Essbase with additions from IBM), SAP BW, SAS OLAP Server, and products from Brio, BusinessObjects, Cognos, MicroStrategy and other vendors.

    OLAP is most commonly used in business planning and data warehouse products.

    OLAP uses a multidimensional representation of aggregated data to provide quick access to strategically important information for in-depth analysis. OLAP applications must have the following basic properties:

    • multidimensional data representation;
    • support for complex calculations;
    • correct consideration of the time factor.

    Advantages of OLAP:

    • increased productivity of production personnel and application developers; timely access to strategic information;
    • sufficient ability for users to make their own changes to the schema;
    • OLAP applications rely on data warehouses and OLTP systems for up-to-date data, thereby maintaining control over the integrity of corporate data;
    • reduced load on OLTP systems and data warehouses.
    Requirements for the analytical database (OLAP / data warehouse) versus the characteristics of operational databases (OLTP):

    Data warehouse (OLAP): the data warehouse should include both internal corporate data and external data.
    Operational database (OLTP): the main source of information entering the operational database is the corporation's own activities, whereas data analysis also requires external sources of information (for example, statistical reports).

    Data warehouse (OLAP): the volume of analytical databases is at least an order of magnitude larger than the volume of operational ones; reliable analysis and forecasting in a data warehouse require information about the corporation's activities and market conditions over several years.
    Operational database (OLTP): for operational processing, data for the last few months is sufficient.

    Data warehouse (OLAP): the data warehouse must contain uniformly presented and consistent information that is as close as possible to the content of the operational databases; a component is needed to extract and "clean" information from different sources.
    Operational database (OLTP): in many large corporations several operational information systems with their own databases coexist (for historical reasons); operational databases may contain semantically equivalent information presented in different formats, with different timestamps, and sometimes even contradictory.

    Data warehouse (OLAP): the set of queries to an analytical database cannot be predicted - data warehouses exist to answer ad hoc queries from analysts; one can only count on requests arriving not too often and involving large amounts of information, and the size of the analytical database encourages queries with aggregates (sum, minimum, maximum, average, etc.).
    Operational database (OLTP): data processing systems are created to solve specific problems; information is selected from the database frequently and in small portions, and the set of queries to the operational database is usually known already at design time.

    Data warehouse (OLAP): since the variability of analytical databases is low (data changes only when it is loaded), it is reasonable to keep arrays ordered, use faster indexing methods for mass retrieval, and store pre-aggregated data.
    Operational database (OLTP): data processing systems are by nature highly variable, which is taken into account in the DBMS used (normalized database structure, rows stored out of order, B-trees for indexing, transactions).

    Data warehouse (OLAP): analytical database information is so critical for the corporation that finer-grained protection is required (individual access rights to certain rows and/or columns of a table).
    Operational database (OLTP): for data processing systems, protection at the table level is usually sufficient.

    The objectives of an OLTP system are the rapid collection and optimal placement of information in the database, together with ensuring its completeness, relevance and consistency. Such systems, however, are not designed for efficient, fast, multidimensional analysis.

    Of course, reports can be built on the collected data, but this requires the business analyst either to interact constantly with an IT specialist or to have special training in programming and computer technology.

    What does the traditional decision-making process look like in a Russian company using an information system built on OLTP technology?

    The manager gives a task to an information department specialist in accordance with his own understanding of the issue. The specialist, having understood the task in his own way, builds a query to the operational system, receives an electronic report and brings it to the attention of the manager. This scheme for making critical decisions has the following significant drawbacks:

    • a negligible amount of data is used;
    • the process takes a long time, since composing queries and interpreting an electronic report are rather tedious operations, while the manager may need to make a decision immediately;
    • the cycle has to be repeated if the data needs to be clarified or viewed from a different angle, or if additional questions arise; and this slow cycle is usually repeated several times, with even more time spent on data analysis;
    • the difference in professional training and fields of activity between the information technology specialist and the manager has a negative effect: they often think in different categories and, as a result, do not understand each other;
    • an additional unfavorable factor is the complexity of electronic reports for perception: the manager has no time to pick out the figures of interest from the report, especially since there may be too many of them. Clearly, the work of preparing data most often falls on the specialists of information departments; as a result, a competent specialist is distracted by the routine and ineffective work of compiling tables, diagrams, etc., which naturally does not help improve his qualifications.

    There is only one way out of this situation, and it was formulated by Bill Gates in the expression "Information at your fingertips": the source information must be directly available to its immediate consumer, the analyst. The task of the information department staff is then to create a system for collecting, accumulating, storing and protecting information and for ensuring its availability to analysts.

    The global industry has long been familiar with this problem: for almost 30 years there have been OLAP technologies designed specifically to let business analysts work with the accumulated data and participate directly in its analysis. Such analytical systems are the opposite of OLTP systems in the sense that they remove informational redundancy ("collapse" the information), whereas it is precisely the redundancy of the primary information that determines the effectiveness of the analysis. A DSS combining these technologies makes it possible to solve a number of problems:

    • Analytical tasks: calculating specified indicators and statistical characteristics of business processes on the basis of retrospective information held in data warehouses.
    • Data visualization: presenting all available information in user-friendly graphical and tabular form.
    • Obtaining new knowledge: determining the interrelationships and interdependencies of business processes on the basis of existing information (testing statistical hypotheses, clustering, finding associations and temporal patterns).
    • Simulation tasks: mathematical modeling of the behavior of complex systems over an arbitrary period of time; in other words, tasks that answer the question "What will happen if...?"
    • Control synthesis: determining acceptable control actions that ensure the achievement of a given goal.
    • Optimization problems: integrating simulation, management, optimization and statistical methods of modeling and forecasting.

    Enterprise managers using OLAP tools can, even without special training, independently and quickly obtain all the information needed to study business patterns, in a wide variety of combinations and cross-sections of business analysis. A business analyst sees in front of him a list of dimensions and indicators of the business system. With such a simple interface, the analyst can build any report and rearrange dimensions (for example, build cross-tabs, superimposing one dimension on another). In addition, he can create his own functions based on existing indicators and conduct "what-if" analysis, obtaining results by specifying dependencies between the indicators of business functions. The response time for any report typically does not exceed 5 seconds.
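    As a rough illustration of a cross-tab and a user-defined indicator, here is a small pandas sketch; the dimensions, measures and the 10% price-increase scenario are invented for the example.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "North"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "revenue": [120.0, 135.0, 90.0, 110.0, 60.0],
    "units":   [12, 13, 10, 11, 6],
})

# Cross-tab: one dimension on rows, another on columns, a measure in the cells.
print(pd.pivot_table(sales, values="revenue", index="region",
                     columns="quarter", aggfunc="sum", fill_value=0))

# A user-defined indicator built from existing measures.
sales["avg_price"] = sales["revenue"] / sales["units"]

# A simple "what-if": revenue if the average price were raised by 10%
# while sales volume stayed the same.
sales["revenue_if_price_up_10pct"] = sales["units"] * sales["avg_price"] * 1.10
print(sales[["region", "quarter", "revenue", "revenue_if_price_up_10pct"]])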

    OLTP and OLAP systems

    It was noted in the previous subsection that, for an adequate representation of the subject area and for ease of database development and maintenance, relations should be reduced to third normal form (there are higher normal forms, but in practice they are used quite rarely), i.e. they should be highly normalized. However, weakly normalized relations also have an advantage: if the database is accessed mostly with read queries, and data is modified or added very rarely, then retrieval is much faster, because in weakly normalized relations the join has in effect already been made and no processor time is spent on it. There are two classes of systems for which strongly and weakly normalized relations, respectively, are better suited.

    Highly normalized data models are well suited to OLTP applications (On-Line Transaction Processing - online transaction processing). Typical examples of OLTP applications are warehouse accounting systems, ticket booking systems, operational banking systems and others. The main function of such systems is to execute a large number of short transactions. The transactions themselves are quite simple; the problem is that there are very many of them, they run concurrently, and if an error occurs a transaction must be rolled back, returning the system to the state it was in before the transaction began. Almost all database queries in OLTP applications consist of insert, update and delete commands; select queries mainly provide users with data from various kinds of reference tables. Thus most of the queries are known in advance, at the system design stage. What is critical for OLTP applications is the speed and reliability of short data update operations, and the higher the level of data normalization, the faster and more reliable such an application usually is. Deviations from this rule are possible when, already at the development stage, certain frequently occurring queries are known that require joining relations and whose execution speed significantly affects the operation of the application.

    Another type of application is OLAP applications (On-Line Analytical Processing - online analytical data processing). This is an umbrella term covering the principles behind decision support systems (Decision Support System, DSS), data warehouses (Data Warehouse) and data mining systems (Data Mining). Such systems are designed to find dependencies between data, to conduct dynamic "what if..." analysis and to solve similar tasks. OLAP applications operate on large amounts of data accumulated in the enterprise or taken from other sources. Such systems are characterized by the following features:

    • new data is added to the system relatively rarely, in large blocks, for example once a month or quarter;
    • data added to the system is, as a rule, never deleted;
    • before loading, the data undergoes various preparatory procedures related to bringing it to certain formats and the like;
    • queries to the system are unregulated and quite complex;
    • the speed of query execution is important, but not critical.

    The databases of OLAP applications are usually presented as one or more hypercubes whose dimensions are reference data and whose cells store the actual values. Physically, a hypercube can be built on a special multidimensional data model (Multidimensional OLAP, MOLAP) or represented by means of a relational data model (Relational OLAP, ROLAP).

    In OLAP systems that use the relational data model (ROLAP), it is advisable to store the data as weakly normalized relations containing pre-calculated main totals. Data redundancy and the problems related to it are not an issue here, since the data is updated quite rarely and the totals are recalculated along with each data update.
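    A minimal sketch of this idea: alongside a denormalized detail table, a totals table is kept and simply recalculated whenever a new batch of data is loaded. The table layout and data here are invented for illustration.

import pandas as pd

# Denormalized detail relation (dimension attributes stored with the facts).
detail = pd.DataFrame({
    "month":   ["2024-01", "2024-01", "2024-02"],
    "region":  ["West", "East", "West"],
    "revenue": [100.0, 80.0, 120.0],
})

def rebuild_totals(detail_df):
    # Pre-computed main totals stored as an ordinary relation.
    return detail_df.groupby(["month", "region"], as_index=False)["revenue"].sum()

totals = rebuild_totals(detail)

# A new monthly batch arrives: append it and recalculate the totals.
new_batch = pd.DataFrame({"month": ["2024-03"], "region": ["East"], "revenue": [90.0]})
detail = pd.concat([detail, new_batch], ignore_index=True)
totals = rebuild_totals(detail)
print(totals)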

    The characteristics and range of tasks effectively solved by each technology are illustrated in the following comparative table:

    • Purpose of the system. OLTP: registration, operational search and processing of transactions, regulated analysis. OLAP: working with historical data, analytical processing, forecasting, modeling.
    • Stored data. OLTP: operational, detailed. OLAP: covering a long period of time, aggregated.
    • Data type. OLTP: structured. OLAP: of various types.
    • "Age" of the data. OLTP: current (a few months). OLAP: historical (accumulated over years) and projected.
    • Data update rate. OLTP: high, in small portions. OLAP: low, in large portions.
    • Level of data aggregation. OLTP: detailed data. OLAP: mainly aggregated data.
    • Predominant operations. OLTP: data entry, search, update. OLAP: data analysis.
    • How the data is used. OLTP: in predictable ways. OLAP: in unpredictable ways.
    • User interaction. OLTP: at the level of individual transactions. OLAP: at the level of the whole database.
    • Type of activity. OLTP: operational, tactical. OLAP: analytical, strategic.
    • Priorities. OLTP: high performance, high availability. OLAP: flexibility, user autonomy.
    • Users. OLTP: a large number of operational (executive) staff. OLAP: a relatively small number of managerial staff.

    Comparison of OLTP and OLAP

    • Nature of queries. OLTP: many simple transactions. OLAP: complex transactions.
    • Stored data. OLTP: operational, detailed. OLAP: covering a long period of time, aggregated.
    • Type of activity. OLTP: operational, tactical. OLAP: analytical, strategic.
    • Data type. OLTP: structured. OLAP: of various types.

    Comparison of an accounting system (OLTP) with an OLAP system

    • User interaction. OLTP: at the level of individual transactions. OLAP: at the level of the whole database.
    • Data involved in one user request. OLTP: individual records. OLAP: groups of records.
    • Response time. OLTP: seconds. OLAP: from a few seconds to a few minutes.
    • Use of hardware resources. OLTP: stable. OLAP: dynamic.
    • Nature of the data. OLTP: mainly primary data (the lowest level of detail). OLAP: mostly derived data (aggregate values).
    • Nature of database access. OLTP: predefined or static access paths and data relationships. OLAP: undefined or dynamic access paths and data relationships.
    • Data variability. OLTP: high (data is updated with every transaction). OLAP: low (data is rarely updated during a query).
    • Priorities. OLTP: high performance, high availability. OLAP: flexibility, user autonomy.

    Disadvantages
    OLTP systems are optimized for small, discrete transactions, but queries for aggregated information (for example, the quarterly dynamics of sales of a certain product model in a certain branch), which are typical of analytical applications (OLAP), generate complex table joins and full table scans. One such query can consume a great deal of time and computing resources, slowing down the processing of current transactions.

    A transaction is a group of sequential operations that constitutes a logical unit of work with data. A transaction can either be executed entirely and successfully, maintaining data integrity and independently of other transactions running in parallel, or not be executed at all, in which case it must have no effect. Transactions are processed by transactional systems, which maintain a transaction history as they operate.

    There are sequential (ordinary), parallel and distributed transactions. Distributed transactions involve more than one transactional system and require much more complex logic (for example, two-phase commit, the two-phase transaction commit protocol). Some systems also implement autonomous transactions, or sub-transactions, which are an autonomous part of a parent transaction.

    Example: it is necessary to transfer 10 monetary units from bank account number 5 to account number 7. This can be done, for example, by the following sequence of actions:
    Start transaction
    read the balance of account number 5
    reduce the balance by 10 monetary units
    save the new balance of account number 5
    read the balance of account number 7
    increase the balance by 10 monetary units
    save the new balance of account number 7
    End transaction
    These actions constitute the logical unit of work "transfer an amount between accounts" and thus form a transaction. If such a transaction is interrupted, for example in the middle, and the changes are not undone, it is easy to leave the owner of account number 5 short of 10 units while the owner of account number 7 never receives them.
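    A minimal sketch of this transfer as an atomic transaction, using Python's built-in sqlite3 module; the accounts table and its contents are made up for the example.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(5, 100.0), (7, 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        # Both updates belong to one transaction: either both happen or neither does.
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()   # undo the partial change, e.g. after a failure mid-transfer
        raise

transfer(conn, 5, 7, 10)
print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())
# [(5, 90.0), (7, 60.0)]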

    Online transaction processing mode OLTP

    The OLTP (On-Line Transaction Processing) mode of online transaction processing is used in organizational management information systems to reflect the current state of the subject area at any moment in time; batch processing occupies only a very limited niche.

    Typically, the analytical capabilities of OLTP systems are very limited; they are used to support the day-to-day activities of the corporation and work with data that is current at the given moment. Information systems of the OLTP class are designed for collecting, registering and entering source data relating to a particular subject area, for primary data processing, storage, adequate visualization, search, and the issuing of certificates and reports. Primary processing includes checking that the entered data is correct and complies with integrity constraints, identifying the objects the data describes, coding, and transmitting the data over horizontal and vertical links. Data is entered into the information system either from a document that has a certain legal force or directly from the place where the data originates; in the latter case, the document containing the entered data is printed by the system and given legal force.

    In OLTP systems, target transactions are executed against target databases (for example, inserting into a table a record with the parameters of an issued invoice, a posted waybill, or any other fact); these transactions change the state of the database and bring it into line with the current state of the fragment of the real world that the database models. Thus, the main purpose of target databases is transaction processing.

    Such systems are designed for the input, structured storage and processing of information in real time. OLTP systems make it possible to formulate queries such as how much, where, and so on. Providing data from constantly synchronized (updated) databases, operational systems do not track the dynamics of processes over long periods, perform practically no processing of the data (apart from certain calculations) and, most importantly, do not draw conclusions from the available data, leaving that function to the decision maker.

    OLTP applications cover a wide range of tasks in many industries: automation of bookkeeping, warehouse accounting, document management, and so on.

    The main function of such systems is to execute a large number of short transactions from a large number of users simultaneously. The transactions themselves look relatively simple, for example, "withdraw a sum of money from account A, add this sum to account B."

    Information systems of the OLTP class are characterized by the following features:
    - relative algorithmic simplicity;
    - high dynamics in the range and structure of the processed documents, which stems from the close proximity of these systems to the subject area;
    - the mass character and territorial distribution of the points where source data is collected;
    - high requirements for the reliability and currency of the entered data;
    - a mass user base, fairly high staff turnover, and relatively low computer qualifications of the personnel (users);
    - support for a large number of users;
    - short response time to queries;
    - relatively short queries;
    - a small number of tables involved in each query.

    Historically, such systems arose primarily because they met the needs for record keeping, speed of service, data collection, and so on. However, it soon became clear that collecting data is not an end in itself: the accumulated data can be useful, since information can be extracted from it.
    System development strategy
    For a long time, the following strategies were used for developing such systems:
    - building separate workstations intended for processing groups of functionally related documents, and replicating the ready-made workstations on site;
    - building full-featured parameterizable systems that are replicated and customized locally. However, systems obtained this way had a poor ability to adapt to the dynamics of the subject areas; they placed high demands on operating personnel and required large maintenance overheads.
    Relatively recently a new, third strategy for developing OLTP-class information systems has come into use. Its essence is as follows: what is replicated is not ready-made systems but templates and technological tools that allow a system with the necessary functionality to be built or completed quickly on site and then, using the same tools, modified in line with the dynamics of the subject area.

    Transactions are actions that are either performed completely or not performed at all. If a system failure occurs during a transaction, the database is returned to the state it was in before the transaction (rollback). All completed transactions are recorded in the transaction log; a transaction is considered complete when the corresponding record appears in the log.
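    A toy illustration of this journaling idea in Python: a change is applied and a commit record is appended to a log, and recovery replays only the transactions that reached the log. This is a deliberately simplified sketch, not how any particular DBMS implements its log; all names and data are invented.

# Toy transaction journal: a transaction "counts" only once its commit record is logged.
journal = []                        # persistent log of committed transactions
balances = {"A": 100, "B": 50}

def run_transaction(ops, crash_before_log=False):
    snapshot = dict(balances)       # remember the state in case of rollback
    for account, delta in ops:
        balances[account] += delta
    if crash_before_log:
        balances.clear(); balances.update(snapshot)   # rollback: no journal record, no effect
        return False
    journal.append(ops)             # the commit record makes the transaction durable
    return True

run_transaction([("A", -10), ("B", +10)])                            # committed and journaled
run_transaction([("A", -999), ("B", +999)], crash_before_log=True)   # rolled back, not journaled

# Recovery: rebuild the state by replaying only journaled (i.e. committed) transactions.
recovered = {"A": 100, "B": 50}
for ops in journal:
    for account, delta in ops:
        recovered[account] += delta
print(balances == recovered)   # True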

    OLTP technologies

    In practice, when communicating with representatives of enterprise information services, one often encounters a serious misunderstanding of the differences in the capabilities, purpose and role of technologies designed for collecting information - OLTP systems (On-Line Transaction Processing) - and technologies for analyzing information. Yet they differ significantly in functionality, and each is responsible for its own area within the information system.

    The scope of application of OLTP systems includes payments, accounting, reservations, banking and stock exchange operations.

    OLTP systems, while being a highly effective means of operational processing, turned out to be poorly suited to analytical processing tasks. This is caused by the following:
    1. using traditional OLTP systems one can build an analytical report and even a forecast of any complexity, but only one specified in advance; any step to the side, any unregulated request from an end user, as a rule, requires knowledge of the data structures and fairly high programming qualifications;
    2. much of the functionality required by operational systems is redundant for analytical tasks while at the same time failing to reflect the subject area; solving most analytical problems requires external specialized tools for analysis, forecasting and modeling, and the rigid structure of the databases prevents acceptable performance for complex selections and sorts, so a lot of time has to be spent organizing data gateways;
    3. unlike transactional systems, analytical systems do not require, and therefore do not provide, elaborate means of ensuring data integrity, backup and recovery; this not only simplifies the tools themselves but also reduces internal overhead and thus improves performance when retrieving data.

    The range of tasks effectively solved by each type of system can be determined from the comparative characteristics of OLTP and OLAP systems.

    Data in OLTP systems is organized primarily to support transactions such as:

    • registering an order entered from a cash register or via a web site;
    • placing an order for components when their quantity in the warehouse falls below a certain level;
    • tracking components during the assembly of the final product in production;
    • registering information about employees;
    • registering the identities of license holders, such as restaurant owners or drivers.

    Individual transactions accessing a relatively small amount of data complete quickly. OLTP systems are designed and optimized to process hundreds or thousands of transactions simultaneously.

    OLTP systems excel at capturing the data needed to support day-to-day operations. However, the data in them is organized differently than is necessary when the information is intended for managers to plan the work of their organizations. Managers often need summary information to analyze trends affecting their organization or group.

    Modern Data Warehousing Challenges
    Sharing data for specific purposes

    The development of Data Warehouse technology began with the need to separate the data used for operations from the data used for analytical purposes. The warehouse provides capabilities best suited for reporting. In addition, separating transactional users from reporting users, whose ad hoc queries can degrade the performance of operational systems, ensures optimal use of data infrastructure resources.
    Time value of data

    While a warehouse gives an organization a good platform for reporting and analysis, it typically falls short in real-time terms because of the age of the data it holds. Owing to technological limitations, warehouses are usually replenished overnight using batch data transfers: a batch program scans the entire source database looking for changes. Data that reaches the warehouse through this ETL approach is therefore always out of date (usually by about a day).

    As the volume of processed data grows, along with the number and variety of data processing systems, the time and complexity of the warehouse loading process increase. At the same time, globalization, round-the-clock system operation and tighter service agreements leave ever less room for batch operations. The combination of more data and competitive pressure creates significant challenges for an IT organization.

    Decisions made on yesterday's data no longer satisfy most organizations. Real-time decision making requires real-time data, and this places special demands on data integration for the warehouse.

    In addition, the results of analytical operations performed in the warehouse must be passed back to the OLTP systems the data came from. In this way analytical processing is centralized, and decisions made on the aggregated data in the warehouse are propagated to the appropriate OLTP systems.

    These trends are realized as follows:
    Real-time data integration for the Data Warehouse: receiving and transmitting data in real time from operational systems to the warehouse, making the data available for analysis.
    Active Data Warehouse: a real-time data warehouse complemented by Business Intelligence tools that process data and execute business decisions; the decisions are automatically transferred back to the OLTP systems, forming a closed processing loop.

    In striving to make the warehouse work in real time, success often depends on a wise choice of integration tool and data acquisition approach, which offers an opportunity to improve the quality and timeliness of information.
    Real-time Data Integration for Warehouse

    To support real-time integration, the batch approach to retrieving operational data must be replaced by processes that continuously monitor the state of the source systems, capture and transform data changes as they occur, and then load them into the warehouse as close to real time as possible. Continuous data collection makes it possible to analyze profit and price components over any time frame; trends can be analyzed at any chosen frequency and without delay.

    ETL is an ideal solution for the bulk loading of large volumes of data into the warehouse and also provides extensive data transformation capabilities. However, ETL operations are typically performed while updates to the source system are paused, to ensure that the source is not modified while the data is being extracted. This, in turn, leads to inconsistencies between the OLTP systems and the warehouse; as a result, data and applications are not always available to business users.

    EAI solutions, originally designed for application integration, today often compete or coexist with ETL technologies as tools for integration and real-time data acquisition. EAI solutions transfer information between source and target systems, guarantee data delivery, provide advanced flow support, and simplify key elements of the transformation.

    However, EAI technology imposes limitations on data volumes, since the original premise of the method was the integration of applications rather than data: its essence is to invoke applications and pass them instructions and messages. Nevertheless, the ability to move information in real time and to maintain its integrity during integration makes EAI technology suitable in some cases for the exchange between operational systems and an active warehouse.

    Another approach to real-time data integration is transactional data management (TDM), a technology designed to capture, transmit, transform, deliver and verify transactional data in a heterogeneous environment in real time. TDM operates on completed transactions: it selects them from the OLTP systems, applies basic transformations, and transfers them to the warehouse. The technology is asynchronous by architecture but provides synchronous behavior, operating with sub-second latency while preserving the integrity of the data within each transaction.

    EAI and TDM are designed to convey changes and updates to data rather than complete data extracts. Neither requires the source systems to be suspended, because these technologies preserve the integrity of data manipulation language (DML) operations. This significantly reduces the amount of data that has to be moved. And while ETL tools are primarily designed for the initial loading and transformation of data, EAI and TDM are better suited to continuous data collection.

    An increasing number of companies are using TDM technology to collect data for the Warehouse. TDM tools capture, route, deliver, and validate data operations across heterogeneous database environments with sub-second latency.

    Transferring changed data at the transaction level allows the system to operate in active mode and to process operations simultaneously with loading the warehouse. Dependence on the batch processing window is completely eliminated, and the integrity of each transaction is preserved.

    Integration of the warehouse with OLTP systems involves receiving and transmitting transactional data to the warehouse while simultaneously transferring decisions made on the warehouse data back to one or more operational systems. This closed-loop operation is also provided by TDM tools.
    Main characteristics and capabilities of integration tools

    TDM integration tools have a number of important functional features.

    Data collection

    Data collection modules are installed at the source database and constantly monitor all newly arriving transactions. This is achieved by reading the changes from the transaction logs while the transactions are still in progress and typically still in memory. Data is processed at the transaction level, and only completed operations are sent to the warehouse.
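    The sketch below mimics this kind of change capture in plain Python: a reader polls a simulated transaction log, forwards only committed changes to a staging area, and remembers the last position it has processed. The log format and field names are invented for the illustration and do not correspond to any specific TDM product.

# Simulated source transaction log: (sequence number, committed flag, change payload).
source_log = [
    (1, True,  {"table": "orders", "op": "insert", "row": {"id": 101, "amount": 50.0}}),
    (2, False, {"table": "orders", "op": "insert", "row": {"id": 102, "amount": 75.0}}),
    (3, True,  {"table": "orders", "op": "update", "row": {"id": 101, "amount": 55.0}}),
]

staging_area = []      # intermediate storage area of the warehouse
last_processed = 0     # position of the last log entry already examined

def capture_changes():
    """Forward committed changes that appeared after the last processed position."""
    global last_processed
    for seq, committed, change in source_log:
        if seq > last_processed:
            if committed:
                staging_area.append(change)   # deliver to the warehouse staging area
            last_processed = seq

capture_changes()
print(staging_area)    # only the committed changes (seq 1 and 3) are delivered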

    Data delivery

    All new data is transferred to the intermediate staging area of the data warehouse with a delay of a fraction of a second. This means the most current data is always available for advanced Business Intelligence techniques, reporting and decision making. Since smaller portions of data are transmitted over a given period of time than with batch transfers, the additional load on the OLTP system is very small.

    Heterogeneity

    The data warehouse does not necessarily run on the same operating system or DBMS as the OLTP system; moreover, data often has to be collected from several operational systems and databases. Integration tools must therefore support a wide range of DBMSs and platforms, easing the requirements of even the most heterogeneous IT infrastructures. An organization can then choose platforms according to its corporate standards and preferences and evolve them with minimal impact on the finished data warehousing solution.

    The data collected by the integration tool is converted into a platform- and DBMS-independent format. This preserves heterogeneity and eliminates the risk of data loss or corruption if a source or target system goes down.

    Data selectivity

    Integration tools transfer only the data that is actually required in the warehouse. A typical OLTP system contains fields that are specific to the application the database serves, and not all of them are needed in the warehouse. The integration tool must therefore make it possible to identify the columns to be extracted from the source databases and transferred to the warehouse.

    In addition, depending on the user's criteria, particular rows can be selected from the source system database, for example to separate data by geography or to select the products relevant to the target warehouse.

    Data Conversion

    Selectivity in data transfer is important, but the data still has to be transformed, normalized or denormalized depending on the target system. Because the OLTP database and the warehouse have different data models and object structures, columns in the source system may have to be converted to match the columns in the target system; in some cases several columns from different source rows must be merged into a single row, and vice versa. For complex transformations, user exit points are provided so that any organization-specific rules for populating the data warehouse can be implemented.
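    A small sketch of such a column-mapping and transformation step in Python; the source record layout, the target column names and the rules are all invented for the example.

from datetime import datetime

# Hypothetical mapping from source (OLTP) columns to warehouse columns.
COLUMN_MAP = {"cust_id": "customer_key", "ord_dt": "order_date", "amt": "amount"}

def to_warehouse_row(source_row):
    """Rename columns, normalize formats and merge fields for the warehouse."""
    row = {COLUMN_MAP[k]: v for k, v in source_row.items() if k in COLUMN_MAP}
    # Normalize the date format used by the source system.
    row["order_date"] = datetime.strptime(row["order_date"], "%d.%m.%Y").date().isoformat()
    # Merge two source fields into a single target column (denormalization).
    row["customer_name"] = f'{source_row["first_name"]} {source_row["last_name"]}'
    return row

source_row = {"cust_id": 42, "ord_dt": "05.03.2024", "amt": 99.5,
              "first_name": "Ivan", "last_name": "Petrov", "internal_flag": "X"}
print(to_warehouse_row(source_row))
# {'customer_key': 42, 'order_date': '2024-03-05', 'amount': 99.5, 'customer_name': 'Ivan Petrov'}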

    Flexibility

    The ability to quickly and easily enable new database sources or target systems, including data capture and delivery processes, is important.

    Dynamic table definition

    So that the operation of the warehouse is not interrupted, it is designed to adapt quickly to possible changes in the databases. The definitions of source and target tables change either with new software versions or with changing requirements on the warehouse. Dynamic definition of table schemas is possible by means of parameter files: changes to source or target tables can thus be made quickly, without upgrading software or rendering systems obsolete.

    Feedback

    An active warehouse transfers data when certain conditions or rules are met, and a complex operation may involve updating records in the OLTP system. For example, a fraud detection system may flag suspicious transactions and change the status of a user's account in the warehouse; this status change can be picked up by the integration tool and transmitted to the appropriate online transaction processing system. Returning information to an OLTP system is essential for any closed-loop application, as well as for sending information simultaneously to reporting environments, data warehouses, backups or other target systems.
    Combining technologies

    When integrating the data warehouse with OLTP systems, TDM and ETL processes can be combined, including real-time processing and continuous capture and extraction of data at the transaction level. TDM tools can transfer data in real time to the staging layer of the target database, where the ETL server intercepts the data and, after applying transformations, loads it into the warehouse. This approach has drawbacks (including additional latency and the need to maintain an ETL server), but they are justified when the data transformation requirements are too complex.

    The benefit is that new transactional data is captured immediately, with very little impact on the performance of the OLTP system compared with a conventional ETL process.


    Characteristics of an OLTP system:
    - large volume of information;
    - often separate databases for different departments;
    - normalized schema, no duplication of information;
    - intensive data changes;
    - transactional mode of operation;
    - transactions affect a small amount of data;
    - processing of current data (a snapshot);
    - many clients;
    - short response time - a few seconds.

    Characteristics of an OLAP system:
    - large volume of information;
    - information from various databases synchronized using common classifiers;
    - non-normalized database schema with duplicates;
    - data changes rarely, and changes occur through batch loading;
    - complex ad hoc queries over large volumes of data with extensive use of groupings and aggregate functions;
    - analysis of time dependencies;
    - a small number of working users - analysts and managers;
    - longer, but still acceptable, response time - a few minutes.

    Codd's rules for relational databases: 1. The information rule. 2. The guaranteed access rule. 3. Systematic treatment of null values. 4. A dynamic online catalog based on the relational model. 5. The comprehensive data sublanguage rule. 6. The view updating rule. 7. High-level insert, update and delete. 8. Physical data independence. 9. Logical data independence. 10. Integrity independence. 11. Distribution independence. 12. The nonsubversion rule.

    Codd's rules for OLAP: 1. Multidimensional conceptual view. 2. Transparency. 3. Accessibility. 4. Consistent reporting performance. 5. Client-server architecture. 6. Generic dimensionality. 7. Dynamic handling of sparse matrices. 8. Multi-user support. 9. Unrestricted cross-dimensional operations. 10. Intuitive data manipulation. 11. Flexible reporting. 12. Unlimited dimensions and aggregation levels.


    Implementation of OLAP. Types of OLAP servers:
    - MOLAP (Multidimensional OLAP): both detailed data and aggregates are stored in a multidimensional database;
    - ROLAP (Relational OLAP): detailed data is stored in a relational database; aggregates are stored in the same database in specially created service tables;
    - HOLAP (Hybrid OLAP): detailed data is stored in a relational database, and aggregates are stored in a multidimensional database.

    Features of ROLAP (star schema):
    1. one fact table, which is highly denormalized;
    2. several dimension tables, which are also denormalized;
    3. the primary key of the fact table is composite, with one column per dimension;
    4. aggregated data is stored together with the original data.
    Disadvantage: if aggregates are stored together with the source data, an additional parameter - the hierarchy level - has to be used in the dimensions.

    [Diagram: storage structure in the ORACLE DBMS - SQL client, MOLAP client, Java API, JDBC, OCI, ODBC, OLE DB; CWM or CWM2 metadata registration; OLAP storage (a BLOB in a relational table); star schema; multidimensional core (a process in the ORACLE kernel); OLAP DML; SQL interface to OLAP (DBMS_AW, OLAP_TABLE, ...); multidimensional metadata.]

    Comparison of Normalized and Unnormalized Models

    Analysis of criteria for normalized and non-normalized data models

    Let's put together the results of an analysis of the criteria by which we wanted to evaluate the impact of logical data modeling on the quality of physical data models and database performance:

    As can be seen from the table, more strongly normalized relations come out better designed (three pluses, one minus): they correspond more closely to the subject area, are easier to develop, and database modification operations are performed faster for them. This, however, is achieved at the cost of some slowdown in data retrieval operations.

    The only advantage of weakly normalized relations is that, if the database is accessed only with retrieval queries, such queries execute faster: in weakly normalized relations the join has in effect already been made, so no time is spent on it when the data is retrieved.

    Thus, the choice of the degree of normalization of relations depends on the nature of the queries with which the database is most often accessed.

    It is possible to identify certain classes of systems for which strongly or weakly normalized data models are more suitable.

    Highly normalized data models are well suited to so-called OLTP applications (On-Line Transaction Processing - online transaction processing). Typical examples of OLTP applications are warehouse accounting systems, ticket booking systems, banking systems executing money transfers, and so on. The main function of such systems is to perform a large number of short transactions. The transactions themselves look relatively simple, for example, "withdraw a sum of money from account A, add this sum to account B." The problem is that, first, there are very many transactions; second, they execute concurrently (several thousand users may be connected to the system at once); and third, if an error occurs, the transaction must be rolled back completely, returning the system to the state it was in before the transaction began (there must not be a situation where money has been withdrawn from account A but has not arrived in account B). Almost all database queries in OLTP applications consist of insert, update and delete commands; select queries mainly let users pick values from various reference tables. Most of the queries are thus known in advance, at the system design stage. Speed and reliability of short data update operations are therefore critical for OLTP applications, and the higher the level of data normalization in an OLTP application, the faster and more reliable it tends to be. Deviations from this rule are possible when, already at the development stage, certain frequently occurring queries are known that require joining relations and whose execution speed significantly affects the operation of the application; in that case normalization can be sacrificed to speed up such queries.



    Another type of application is the so-called OLAP applications (On-Line Analytical Processing - online analytical data processing). This is an umbrella term covering the principles behind decision support systems (Decision Support System, DSS), data warehouses (Data Warehouse) and data mining systems (Data Mining). Such systems are designed to find dependencies between data (for example, to try to determine how the sales volume of goods is related to the characteristics of potential buyers) and to conduct "what if..." analysis. OLAP applications operate on large amounts of data already accumulated in OLTP applications or taken from spreadsheets and other data sources. Such systems are characterized by the following features:

    • New data is added to the system relatively rarely, in large blocks (for example, data on quarterly sales results is loaded from an OLTP application once a quarter).
    • Data added to the system is usually never deleted.
    • Before loading, the data undergoes various "cleaning" procedures, because a single system may receive data from many sources that use different representation formats for the same concepts, and the data may be incorrect or erroneous.
    • Queries to the system are unregulated and, as a rule, quite complex. Very often a new query is formulated by an analyst to clarify the result of a previous one.
    • The speed of query execution is important, but not critical.

    Data in OLAP applications is typically represented as one or more hypercubes whose dimensions are reference data and whose cells store the actual figures. For example, one can build a hypercube whose dimensions are time (in quarters and years), product type and company branch, and whose cells store sales volumes. Such a hypercube contains sales data for different types of goods broken down by quarter and by division. Based on this data, one can answer questions like "which division has the best sales volumes this year?" or "how do the sales trends of the Southwest region divisions this year compare with the previous year?"

    Physically, a hypercube can be built on a special multidimensional data model (MOLAP - Multidimensional OLAP) or represented by means of a relational data model (ROLAP - Relational OLAP).

    Returning to the problem of data normalization: in OLAP systems that use the relational data model (ROLAP), it is advisable to store data as weakly normalized relations containing pre-computed main totals. The high redundancy and the problems associated with it are not a concern here, because updates occur only when a new portion of data is loaded; at that point new data is added and the totals are recalculated.