
Full support of PostgreSQL engine comes to Logica

Tuesday, September 26, 2023

Logica is a logic programming language designed for intuitive and efficient data manipulation, which we open sourced in 2021. It compiles to SQL, providing access to the power of SQL engines with the convenience of a logic programming syntax.

When it was open sourced, Logica's only fully supported engine was BigQuery, a powerful data warehouse that executes queries with high parallelization and processes terabytes of data within seconds.

Modern machines can store and process significant amounts of data, even within a single computer, so relational SQL databases are as popular as ever. They hold a lot of data, and analyzing that data is important. Among open source options, PostgreSQL and SQLite are some of the most popular database engines (example1, example2). Logica added support for SQLite in 2021.

Now we are pleased to announce a new release of Logica that adds support for PostgreSQL.

As Logica compiles to SQL, it is natural to extend the language to use PostgreSQL as the engine. However, there are nuances in the PostgreSQL dialect that require addressing. The biggest distinction is that PostgreSQL requires the types of records to be spelled out explicitly in your query, while BigQuery determines the types automatically.

For example, consider a Logica predicate where for each user we collect a list of records with information about their purchases.

UserPurchases(
    user_id:,
    user_name:,
    purchases? List= {item_name:, item_price:}) distinct :-
  Purchase(purchase_id:, user_id:, item_name:, item_price:),
  UserInfo(user_id:, user_name:);


We can translate this Logica predicate to GoogleSQL to run on BigQuery as follows:

SELECT
  user_id,
  user_name,
  ARRAY_AGG(STRUCT(item_name as item_name, item_price as item_price)) as purchases
FROM
  Purchase INNER JOIN
  UserInfo USING (user_id)
GROUP BY 1, 2;

Logica's record {item_name:, item_price:} simply compiles into GoogleSQL's STRUCT(item_name as item_name, item_price as item_price).

However, in the PostgreSQL dialect, composite types must be explicitly defined and specified. In our example, we need to define the type PurchaseRecord with fields item_name and item_price. We should also specify in the query that the purchases column aggregates records of type PurchaseRecord. Thus the PostgreSQL query for our predicate would be written like so:

CREATE TYPE PurchaseRecord as (item_name text, item_price numeric);

SELECT
  user_id,
  user_name,
  ARRAY_AGG(ROW(item_name,
                item_price)::PurchaseRecord) AS purchases
FROM
  Purchase INNER JOIN
  UserInfo USING (user_id)
GROUP BY UserInfo.user_id, UserInfo.user_name;


Records and lists are also useful as intermediates in calculations, even if the input and output data are normalized. For example, say we have a table called ItemSales and want to find the most sold items in each of the stores that the table describes. Specifically, we want to assemble a table with information about the top three most sold items in each store. For each of those items, we also want to list the department of the store where the item is sold. This can be achieved intuitively using the ArgMax3 aggregate function, which accumulates all the information about the items that we need, with no extra join required.


# Collecting information of top 3 most sold items for each store.
StoreTopItemsCollection(store) ArgMax3= {item:,
                                         department:} -> sales_volume :-
  ItemSales(store:, item:, department:, sales_volume:);

# Flattening top items collection.
StoreTopItems(store:, item:, department:) :-
  {item:, department:} in StoreTopItemsCollection(store);
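
For intuition, here is a hand-written PostgreSQL query that computes the same flattened StoreTopItems result using a window function instead of the record-valued intermediate. This is a sketch for comparison only, not the SQL that the Logica compiler actually emits.

-- Hand-written sketch: top three items per store by sales volume.
-- The SQL generated by the Logica compiler differs; this is for comparison only.
WITH ranked AS (
  SELECT store, item, department,
         ROW_NUMBER() OVER (PARTITION BY store
                            ORDER BY sales_volume DESC) AS sales_rank
  FROM ItemSales
)
SELECT store, item, department
FROM ranked
WHERE sales_rank <= 3;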


To support the PostgreSQL engine, we extended the Logica compiler with type inference. Logica now infers data types for all expressions that a user employs. For records and arrays, Logica specifies their types in the produced SQL, just as PostgreSQL requires. Commands to create the necessary types are produced as part of the compiled SQL. In this Colab, we show an example of a program that writes a PostgreSQL table, and in this Colab, we show how to give type hints when the program does not have enough information for complete inference.

As a byproduct of type inference, we were able to improve error messages. Now that we know the types, we can show the user where a mistake was made within the Logica program, rather than leaving them to debug the generated SQL statement.

PostgreSQL is a popular and powerful engine. It is easy to start your own instance (maybe just in Colab!) or use a serverless option. We are excited to provide users of Logica with the option to run on Postgres. If you already use PostgreSQL, we encourage you to give Logica a try; it is a joy to write data analysis with logic programming! If you have any feedback or questions, please share them in the discussion section of the Logica repository.

By Evgeny Skvortsov, Software Engineer – Google

El Carro for Oracle: Data migration and improved backups

Wednesday, August 10, 2022

In May 2021, we released El Carro to make it easier to run Oracle databases on Kubernetes. A follow-up blog post dove deeper into El Carro's features, announced support for Oracle 19c, and detailed more flexibility for building database images. Today we're excited to open source two new features that enhance El Carro and make it easier to manage your Oracle deployments: Data Migration and Point-in-Time Recovery. Automated Data Migration makes it much easier to re-platform to El Carro, and Point-in-Time Recovery is a standard feature that database professionals have come to expect, because it enables you to drastically reduce your recovery point objective (RPO) and worry less about backup frequency.

Data Migration

The Data Migration feature of El Carro enables users to migrate data from their existing database to an El Carro database running on Kubernetes. This functionality allows users to re-platform to El Carro with minimal disruption. The two most common pathways, shown in the image below, are to 1) modernize in place by migrating your database to Kubernetes so you can leverage the automation of El Carro, or 2) migrate to Kubernetes in the cloud.
Migrate your database in place or to the cloud

Typical migration sources include AWS (RDS, EKS, EC2), Azure (AKS, VMs), GCP (GKE, GCE, BMS) or on-premises deployments. Typical migration targets include any Kubernetes installation on GCP (GKE), AWS (EKS), Azure (AKS), or on-premises.

The Data Migration feature offers two automated migration flows and two manual ones.

Category  | Options                          | Migration Downtime | Complexity
Automated | Data Guard with physical standby | minimal            | lowest
Automated | Data pump                        | long               | low
Manual    | RMAN-based migration             | long               | medium
Manual    | Data pump                        | long               | medium

  • Migration Downtime: required downtime to migrate source database into El Carro without data loss. Minimal means less downtime, long means more downtime.
  • Complexity: summarizes the difficulty and complexity of the migration journey.

Point-in-Time Recovery

Since its release, El Carro has provided users the ability to take backups and restore via RMAN or storage-based snapshots. Today we're excited to release a new Point-in-Time Recovery feature to enhance El Carro's backup functionality by automatically backing up archived redo logs to a GCS (Google Cloud Storage) bucket and allowing users to seamlessly restore their databases to any point in time within a user-configurable window. This optional feature provides an additional layer of protection and enhanced restore granularity without interfering with manual backups or affecting database performance.

The diagram below contrasts the new versus old functionality. Previously, there were discrete restore points (shown in green on the top arrow) which represented limited opportunities to restore. With Point-in-Time Recovery, the entire arrow is green, meaning the recovery functionality is continuous, with restore points at any time along the green arrow.

With Point-in-Time Recovery you can restore to any point after the first backup

Conclusion

As always, you can try the open source El Carro operator for free (Apache 2.0 license) wherever you run Oracle databases. Follow the quick start guide and try out provisioning of instances, databases, and users. Import data via Data Pump, manage instance parameters, choose between different methods for backups, and try out a restore. Have a look at how we integrate with external logging and monitoring solutions. Reach out via our Google group and leave feedback for what features you would like to see next, or even create your own patch, issue, or pull request on GitHub.

By Kyle Meggs, Product Manager and Ash Gbadamassi, Software Engineer – Cloud Databases

Four areas of open source contributions from Cloud Databases

Tuesday, October 26, 2021

Open Cloud enables you to develop software faster, innovate more easily, and scale more efficiently—while also reducing technology risk. Google has a long history of leadership in open source, and today I want to look back at our activities around open source projects for databases over the past year.

Give developers the best tools to be efficient

Developers choose to build applications with managed database services on Google Cloud to benefit from velocity, scalability, security, and performance. To enable you to be most efficient and deliver your best possible work, we deliver tools and frameworks that work with your preferred development environments, whether you develop in the cloud or on premises. To make local testing, building, and continuous integration easier for our cloud-native databases, we released emulators for Cloud Spanner, Firestore, and Cloud Bigtable so that you can test your code wherever you develop it - without the need to create or re-create cloud infrastructure with every test run.

Another area where we are helping developers is with instrumentation of Cloud SQL for easier debugging and performance tuning. With Cloud SQL Insights, it is easier than ever to pinpoint underperforming SQL statements. That said, without additional instrumentation, it can be cumbersome to identify the source code or microservice that issued that SQL - let alone tie a SQL statement to a client session and its context. So we released Sqlcommenter as an open source library that automatically adds this instrumentation as SQL comments to queries generated by popular ORMs like Hibernate, Django, SQLAlchemy, and others (repo, blog). We didn't stop there, but merged Sqlcommenter with OpenTelemetry (blog) to add SQL insights from instrumented queries back to OpenTelemetry traces.
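
As an illustration, an instrumented query carries its context in a trailing SQL comment along these lines. The keys and values below are hypothetical and depend on the ORM, framework, and tracing setup.

-- Hypothetical example of a Sqlcommenter-instrumented query; the exact
-- key set depends on the ORM, framework, and tracing configuration.
SELECT * FROM users WHERE id = 42
/*application='checkout-service',controller='users',framework='django',
  route='users/detail',traceparent='00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'*/;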

Lastly, we want to broaden access to our differentiated offerings, like Spanner. The recently announced Spanner PostgreSQL interface allows organizations to access Spanner's industry-leading consistency and availability at scale using tools and skills from the popular PostgreSQL ecosystem. This new way of working with Spanner provides familiarity for developers and portability for administrators (blog). Learn more in the documentation or sign up for the preview today.

Provide connectivity that is simple and secure

Connecting to APIs and databases from an application running in the cloud should be simple and secure. That's why we recommend using IAM and Application Default Credentials when authenticating to other services. The Cloud SQL Proxy (repo) has been doing this, and also setting up firewalls for you, for a while. It works by running a local client either inside your VM or in a GKE cluster. This year, we added libraries for Java (repo) and Python (repo) that provide similar functionality without the overhead of running an extra client such as the proxy.

Cloud Spanner also offers an open source adapter for its new PostgreSQL interface (repo). This local proxy allows tools, starting with psql, to connect to a Spanner database using the PostgreSQL wire protocol.


Manage cloud infrastructure with the tools of your choice

When it comes to provisioning, monitoring, and managing your cloud database services, flexibility and choice are important. We provide you with our cloud console, the gcloud CLI, and APIs, as well as our own Deployment Manager. That said, you may prefer different ways to manage cloud infrastructure - whether through interactive tools, scripts, or CI/CD pipelines that support GitOps or other controls, checks, and balances. Terraform is one of those open tools that is very popular - and we ensure that our cloud databases can be managed from it, as documented in this blog about creating Spanner instances with Terraform.

If you manage the majority of your resources with Kubernetes, either directly or through package managers like Helm, then our Kubernetes Config Connector (KCC) might be for you. In a nutshell, KCC exposes Google Cloud services such as Cloud SQL, Spanner, and others as Custom Resources in Kubernetes. This allows you to create and reconcile cloud resources that live outside of Kubernetes just like native K8s objects.

Once you are managing cloud infrastructure with CI/CD, the next step is to extend that same mechanism to manage objects within your databases, such as tables, indexes, and views. To that end, we have released a Liquibase extension for Cloud Spanner.
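
As a rough sketch of what that looks like in practice, the changelog below uses Liquibase's formatted-SQL changelog style to carry a Cloud Spanner DDL statement. The table and changeset names are made up, and the Spanner extension's own samples may use XML changelogs instead, so treat this as illustrative of the changelog idea rather than the extension's canonical format.

--liquibase formatted sql

--changeset alice:create-singers-table
-- Hypothetical Spanner DDL tracked as a Liquibase changeset.
CREATE TABLE Singers (
  SingerId  INT64 NOT NULL,
  FirstName STRING(1024),
  LastName  STRING(1024),
) PRIMARY KEY (SingerId);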

Help you to move data with confidence

Cloud journeys often involve moving data either in a lift and shift process or sometimes replatforming to a different database. Whatever your journey, we want to simplify the process and give you the confidence that your migration is successful.

For enterprise users with Oracle databases, we have several open source projects. First, we have the Optimus Prime database assessment tool (repo) that queries your database and collects information about schemas and historic performance to be analyzed for migration complexity and consolidation potential. Our own professional services teams have been using this toolset to plan migrations to Bare Metal Solution for Oracle.

Some Oracle users are looking for opportunities to transform their workloads to fit with their bigger strategy of modernizing applications with Kubernetes. For this group we developed and open sourced the El Carro Kubernetes operator for Oracle. This not only automates database lifecycle tasks for systems running on Kubernetes, but also exposes declarative APIs for these operations.

If your application supports replatforming from Oracle to PostgreSQL, then we have a toolset for schema conversion along with dataflow pipelines that will read the output of a change data capture job and load it into a PostgreSQL database. What a great use-case for Datastream - our new serverless change data capture service.

Another case of heterogeneous database migration is to move MySQL or PostgreSQL databases to Cloud Spanner. HarbourBridge helps with the evaluation and data migration, and our latest contribution was adding support for DynamoDB as a source database. Part of every heterogeneous migration should be to validate that the source and target data match - we have released the Data Validation Toolkit (DVT) for that use case. DVT can connect to a number of source and target databases and compare the data on each side, giving you the confidence that your migration did not miss or change any records.
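
Conceptually, that validation boils down to running the same aggregations against the source and the target and checking that the results agree. The hand-rolled sketch below illustrates the idea; DVT automates and extends this kind of check across engines, and the table and column names here are made up.

-- Run against the source and the target separately, then compare the rows.
-- Illustrative only; DVT automates this kind of check across engines.
SELECT COUNT(*)        AS row_count,
       SUM(amount)     AS amount_sum,
       MIN(order_date) AS first_order,
       MAX(order_date) AS last_order
FROM orders;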

Conclusion

Whether you are migrating existing databases or you are building your next application in the cloud - we want to make your journey as comfortable and seamless as possible. Open source projects play a big role in meeting you where you are and providing you with the connectivity options, language support, and tools you want for management and migrations.

By Bjoern Rost, Product Manager, Google Cloud Databases

Modernizing Oracle operations with Kubernetes and El Carro

Thursday, May 13, 2021

Google Cloud is releasing El Carro, an open source tool to help you transform and modernize your Oracle database operations. El Carro implements the Kubernetes operator pattern to deliver automation for provisioning and ongoing operations like backups, patching, and high availability for databases running in hybrid and multi-cloud environments. And it does so using the same declarative syntax that DevOps teams are using to manage applications. With El Carro, users can choose to modernize and transform their database operations in place and benefit from a consistent management experience and hybrid and multi-cloud portability. Released under the Apache License 2.0, you are free to use El Carro in any Kubernetes environment—you are in control.

Containers and Kubernetes deliver portability on standardized infrastructure, and today Oracle supports databases running in containers; they've also released container build files, images, and Helm charts to simplify provisioning. What is missing for the next level of integration is support for lifecycle operations and an extension of the Kubernetes API to the primitives needed for database management.

In addition, fully managed or autonomous services for Oracle may not make all the required features available, such as Active Data Guard, Multitenant, and In-Memory, or control over parameters/flags, versions, and patch levels. DBAs also find themselves locked out of many roles, including sysadmin and root. These restrictions make many cloud architects fall back to lifting and shifting Oracle databases onto infrastructure-as-a-service offerings and miss out on opportunities to modernize and transform database operations. And with transactional databases growing in number and criticality, organizations are struggling to deliver innovation and modernization. Engineers are already busy keeping up with sprawl and mundane operational tasks while adhering to strict change management processes.

How do we solve this database operations gap?

El Carro solves this. It is built with scalability in mind, using the same container orchestration infrastructure, Kubernetes, that powers many businesses and is a top choice for modern architectures. Its open API allows you to manage your database configurations as declarative code, enabling CI/CD or Gitops workflows for auditability and control mechanisms. El Carro automates many database lifecycle operations, like backups, replication, and patching. And, when it distributes databases on the nodes of a cluster, it is aware of the priority and resource requirements of each database to optimize tight packing while respecting quality of service. Lastly, it helps DBAs by delivering automation without restrictions and leaving DBAs in full control over their systems. You can choose to let the operator drive for you, but you can also take over the steering wheel yourself at any time.

Because Kubernetes is now the standard for portable infrastructure automation and orchestration, engineers appreciate how Kubernetes abstracts complex problems into manageable infrastructure as code. Kubernetes can scale from small projects to large ones that support the infrastructure powering Google products and services for billions of users around the world. Moreover, Google is pioneering the next generation of infrastructure as code, which we refer to as Configuration as Data, to declaratively establish a contract between developer intent and the runtime operation. According to the Cloud Native Survey 2020, two-thirds of respondents were either already running stateful workloads in production or were considering doing so within the next 12 months. We expect that datastores are going to drive the next wave of enterprise Kubernetes adoption.

A number of open source operators for databases, such as PostgreSQL, MySQL, and many others, have been released, are actively maintained by the community, and are popular among developers and architects looking for a hands-off approach to manage databases with their applications. El Carro extends the list of database operators to include Oracle.

What are we building with El Carro for Oracle databases?

The operator pattern emerged in late 2016 as an extension of the Kubernetes API and control loop aimed at automating more complicated and application-specific tasks that are beyond the native Kubernetes objects.

El Carro implements a custom resource definition (CRD), which is tailored to database management. Users set and change attributes of the custom resource using the Kubernetes API the same way they do for built-in objects such as pods, deployments, or services. The El Carro controller observes changes to the CRs and compares the declared state with the current reality in the cluster, then makes the necessary changes. Those changes could either affect the Kubernetes resources used by the database such as persistent volumes or the pod itself, or may result in issuing calls via SQL or command line tools to the database to create and modify users or other database objects.

Here’s a look at how this works:
El Carro Architecture

The diagram above shows how the major components of a database managed by the El Carro Operator interact with each other. The controller monitors the CRD for any changes made by admins. It creates and manages the cluster resources that make up the actual database deployment: persistent volumes for filesystems and data, a pod to run containers with the actual database, and a daemon that allows the controller to securely run SQL commands on the database. And lastly, a service makes sqlnet connections available to applications and end users that can either run in the same Kubernetes cluster or outside of it.

At release time, the El Carro Operator can provision Oracle databases of 12c Enterprise Edition and 18c Express Edition. It manages instance parameters, pluggable databases, and users. You can take and restore backups using either RMAN or storage snapshots, and we are working to add additional features.

How to get involved with El Carro?

In the development process, we collaborated with users and partners in the Oracle community to help us validate the approach. "Pythian has helped Oracle users to automate and optimize the operations of their mission-critical systems for over 20 years," says Simon Pane, principal consultant at Pythian. "We are excited about the possibilities that El Carro brings to users on their cloud modernization journeys. We are proud to work with the community on a vision for the future of database management."

Sean Scott covers Docker for databases on his blog oraclesean.com, and says: "There are many benefits to running Oracle databases in containers. Adding Kubernetes orchestration introduces new opportunities to bring the DevOps and Oracle communities together."

You can try out El Carro today. Follow the quick start guide and try out provisioning of instances, databases, and users. Import data via Data Pump, manage instance parameters, choose between different methods for backups, and try out a restore. Have a look at how we integrate with external logging and monitoring solutions. Reach out via our Google group and leave feedback for what features you would like to see next, or even create your own patch and pull request on GitHub.

By Bjoern Rost - Product Manager and Boris Dali - Team Lead, Engineering

Logica: organizing your data queries, making them universally reusable and fun

Monday, April 12, 2021

We present Logica, a novel open source logic programming language. A successor to Yedalog (a language developed at Google earlier), it is a Datalog-like logic programming language. Logica code compiles to SQL and runs on Google BigQuery (with experimental support for PostgreSQL and SQLite), but it is much more concise and supports the clean and reusable abstraction mechanisms that SQL lacks. It supports modules and imports, it can be used from an interactive Python notebook, and it even makes testing your queries natural and easy.

“Data is the new oil”, they say, and SQL is so far the lingua franca for working with data. When SQL (or “Structured English Query Language”, as it was first named) was invented in the 1970s, its authors might not have imagined the popularity that it would reach half a century later. Today, systems ranging from tiny smart watch applications to enterprise IT solutions, read and write their data using SQL. Even the browser that you are using to read this post now might have a working built-in SQL database in it.

Despite its widespread adoption, SQL is not flawless. Constructing statements from long chains of English words (which are often capitalized to keep the old-fashioned COBOL spirit of the 70s alive!) can be very verbose—a single query spanning hundreds of lines is a routine occurrence. The main flaw of SQL, however, lies in its very limited support for abstraction.

Good programming is about creating small, understandable, reusable pieces of logic that can be tested, given names, and organized into packages which can later be used to construct more useful pieces of logic. SQL resists this workflow. Although you can encapsulate certain repeated computations into views and functions, the syntax and support for these can vary among implementations, the notions of packages and imports are generally nonexistent, and higher-level constructions (e.g. passing a function to a function) are impossible.

This inherent resistance to decomposing logic into bite-sized pieces is what leads to contrived, lengthy queries, copy-pasted chunks of code and, eventually, unmaintainable, unstructured (note the irony) SQL codebases. To make things worse, SQL code is rarely tested, because "testing SQL queries" sounds rather esoteric to most engineers, at best. Because of that, a number of alternative query languages and libraries have been developed. Of those, systems based on logic programming perhaps come the closest to addressing SQL's limitations.

Logic programming languages solve these problems of SQL by using the syntax of mathematical propositional logic rather than natural English. The language of formal logic was designed by mathematicians specifically to make the expression of complex statements easier, and it suits this purpose much better than natural language. Logica extends classical logic programming syntax further, most notably with aggregation, hence the name, which stands for

Logica = Logic + Aggregation.

Let us see how it all works. SQL operates with relations, which are sets of rows. In logic programming the analog of a relation is a predicate. While a predicate is a set of rows, we think of it as a logical condition, which describes the rows of a relation. Here is, for example, the definition of a simple predicate:

MagicNumber(x: 2);
MagicNumber(x: 3);
MagicNumber(x: 5);

The definition claims that the condition MagicNumber(x) must hold when x is precisely either 2, 3, or 5. That means that if we were to query this predicate (i.e., request all values of x that satisfy it), the output should be a "relation" with a single column x and rows 2, 3, and 5. The SQL equivalent would be:

SELECT 2 AS x
UNION ALL
SELECT 3 AS x
UNION ALL
SELECT 5 AS x;

Rather than listing the individual values, we could have defined the predicate by encoding a logical condition on x as follows:

MagicNumber(x:) :-
  x in [2, 3, 5];
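
One possible SQL counterpart of this condition-based definition (written in GoogleSQL, and not necessarily the compiler's literal output) unnests an array literal:

-- One possible GoogleSQL equivalent of "x in [2, 3, 5]".
SELECT x FROM UNNEST([2, 3, 5]) AS x;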

Now, here is where the magic starts. Firstly, any table in your database is itself already a predicate, so the following definition:

MagicComment(comment_text:) :-
 `comments`(user_id:, comment_text:),
 user_id == 5;

defines a predicate MagicComment that includes precisely those comment_text values present in the comments table where user_id == 5. In SQL this would read:

SELECT comment_text FROM comments WHERE user_id = 5;

Observe what happens if we replace the condition “user_id == 5” in our predicate with MagicNumber(x: user_id):

MagicComment(comment_text:) :-
 `comments`(user_id:, comment_text:),
 MagicNumber(x: user_id);
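
For comparison, a hand-written SQL query computing the same result might look as follows; this is illustrative and not necessarily the exact SQL that Logica generates.

-- Hand-written SQL counterpart of MagicComment using MagicNumber.
SELECT comment_text
FROM comments
WHERE user_id IN (SELECT x
                  FROM (SELECT 2 AS x UNION ALL
                        SELECT 3 AS x UNION ALL
                        SELECT 5 AS x) AS magic_number);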

Here, we are querying for comments of users whose ID is one of the "magic numbers" we just defined above. Note how easily we could reuse a previously defined piece of code without having to copy anything around. We could now even extract MagicNumber to a common module and import it wherever it is needed:

import my_project.magic.MagicNumber;

As a final example, let us mock the comments table in a unit test of a query.

import my_project.magic.MagicComment;

MockComments(user_id: 1, comment_text: "Hello");
MockComments(user_id: 2, comment_text: "Logic");
MockComments(user_id: 3, comment_text: "Programming");

MagicCommentTest := MagicComment(`comments`: MockComments);

If we query the MagicComment predicate here, it will not try to read the comments table in the database. Instead, it will use the predicate we just defined, thus letting us verify its correctness by testing the output (it must include two rows “Logic” and “Programming”). Observe how natural and frictionless many of the good programming practices become with Logica, and compare that to what you would have to do to achieve the same using bare SQL.
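
In plain SQL terms, querying MagicCommentTest amounts to something like the following, with the mock rows inlined in place of the real comments table (a conceptual sketch, not the compiler's output):

-- Conceptual sketch of the test: the mock rows replace the comments table.
WITH MockComments AS (
  SELECT 1 AS user_id, 'Hello' AS comment_text UNION ALL
  SELECT 2, 'Logic' UNION ALL
  SELECT 3, 'Programming'
)
SELECT comment_text
FROM MockComments
WHERE user_id IN (2, 3, 5);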

There is much more to Logica, so make sure you give it a try—chances are, you will love it! Start with this tutorial to learn Logica. Even if you do not end up using it in your next project, learning a new powerful language may open your mind to new ideas and perspectives on data processing and computing in general.

The simple examples above are only a small sample of how much more concise Logica code can be than SQL for complex queries. In particular, we did not even touch the topic of aggregation in this article. For all of this, see the examples section of the Logica open source repository.

We also hope that some of the readers consider contributing to Logica development. That’s what open source is all about!

By Konstantin Tretyakov and Evgeny Skvortsov – Logica Open Source Project