Denormalization in Databases

Denormalization is a strategy applied to a previously normalized database to increase performance. It is the process of improving the read performance of a database, at the expense of some write performance, by adding redundant copies of data or by grouping data.

What is the advantage of denormalization?

Denormalization can improve performance by:

  1. Minimizing the need for joins.
  2. Precomputing aggregate values, that is, computing them at data modification time rather than at select time.
  3. Reducing the number of tables, in some cases.
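For example, here is a minimal sketch in Java (hypothetical schema, assuming the H2 in-memory database driver is on the classpath): the orders table keeps a redundant copy of customer_name so reads avoid a join, and a precomputed order_total so SELECTs need no aggregation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DenormalizationSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement st = conn.createStatement()) {

                // Normalized: orders reference customers by id; reads need a join.
                st.execute("CREATE TABLE customers (id INT PRIMARY KEY, name VARCHAR(50))");

                // Denormalized: customer_name is copied into orders, and order_total
                // is computed at write time, so reads need no join or aggregation.
                st.execute("CREATE TABLE orders (id INT PRIMARY KEY, customer_id INT, "
                        + "customer_name VARCHAR(50), order_total DECIMAL(10,2))");

                st.execute("INSERT INTO customers VALUES (1, 'Alice')");
                st.execute("INSERT INTO orders VALUES (100, 1, 'Alice', 59.90)");

                // Read path: a single-table lookup, no join:
                // SELECT customer_name, order_total FROM orders WHERE id = 100
            }
        }
    }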

Builder Pattern in Java

In the Builder pattern, we have a static inner class named Builder inside our Server class, with instance fields mirroring those of the outer class, and a factory method that returns a new Builder instance on every invocation. The setter methods return the Builder reference so calls can be chained. We also have a build() method that returns an instance of the Server class, i.e. the outer class.

The Builder pattern is a creational design pattern that provides a flexible solution to various object-construction problems in object-oriented programming. Its intent is to separate the construction of a complex object from its representation. It is one of the Gang of Four design patterns.
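A minimal sketch of the Server/Builder structure described above; the host and port fields are illustrative assumptions, not a fixed part of the pattern.

    public class Server {
        private final String host;
        private final int port;

        // Private constructor: instances are created only through the Builder.
        private Server(Builder builder) {
            this.host = builder.host;
            this.port = builder.port;
        }

        // Factory method returning a new Builder instance on every invocation.
        public static Builder builder() {
            return new Builder();
        }

        // Static inner class mirroring the outer class's fields.
        public static class Builder {
            private String host;
            private int port;

            // Setters return the Builder reference so calls can be chained.
            public Builder host(String host) {
                this.host = host;
                return this;
            }

            public Builder port(int port) {
                this.port = port;
                return this;
            }

            // build() returns an instance of the outer Server class.
            public Server build() {
                return new Server(this);
            }
        }
    }

Usage: Server server = Server.builder().host("localhost").port(8080).build();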

What are Datasets and DataFrames in Apache Spark?

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not have support for the Dataset API, but due to Python's dynamic nature, many of its benefits are already available (e.g., you can access a field of a row by name naturally: row.columnName). The case for R is similar.
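A minimal Java sketch of the Dataset API, assuming a local SparkSession and the spark-sql dependency on the classpath:

    import java.util.Arrays;

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public class DatasetSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("DatasetSketch")
                    .master("local[*]")
                    .getOrCreate();

            // A strongly typed Dataset constructed from JVM objects.
            Dataset<String> names = spark.createDataset(
                    Arrays.asList("Alice", "Bob", "Charlie"), Encoders.STRING());

            // Functional transformation; the cast selects the MapFunction
            // overload of map in the Java API.
            Dataset<Integer> lengths = names.map(
                    (MapFunction<String, Integer>) String::length, Encoders.INT());

            lengths.show();
            spark.stop();
        }
    }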

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows: in the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API, users need to use Dataset<Row> to represent a DataFrame.
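A short Java sketch; people.json is a hypothetical input file. In the Java API the DataFrame is simply Dataset<Row>:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DataFrameSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("DataFrameSketch")
                    .master("local[*]")
                    .getOrCreate();

            // DataFrame = Dataset of Rows, with named columns.
            Dataset<Row> df = spark.read().json("people.json"); // hypothetical file

            df.printSchema();          // columns and inferred types
            df.select("name").show();  // column-oriented operations
            spark.stop();
        }
    }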

What is Spark SQL?

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations.

One use of Spark SQL is to execute SQL queries.

When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.
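A minimal sketch of running SQL from Java (the file and view names are illustrative): register a DataFrame as a temporary view, then query it; the result comes back as a Dataset<Row>.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("SparkSqlSketch")
                    .master("local[*]")
                    .getOrCreate();

            Dataset<Row> people = spark.read().json("people.json"); // hypothetical file
            people.createOrReplaceTempView("people");

            // SQL results are returned as a Dataset<Row> (a DataFrame).
            Dataset<Row> adults = spark.sql("SELECT name FROM people WHERE age > 21");
            adults.show();

            spark.stop();
        }
    }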

What is winutils.exe?

Apache Spark requires the executable file winutils.exe to function correctly on the Windows Operating System when running against a non-Windows cluster.

Download the winutils.exe binary from the WinUtils repository. Select the version of Hadoop that your Spark distribution was compiled with; for example, use hadoop-2.7.1 for Spark 3.0.1.
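A common setup sketch; the C:\hadoop path is an illustrative assumption. Place winutils.exe under %HADOOP_HOME%\bin and point Hadoop at it, either via the HADOOP_HOME environment variable or programmatically before creating the SparkSession:

    import org.apache.spark.sql.SparkSession;

    public class WinutilsSetup {
        public static void main(String[] args) {
            // Assumes winutils.exe is located at C:\hadoop\bin\winutils.exe.
            // Equivalent to setting the HADOOP_HOME environment variable.
            System.setProperty("hadoop.home.dir", "C:\\hadoop");

            SparkSession spark = SparkSession.builder()
                    .appName("WinutilsSetup")
                    .master("local[*]")
                    .getOrCreate();

            spark.range(5).show(); // quick smoke test
            spark.stop();
        }
    }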

How to sort objects in Java using Comparator?

In the example below, the class Student has three fields: name, age, and mark.

Now I am showing two types of sorting here (see the sketch below):

  1. by name
  2. by mark

Comparator is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference. This interface is a member of the Java Collections Framework.
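A minimal sketch of the Student example described in the text; the exact original code is not shown, so the sample data is an assumption:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class StudentSortDemo {

        static class Student {
            String name;
            int age;
            int mark;

            Student(String name, int age, int mark) {
                this.name = name;
                this.age = age;
                this.mark = mark;
            }

            @Override
            public String toString() {
                return name + " (age " + age + ", mark " + mark + ")";
            }
        }

        public static void main(String[] args) {
            List<Student> students = new ArrayList<>();
            students.add(new Student("Charlie", 21, 75));
            students.add(new Student("Alice", 20, 92));
            students.add(new Student("Bob", 22, 81));

            // 1. Sort by name: Comparator is a functional interface, so a
            //    lambda can serve as the comparator.
            students.sort(Comparator.comparing((Student s) -> s.name));
            System.out.println("By name: " + students);

            // 2. Sort by mark (descending) using another lambda.
            students.sort((a, b) -> Integer.compare(b.mark, a.mark));
            System.out.println("By mark: " + students);
        }
    }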