PowerShell Useful Commands

Domain Name


Start -> Run -> CMD
nslookup
set type=all
_ldap._tcp.dc._msdcs.DOMAIN_NAME

MySQL - Adding Remote User

To add a MySQL user with remote access to the database, you have to:
  • bind the MySQL service to the external IP address of the server
  • add a MySQL user for remote connections
  • grant the user permissions to access the database
To connect remotely, MySQL must listen on port 3306 on your server’s external IP address.
Edit my.cnf:

#Replace xxx with your IP Address
bind-address = xxx.xxx.xxx.xxx
Restart MySQL after changing the config. If no firewall is enabled, external clients should now be able to reach the MySQL service.

Now create the user for both localhost and the ‘%’ wildcard host, and grant it permissions on all databases. Open mysql and run these commands:

CREATE USER 'myuser'@'localhost' IDENTIFIED BY 'mypass';
CREATE USER 'myuser'@'%' IDENTIFIED BY 'mypass';
Then grant the permissions:

GRANT ALL ON *.* TO 'myuser'@'localhost';
GRANT ALL ON *.* TO 'myuser'@'%';
This lets myuser access all databases, both from the server itself and from external hosts. Depending on your OS, you may have to open port 3306 to allow remote connections. If that is your case, check your firewall configuration (iptables on Linux).
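Before debugging users and grants, it can help to confirm from the client machine that port 3306 is reachable at all. The snippet below is a minimal Python sketch of such a check; the function name port_is_open is my own, and it only tests the TCP connection, not the MySQL login.

```python
import socket


def port_is_open(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Call it as port_is_open("xxx.xxx.xxx.xxx") with your server's address. A False result points at bind-address or the firewall rather than at the MySQL user setup.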

Databricks

 


Databricks is a unified analytics platform that helps organizations to solve their most challenging data problems. It is a cloud-based platform that provides a single environment for data engineering, data science, and machine learning.

Databricks offers a wide range of features and capabilities, including:

  • Apache Spark: Databricks is built on Apache Spark, a unified analytics engine for large-scale data processing.
  • Delta Lake: Delta Lake is an open source storage layer for data lakes that provides ACID transactions, data versioning (time travel), and schema enforcement.
  • MLflow: MLflow is an open source platform for managing the end-to-end machine learning lifecycle.
  • Workspaces: Databricks Workspaces provide a secure and collaborative environment for data scientists and engineers to work together.
  • Notebooks: Databricks Notebooks are a powerful tool for data exploration, analysis, and visualization.
  • Jobs: Databricks Jobs are a way to automate data pipelines and workflows.
  • Monitoring: Databricks provides a comprehensive monitoring dashboard that provides visibility into your data and workloads.

Databricks is a popular choice for organizations of all sizes. It is used by some of the world's largest companies, such as Airbnb, Spotify, and Uber.

Here are some of the benefits of using Databricks:

  • Speed: Databricks can help you to process large amounts of data quickly and efficiently.
  • Scalability: Databricks is scalable, so you can easily add more resources as your needs grow.
  • Ease of use: Managed clusters and notebooks mean that users do not have to set up and operate Spark infrastructure themselves.
  • Collaboration: Databricks provides a collaborative environment for data scientists and engineers to work together.
  • Security: Databricks provides access controls and encryption to help keep your data protected.

If you are looking for a unified analytics platform that can help you to solve your most challenging data problems, then Databricks is a good choice.

Here are some of the use cases for Databricks:

  • Data engineering: Databricks can be used to build and manage data pipelines.
  • Data science: Databricks can be used to develop and deploy machine learning models.
  • Business intelligence: Databricks can be used to create interactive dashboards and reports.
  • Regulatory compliance: Databricks can be used to help organizations comply with regulations, such as GDPR and CCPA.
  • Research: Databricks can be used to conduct research and analysis on large datasets.

If you are interested in learning more about Databricks, I recommend that you visit the Databricks website.

Data Catalog

 A data catalog is a system that collects and organizes metadata about data assets. It provides a central repository for information about the data, such as its source, format, and usage. Data catalogs can be used to help people find and use the data they need, and to improve the overall management of data assets.
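To make the idea concrete, here is a toy in-memory catalog in Python. It is only a sketch: the class names, fields, and sample entries are invented for illustration, and a real catalog would persist the metadata and collect it automatically.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset (the data itself lives elsewhere)."""
    name: str
    source: str
    data_format: str
    owner: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """A toy catalog: register entries, then search their metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list:
        """Return entries whose name or tags contain the keyword (case-insensitive)."""
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or any(kw in t.lower() for t in e.tags)
        ]


# Hypothetical sample assets, for illustration only.
catalog = DataCatalog()
catalog.register(CatalogEntry("sales_orders", "mysql://erp/orders", "table",
                              "finance-team", {"sales", "orders"}))
catalog.register(CatalogEntry("web_clicks", "s3://logs/clicks/", "parquet",
                              "web-team", {"clickstream"}))
print([e.name for e in catalog.search("sales")])
```

The point of the sketch is that the catalog stores only descriptions (source, format, owner, tags), which is what lets people find data without opening every system.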

Here are some of the benefits of using a data catalog:

  • Improved data discovery: Data catalogs can help people find the data they need by providing a central repository for information about the data. This can save time and effort, and it can help to ensure that people are using the most accurate and up-to-date data.
  • Increased data usability: Data catalogs can make data more usable by providing information about the data's format, lineage, and quality. This can help people understand the data and to use it more effectively.
  • Improved data governance: Data catalogs can help to improve data governance by providing information about the data's ownership, access control, and security. This can help to ensure that the data is managed in a secure and compliant manner.
  • Reduced data duplication: Data catalogs can help to reduce data duplication by providing information about the data's location and usage. This can help to prevent people from creating duplicate copies of the data.
  • Improved data quality: Data catalogs can help to improve data quality by providing information about the data's lineage and quality. This can help to identify and correct errors in the data.

There are two main types of data catalogs:

  • Enterprise data catalogs: These are designed to be used by entire organizations. They typically store metadata about all of the data assets in the organization.
  • Self-service data catalogs: These are designed to be used by individual users or teams. They typically store metadata about the data assets that are relevant to the user or team.

Data catalogs can be implemented using a variety of technologies. In the Hadoop ecosystem, for example, the Hive metastore stores table metadata that engines such as Hive and Spark read. The best technology for your organization will depend on your specific needs and requirements.

If you are considering implementing a data catalog in your organization, I recommend that you do the following:

  • Define your goals: Decide what you want to achieve by implementing a data catalog.
  • Identify your stakeholders: Determine who will be using the data catalog.
  • Assess your current state: Review how you manage data today, including strengths and weaknesses.
  • Develop a plan: Write an implementation plan that covers the goals, stakeholders, and resources needed.
  • Implement the plan: Carry out the plan. This may involve changes to your policies, procedures, and technology.
  • Monitor and improve: Track how the data catalog is used and adjust it so that it keeps meeting your goals.

By following these steps, you can implement a data catalog in your organization and reap the benefits that it has to offer.

KERBEROS - ACL Example

 Here is an example of a kadm5.acl file:

*/admin@ATHENA.MIT.EDU    *                               # line 1
joeadmin@ATHENA.MIT.EDU   ADMCIL                          # line 2
joeadmin/*@ATHENA.MIT.EDU i   */root@ATHENA.MIT.EDU       # line 3
*/root@ATHENA.MIT.EDU     ci  *1@ATHENA.MIT.EDU           # line 4
*/root@ATHENA.MIT.EDU     l   *                           # line 5
sms@ATHENA.MIT.EDU        x   * -maxlife 9h -postdateable # line 6

(line 1) Any principal in the ATHENA.MIT.EDU realm with an admin instance has all administrative privileges except extracting keys.

(lines 1-3) The user joeadmin has all permissions except extracting keys with his admin instance, joeadmin/admin@ATHENA.MIT.EDU (matches line 1). He has no permissions at all with his null instance, joeadmin@ATHENA.MIT.EDU (matches line 2). His root and other non-admin, non-null instances (e.g., extra or dbadmin) have inquire permissions with any principal that has the instance root (matches line 3).

(line 4) Any root principal in ATHENA.MIT.EDU can inquire or change the password of their null instance, but not any other null instance. (Here, *1 denotes a back-reference to the component matching the first wildcard in the actor principal.)

(line 5) Any root principal in ATHENA.MIT.EDU can generate the list of principals in the database, and the list of policies in the database. This line is separate from line 4, because list permission can only be granted globally, not to specific target principals.

(line 6) Finally, the Service Management System principal sms@ATHENA.MIT.EDU has all permissions except extracting keys, but any principal that it creates or modifies will not be able to get postdateable tickets or tickets with a life of longer than 9 hours.
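The permission column is easier to read once the letters are expanded. The Python sketch below does that, using the letter meanings as I understand them from the MIT Kerberos kadm5.acl documentation (lowercase grants, uppercase denies, x and * mean everything except extracting keys). The function name and the left-to-right handling of mixed strings are my own assumptions, not kadmind's actual code.

```python
# Permission letters as documented for kadm5.acl; uppercase means "deny".
PERMISSIONS = {
    "a": "add",
    "c": "changepw",
    "d": "delete",
    "e": "extract",
    "i": "inquire",
    "l": "list",
    "m": "modify",
    "p": "propagate",
    "s": "setkey",
}
ALL_BUT_EXTRACT = "admcilsp"  # what "x" and "*" stand for


def granted(perm_string: str) -> set:
    """Return the set of operations a permission string grants."""
    result = set()
    for ch in perm_string:
        if ch in ("x", "*"):
            result.update(PERMISSIONS[c] for c in ALL_BUT_EXTRACT)
        elif ch.islower() and ch in PERMISSIONS:
            result.add(PERMISSIONS[ch])
        elif ch.isupper() and ch.lower() in PERMISSIONS:
            # Assumption: a denial removes the permission if it was set earlier.
            result.discard(PERMISSIONS[ch.lower()])
        else:
            raise ValueError(f"unknown permission letter: {ch!r}")
    return result


print(sorted(granted("ci")))
print(sorted(granted("ADMCIL")))
```

With this, line 2 of the example (ADMCIL) expands to an empty set, which matches the statement that joeadmin's null instance has no permissions at all.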

Importance of change advisory board (CAB)

 

A change advisory board (CAB) is a group of people who meet regularly to review and approve changes to an organization's IT infrastructure. The CAB helps to ensure that changes are made in a controlled and orderly manner, and that they do not impact the business negatively.

The importance of a CAB can be summarized as follows:

  • Ensures that changes are reviewed and approved by a group of experts: The CAB typically includes representatives from different areas of the organization, such as IT, business, and operations. This ensures that changes are reviewed from all angles and that any potential risks are identified and mitigated.
  • Provides a forum for communication and collaboration: The CAB provides a forum for stakeholders to discuss changes and to reach consensus on the best course of action. This helps to ensure that changes are implemented smoothly and that everyone is on the same page.
  • Helps to improve the quality of changes: The CAB can help to ensure that changes are well-planned, well-tested, and documented. This helps to reduce the risk of errors and problems.
  • Helps to improve the efficiency of change management: The CAB can help to streamline the change management process and to identify opportunities for improvement. This can help to save time and money.
  • Helps to build trust and credibility: The CAB can help to build trust and credibility between IT and the business. This is important for ensuring that changes are supported by the business and that they are implemented successfully.

Overall, the CAB is an important part of any organization's change management process. By ensuring that changes are reviewed and approved by a group of experts, the CAB helps to improve the quality, efficiency, and success of changes.

Here are some of the benefits of having a CAB:

  • Improved decision-making: The CAB can help to improve decision-making by providing a forum for discussion and debate. This can help to ensure that all perspectives are considered and that the best possible decision is made.
  • Increased visibility: The CAB can help to increase visibility of changes by providing a forum for communication and collaboration. This can help to ensure that everyone is aware of changes and that they are implemented smoothly.
  • Reduced risk: The CAB can help to reduce risk by identifying and mitigating potential problems. This can help to prevent changes from impacting the business negatively.
  • Improved efficiency: The CAB can help to improve efficiency by streamlining the change management process. This can help to save time and money.
  • Increased compliance: The CAB can help to ensure that changes comply with all relevant regulations. This can help to protect the organization from fines and penalties.

If you are considering implementing a CAB in your organization, I recommend that you do the following:

  • Define the scope of the CAB: The first step is to define the scope of the CAB. This will help to determine who should be involved and what issues should be discussed.
  • Identify the members of the CAB: The next step is to identify the members of the CAB. The members should be experts from different areas of the organization, such as IT, business, and operations.
  • Establish the meeting schedule: The CAB should meet regularly to review and approve changes. The meeting schedule should be agreed upon by all members.
  • Develop the meeting agenda: The CAB should have a clear agenda for each meeting. This will help to ensure that the meeting is productive.
  • Document the decisions: The decisions of the CAB should be documented. This will help to ensure that everyone is aware of the decisions that have been made.

By following these steps, you can ensure that your CAB is successful.

Installing Apache Zeppelin on a Hadoop Cluster

Apache Zeppelin (https://zeppelin.incubator.apache.org/) is a web-based notebook that enables interactive data analytics. You can make data-driven, interactive and collaborative documents with SQL, Scala and more.


This document describes the steps you can take to install Apache Zeppelin on a CentOS 7 Machine.


Steps

Note: Run all the commands as root


Configure the Environment

Install Maven (If not already done)

cd /tmp/

wget https://archive.apache.org/dist/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz

tar xzf apache-maven-3.1.1-bin.tar.gz -C /usr/local

cd /usr/local

ln -s apache-maven-3.1.1 maven

Configure Maven (If not already done)

#Run the following

export M2_HOME=/usr/local/maven

export M2=${M2_HOME}/bin

export PATH=${M2}:${PATH}

Note: If you log in as a different user or log out, these settings will be wiped out and you won’t be able to run any mvn commands. To prevent this, append these export statements to the end of your ~/.bashrc file:


#append the export statements

vi ~/.bashrc

#apply the export statements

source ~/.bashrc


Install NodeJS


Note: Steps referenced from https://nodejs.org/en/download/package-manager/


curl --silent --location https://rpm.nodesource.com/setup_5.x | bash -


yum install -y nodejs

Install Dependencies

Note: Used for Zeppelin Web App


yum install -y bzip2 fontconfig

Install Apache Zeppelin

Select the version you would like to install

View the available releases and select the latest:


https://github.com/apache/zeppelin/releases


Override the {APACHE_ZEPPELIN_VERSION} placeholder with the value you would like to use.



Download Apache Zeppelin

cd /opt/

wget https://github.com/apache/zeppelin/archive/{APACHE_ZEPPELIN_VERSION}.zip

unzip {APACHE_ZEPPELIN_VERSION}.zip

ln -s /opt/zeppelin-{APACHE_ZEPPELIN_VERSION-WITHOUT_V_INFRONT} /opt/zeppelin

rm {APACHE_ZEPPELIN_VERSION}.zip

Get Build Variable Values

Get Spark Version

Run the following command:


spark-submit --version

Override the {SPARK_VERSION} placeholder with this value.


Example: 1.6.0


Get Hadoop Version

Run the following command:


hadoop version

Override the {HADOOP_VERSION} placeholder with this value.


Example: 2.6.0-cdh5.9.0


Take this value and extract the major and minor version of Hadoop. Override the {SIMPLE_HADOOP_VERSION} placeholder with the result.


Example: 2.6
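If you prefer not to trim the version by hand, the major and minor parts can be extracted with a regular expression. This is a small Python sketch; the function name simple_hadoop_version is made up here to mirror the placeholder.

```python
import re


def simple_hadoop_version(hadoop_version: str) -> str:
    """Return 'major.minor' from a version such as '2.6.0-cdh5.9.0'.

    Also accepts the first line of `hadoop version`, e.g. 'Hadoop 2.6.0-cdh5.9.0'.
    """
    match = re.search(r"(\d+)\.(\d+)", hadoop_version)
    if match is None:
        raise ValueError(f"unrecognised Hadoop version: {hadoop_version!r}")
    return f"{match.group(1)}.{match.group(2)}"


print(simple_hadoop_version("2.6.0-cdh5.9.0"))
```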


Build Apache Zeppelin

Update the placeholders below and run:


cd /opt/zeppelin

mvn clean package -Pspark-{SPARK_VERSION} -Dhadoop.version={HADOOP_VERSION} -Phadoop-{SIMPLE_HADOOP_VERSION} -Pvendor-repo -DskipTests

Note: this process will take a while


 


Configure Apache Zeppelin

Base Zeppelin Configuration

Setup Conf

cd /opt/zeppelin/conf/

cp zeppelin-env.sh.template zeppelin-env.sh

cp zeppelin-site.xml.template zeppelin-site.xml

Setup Hive Conf

# note: verify that the path to your hive-site.xml is correct

ln -s /etc/hive/conf/hive-site.xml /opt/zeppelin/conf/

Edit zeppelin-env.sh

Uncomment export HADOOP_CONF_DIR

Set it to export HADOOP_CONF_DIR="/etc/hadoop/conf"


Starting/Stopping Apache Zeppelin

Start Zeppelin

/opt/zeppelin/bin/zeppelin-daemon.sh start

Restart Zeppelin

/opt/zeppelin/bin/zeppelin-daemon.sh restart

Stop Zeppelin

/opt/zeppelin/bin/zeppelin-daemon.sh stop

Viewing Web UI

Once the zeppelin process is running you can view the WebUI by opening a web browser and navigating to:


http://{HOST}:8080/


Note: Network rules will need to allow this communication


Runtime Apache Zeppelin Configuration

Further configuration may be needed for certain operations to work.


Configure Hive in Zeppelin

Open Cloudera Manager and get the public host name of the machine that has the HiveServer2 role. Identify this as HIVESERVER2_HOST.

Open the Web UI and click the Interpreter tab

Change the Hive default.url option to: jdbc:hive2://{HIVESERVER2_HOST}:10000


How to check the MD5 checksum of a downloaded file

 Issue:

You would like to verify the integrity of your downloaded files.

Solution:

WINDOWS:

Download the latest version of WinMD5Free.

Extract the downloaded zip and launch the WinMD5.exe file.

Click on the Browse button, navigate to the file that you want to check and select it.

As soon as you select the file, the tool shows its MD5 checksum.

Copy the original MD5 value provided by the developer or the download page and paste it into the tool.

Click on the Verify button.

MAC:

Download the file you want to check and open the download folder in Finder.

Open the Terminal, from the Applications / Utilities folder.

Type md5 followed by a space. Do not press Enter yet.

Drag the downloaded file from the Finder window into the Terminal window.

Press Enter and wait a few moments.

The MD5 hash of the file is displayed in the Terminal.

Open the checksum file provided on the Web page where you downloaded your file from.

The file usually has a .cksum extension.

NOTE: The file should contain the MD5 sum of the download file. For example: md5sum: 25d422cc23b44c3bbd7a66c76d52af46

 Compare the MD5 hash in the checksum file to the one displayed in the Terminal.

If they are exactly the same, your file was downloaded successfully. Otherwise, download your file again.


LINUX:

Open a terminal window.

Type the following command: md5sum [path of the file, including the file name and extension] -- NOTE: You can also drag the file to the terminal window instead of typing the full path.

Hit the Enter key.

You’ll see the MD5 sum of the file. 

Match it against the original value.
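The same check can be done on any of the three platforms with Python's standard hashlib module. The sketch below is one way to do it; the function names are my own, and the file is read in chunks so large downloads do not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, published: str) -> bool:
    """Compare a file's MD5 with the published value, ignoring case and spaces."""
    return md5_of_file(path) == published.strip().lower()
```

Call matches_published_md5 with the path of the download and the value from the download page. If it returns False, download the file again.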

Cannot Login to Cloudera Manager with LDAP/LDAPS Enabled

Summary

After changing ‘Authentication Backend Order’ to external, users cannot log in. This guide explains how to revert to the default behaviour, which authenticates against the database.

Symptoms

Users cannot log in to Cloudera Manager

Conditions

Cloudera Manager boots up

Login page accessible through the browser

External authentication is enabled (LDAP, LDAP with TLS = LDAPS)

Authentication Backend Order was changed to external authentication.

Cause

Cloudera Manager tries to connect to LDAP if auth_backend_order is set to external only or to external and DB. A misconfiguration of LDAP or external authentication prevents Cloudera Manager Server from mapping user credentials correctly.

Instructions

Follow the instructions below to fix this.

Note: Take a backup of the SCM database first.

Once the auth_backend_order config is deleted, Cloudera Manager falls back to the DB_ONLY auth backend and will not try to connect to the LDAP server.

Step 1:

Stop the Cloudera Manager server

$ sudo service cloudera-scm-server stop

Confirm that auth_backend_order is set to a non-default value, i.e. something other than DB_ONLY or empty (the select query in the next step shows the current value).


Step 2:

Check the current Authentication Backend Order configuration in the Cloudera Manager schema.

Connect to the MySQL database:

./mysql -u root -p

mysql> use scm;

mysql> select ATTR, VALUE from CONFIGS where ATTR = 'auth_backend_order';

Delete the auth_backend_order attribute from the Cloudera Manager database (this reverts to the default behavior). Run the query below in the Cloudera Manager schema to reset the Authentication Backend Order configuration:

mysql> delete from CONFIGS where ATTR = 'auth_backend_order' and SERVICE_ID is null;
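If you want to see what the two queries do before touching the real database, you can try them on a throwaway table. The sketch below uses Python's built-in sqlite3 with a toy CONFIGS table that has only the three columns named above. The real SCM schema has more columns and runs on MySQL here, and the sample values are placeholders, so treat this as an illustration of the query logic only.

```python
import sqlite3

# Toy stand-in for the SCM database: only the three columns used above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CONFIGS (ATTR TEXT, VALUE TEXT, SERVICE_ID INTEGER)")
conn.executemany(
    "INSERT INTO CONFIGS VALUES (?, ?, ?)",
    [
        ("auth_backend_order", "NON_DEFAULT_VALUE", None),  # placeholder value
        ("some_other_setting", "42", None),                 # hypothetical unrelated row
    ],
)


def backend_order(connection):
    """Return the stored auth_backend_order value, or None if the row is absent."""
    row = connection.execute(
        "SELECT VALUE FROM CONFIGS "
        "WHERE ATTR = 'auth_backend_order' AND SERVICE_ID IS NULL"
    ).fetchone()
    return row[0] if row else None


before = backend_order(conn)
conn.execute(
    "DELETE FROM CONFIGS WHERE ATTR = 'auth_backend_order' AND SERVICE_ID IS NULL"
)
after = backend_order(conn)
remaining = conn.execute("SELECT COUNT(*) FROM CONFIGS").fetchone()[0]
print(before, after, remaining)
```

The delete removes only the auth_backend_order row; the other row stays, which is the behaviour you want from the real query as well.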


Step 3:

Start the Cloudera Manager server

$ sudo service cloudera-scm-server start


Try to log in now as the admin user.


Reference

https://www.devopsbaba.com/cannot-login-to-cloudera-manager-with-ldap-ldaps-enabled/


Running Multiple workers of Apache Spark on Windows

The steps below have been tried on two different Windows 10 laptops, with two different Spark 2.x.x versions and with Spark 3.1.2.


Installing Prerequisites

PySpark requires Java and Python. The Spark 2.x.x and 3.x.x releases used here run on Java 8; for Python, see the version note in the Python step below.

  1. Java

To check if Java is already available and find its version, open a Command Prompt and type the following command.

java -version

If the above command gives an output like the one below, you already have Java and can skip the steps below.

java version "1.8.0_271"
Java(TM) SE Runtime Environment (build 1.8.0_271-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode)
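If you want to check the version from a script rather than by eye, the first line of that output can be parsed. The Python sketch below is one way to do it; the function name is my own, and it assumes the usual version "..." format shown above (older releases report 1.8 for Java 8, newer ones report 11, 17 and so on).

```python
import re


def java_major_version(version_output: str) -> int:
    """Extract the major Java version from the output of `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no Java version string found")
    first, second = match.group(1), match.group(2)
    # Legacy scheme: "1.8.0_271" means Java 8; newer scheme: "11.0.2" means Java 11.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


sample = 'java version "1.8.0_271"'
print(java_major_version(sample))
```

Note that java -version writes to standard error, so when capturing it from Python read the stderr stream, e.g. subprocess.run(["java", "-version"], capture_output=True, text=True).stderr.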

Java comes in two packages: the JRE and the JDK. Which one to use depends on whether you just want to run Java programs or also develop them.

You can download either of them from the Oracle website.

If you want to run Java programs, but not develop them, download the Java Runtime Environment, or JRE™.

If you want to develop applications for Java, download the Java Development Kit, or JDK™. The JDK includes the JRE, so you do not have to download both separately.

In our case, we just want to run Java, so we will download the JRE.
1. Download the Windows x86 (e.g. jre-8u271-windows-i586.exe) or Windows x64 (jre-8u271-windows-x64.exe) version, depending on whether your Windows is 32-bit or 64-bit
2. The website may ask you to register, in which case you can register with your email address
3. Run the installer once the download completes.

Note: The above two .exe files require admin rights for installation.

In case you do not have admin access to your machine, download the .tar.gz version (e.g. jre-8u271-windows-x64.tar.gz). Extracting this archive gives you a Java JRE or JDK installation without running an installer.
You can use 7zip to extract the files. Extracting the .tar.gz file gives a .tar file; extract this one more time using 7zip.
Or you can run the command below in cmd on the downloaded file to extract it:

tar -xvzf jre-8u271-windows-x64.tar.gz

Make a note of where Java is getting installed as we will need the path later.

2. Python

Use Anaconda to install Python: https://www.anaconda.com/products/individual

Use the command below to check the version of Python.

python --version

Run the above command in Anaconda Prompt in case you have used Anaconda to install it. It should give an output like below.

Python 3.7.9

Note: Spark 2.x.x doesn’t support Python 3.8, so install Python 3.7.x for it. For more information, refer to this Stack Overflow question. Spark 3.x.x supports Python 3.8.

Scripted setup

The following steps can be scripted as a batch file and run in one go. The script is provided after the walkthrough below.

Getting the Spark files

Download the required Spark version from the Apache Spark Downloads website. Get the ‘spark-x.x.x-bin-hadoop2.7.tgz’ file, e.g. spark-2.4.3-bin-hadoop2.7.tgz.

Spark 3.x.x is also available with Hadoop 3.2, but this Hadoop version causes errors when writing Parquet files, so Hadoop 2.7 is recommended.

Adjust the remaining steps to match the Spark version you chose.

You can extract the files using 7zip. Extracting the .tgz file gives a .tar file; extract this one more time using 7zip. Or you can run the command below in cmd on the downloaded file to extract it:

tar -xvzf spark-2.4.3-bin-hadoop2.7.tgz

Putting everything together

Setup folder

Create a folder for the Spark installation at a location of your choice, e.g. C:\spark_setup.

Extract the Spark archive and move the extracted folder into the chosen folder: C:\spark_setup\spark-2.4.3-bin-hadoop2.7

Adding winutils.exe

From this GitHub repository, download the winutils.exe file corresponding to the Spark and Hadoop version.

We are using Hadoop 2.7, hence download winutils.exe from hadoop-2.7.1/bin/.

Copy this file to the following paths, replacing any existing copy (create the \hadoop\bin directories):

  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin
  • C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop\bin

Setting environment variables

We have to set up the environment variables below to let Spark know where the required files are.

In the Start Menu, type ‘Environment Variables’ and select ‘Edit the system environment variables’.

Click on ‘Environment Variables…’.

Add ‘New…’ variables:

  1. Variable name: SPARK_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7 (path to setup folder)
  2. Variable name: HADOOP_HOME
    Variable value: C:\spark_setup\spark-2.4.3-bin-hadoop2.7\hadoop
    OR
    Variable value: %SPARK_HOME%\hadoop
  3. Variable name: JAVA_HOME
    Variable value: Set it to the Java installation folder, e.g. C:\Program Files\Java\jre1.8.0_271
    Find it in ‘Program Files’ or ‘Program Files (x86)’ based on which version was installed above. In case you used the .tar.gz version, set the path to the location where you extracted it.
  4. Variable name: PYSPARK_PYTHON
    Variable value: python
    This environment variable is required to ensure tasks that involve python workers, such as UDFs, work properly. Refer to this StackOverflow post.
  5. Select the ‘Path’ variable and click on ‘Edit…’.

Click on ‘New’ and add the Spark bin path, e.g. C:\spark_setup\spark-2.4.3-bin-hadoop2.7\bin OR %SPARK_HOME%\bin

All required Environment variables have been set.

Optional variables: Set the variables below if you want to use PySpark with Jupyter notebook. If these are not set, the PySpark session will start on the console.

  1. Variable name: PYSPARK_DRIVER_PYTHON
    Variable value: jupyter
  2. Variable name: PYSPARK_DRIVER_PYTHON_OPTS
    Variable value: notebook

Scripted setup

Edit and use the script below to (almost) automate the PySpark setup process.

Steps that aren’t automated: Java and Python installation, and updating the ‘Path’ variable.

Install Java and Python beforehand, and edit the ‘Path’ variable after running the script, as mentioned in the walkthrough above.

Change the script to match the required Spark version and the installation paths. Save it as a .bat file and double-click to run.
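As a companion, here is a small Python sketch that derives the variable values from the walkthrough and prints one setx line per variable, which you could paste into a .bat file. It is not the batch script referred to above: the function names are my own, and the two paths are the example paths from this post, so change them to your own.

```python
import ntpath  # Windows path rules, so the sketch behaves the same on any OS


def spark_env_vars(spark_home: str, java_home: str) -> dict:
    """Return the environment variables from the walkthrough as a dict."""
    return {
        "SPARK_HOME": spark_home,
        "HADOOP_HOME": ntpath.join(spark_home, "hadoop"),
        "JAVA_HOME": java_home,
        "PYSPARK_PYTHON": "python",
    }


def setx_commands(env: dict) -> list:
    """Render one `setx NAME "value"` line per variable for a .bat file."""
    return [f'setx {name} "{value}"' for name, value in env.items()]


# Example paths from the walkthrough; adjust to your installation.
env = spark_env_vars(
    r"C:\spark_setup\spark-2.4.3-bin-hadoop2.7",
    r"C:\Program Files\Java\jre1.8.0_271",
)
print("\n".join(setx_commands(env)))
```

The ‘Path’ variable is deliberately left out, in line with the note above that it should be edited by hand.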

Using PySpark in standalone mode on Windows

You might have to restart your machine after the above steps in case the commands below don’t work.

Commands

Run each command in a separate Anaconda Prompt.

  1. Deploying Master
    spark-class.cmd org.apache.spark.deploy.master.Master -h 127.0.0.1
    Open your browser and navigate to: http://localhost:8080/. This is the SparkUI.
  2. Deploying Worker
    spark-class.cmd org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
    SparkUI will show the worker status.

  3. PySpark shell
    pyspark --master spark://127.0.0.1:7077 --num-executors 1 --executor-cores 1 --executor-memory 4g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false

Adjust num-executors, executor-cores, executor-memory and driver-memory as per your machine config. SparkUI will show the list of PySparkShell sessions.

Activate the required python environment in the third anaconda prompt before running the above pyspark command.

The above command will open Jupyter Notebook instead of pyspark shell if you have set the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS Environment variables as well.

Alternative

Run the command below to start a pyspark (shell or Jupyter) session using all resources available on your machine. Activate the required Python environment before running the pyspark command.

pyspark --master local[*]

Please let me know in the comments if any of the steps give errors or you face any other issues.