Project 4 Workspace Exercise

Instructions

Team Assignment Rules:

  1. A team assignment is required for every student as an important component of graduate education.

  2. Team memberships are assigned by the instructor objectively, according to the alphabetical order of students' last names. The instructor may adjust team membership to account for dropped or nonparticipating students.

  3. The instructor will not accept personal preferences from students when deciding team memberships.

  4. Each team will have four or five members, in that order of preference. The instructor will use a different team size only when necessary. The initial size difference between any two teams should be no more than one member.

  5. Each team will select a team leader to coordinate the progress and outline the milestones within two weeks after receiving the team assignment.

  6. The individual grade on the team assignment may vary based on student peer reviews and instructor evaluation of actual individual participation and contribution.

Assignment Objectives

Competencies: software development life cycle (SDLC)

Lab Overview

As you perform this lab, you will reinforce the concepts you learned in Project 4, especially those concerning the SDLC and the software environments used in cloud computing. You will experiment with the Hadoop Sandbox to learn about cloud environments and security.

In this lab, you will learn how a cloud-based environment works and understand how the SDLC and design concepts apply in this environment. To accomplish this, a lab subteam of two people will become familiar with the Hadoop Sandbox tools by completing several tutorials. The remaining team members will research SDLC and development methods in preparation for the final lab report.

There are two options for the lab team to choose from. Option 1 is to review the Hadoop Sandbox tutorials as described in Appendix B and complete the lab report. Option 2 is to download and run the Hadoop Sandbox VM on a local PC. The Hadoop Sandbox can be freely downloaded and run on a VM product such as Oracle VirtualBox or VMWare; instructions for doing this can be found in Appendices C and D.

Warning: The UMUC virtual lab support team cannot provide any computer assistance when using your own PC for this lab.

Instructions

All team members should familiarize themselves with the resources provided in the Lab Resources and Appendix A sections of this document. They include helpful open-source links that will help you understand the tools you will use in this lab. When you have finished with the cloud VM sandbox, capture lessons learned about what the cloud environment implies for the next step of cloud-distributed environments (private, public, and hybrid clouds, and service-level agreements with providers), particularly with respect to security concerns and issues.

Two team members will become familiar with the Hadoop Sandbox tools by completing several tutorials; this will be the lab subteam. The remaining team members will form an SDLC research subteam and begin researching traditional waterfall, spiral, and agile/extreme programming approaches and the security considerations that apply to standalone, client-server, and distributed processing solutions.

Lab Subteam Report

The lab subteam will complete the Hadoop Sandbox tutorials in Appendix B. The students will work through the exercises to get acquainted with cloud-based computing environments. Prepare a lab report of your results, which will be a part of your entire project report.

Be sure to read all appendices before starting the lab, since parts of them will be relevant to each subteam. Appendix A provides background information on software system environments and traditional development paradigms. Appendix B has basic instructions for completing the lab. Appendix C has instructions on downloading and installing Oracle VirtualBox and the Hadoop Sandbox VM on a PC. Appendix D does the same for VMWare.

The lab consists of two parts located in the appendices. Complete each part and collect information to support building your report.

SDLC Subteam Research Report

The SDLC subteam should prepare this portion of your team report; however, the full team should contribute editing and constructive feedback. You should assess (compare and contrast) security issues in the life cycle of system solutions created within the software system environments. The life cycle consists of the concept phase, design phase, requirements phase, development implementation phase, initial operational capability (IOC) phase, IOC test phase, final operational capability (FOC) phase, deployment phase, user acceptance test (UAT) phase, and operational/maintenance/enhancements phase.

You should cover the following software system environments in your comparison: stand-alone, client/server, distributed, and cloud-based (see Appendix A).

In your discussions of security issues in the software system life cycle, you should also address the use of different development environments. Which environments make it easier to address security issues, and which are more problematic? (Provide citations and rationale in your discussions.) The development paradigms of interest are the traditional waterfall, spiral, agile/extreme programming, and cloud-based approaches described in Appendix A.

In this portion of your report, address several key security-related questions. Use tables or spreadsheets to summarize your data, but also discuss the information and findings in your analysis of the results. Answer the following questions in your report:

Cloud provider assurances:

  1. How could a cloud-based solution maintain a proper authentication system for its clients?

  2. How might a cloud-based solution ensure that one client's data is kept confidential and protected from other clients who also have access to the same data center?

  3. What types of assurances should a client expect that the security of the software components and utilities provided by the cloud-based solution provider will be consistently maintained, especially if the distributed systems are owned by and leased from other organizations?

Cloud provider confidentiality:

  1. Suppose a corporation called Medical Imaging (MI) processes and stores medical imaging data. How might MI keep other cloud subscribers from accessing MI's data?

  2. How could the Medical Imaging corporation manage the images split across multiple third-party ISPs?

Cloud provider security policy:

  1. How might another corporation obtain a level of security from the cloud-based ISP provider similar to what the Medical Imaging company receives? What are some common assurances and features? What suggestions on cloud-based solutions are offered in the NIST Special Publication 800 and 1800 series guidance? http://csrc.nist.gov/publications/PubsSPs.html

  2. How could the security policy defined by the cloud-based ISP provider be maintained and ensured at the application level? What types of agreements would be needed? What types of software, systems, and security testing would you require?

Your objective for the report is to compare and contrast the development environments for the traditional waterfall, spiral, agile/extreme programming, and cloud-based approaches for data processing implementations. Of particular interest will be your identification of computing security and access control concerns in protecting systems, data, and users. Consider at what stages of system planning, design, implementation, testing, deployment, and use you would want cybersecurity personnel involved in system creation. What would be the focus of the cybersecurity personnel in each stage of the acquisition and creation of the system (for each of the development environment approaches)? How might you do security testing in a distributed computing system, especially a cloud-based system where it may be uncertain where the distributed resources are located (and who owns them)?

Reflect on your use of computer systems from personal, business, or school experiences. Compare those systems to the questions you found yourself considering about the Hadoop cloud-based system while you were using it. Which do you consider more secure, and why? What might you do to enhance security and better assure customers of security in distributed and cloud-based systems?

Lab Resources

Lab Reference Information

Recommended Presentations for Your Review

Here are some recommended presentations for your review:

https://cloudsecurityalliance.org/education/white-papers-and-educational-material/courseware/

Recommended Reading for Your Review

Here are some recommended readings for your review and help with completing the lab exercise:

Security Guidance for Critical Areas of Focus in Cloud Computing (Version 3.0) https://downloads.cloudsecurityalliance.org/initiatives/guidance/csaguide.v3.0.pdf

NIST Cloud Computing Resource Center https://www.nist.gov/itl/cloud-computing

NIST SP 800-146: NIST Cloud Computing Synopsis and Recommendations http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-146.pdf

NIST SP 800-125: NIST Guide to Security for Full Virtualization Technologies http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-125.pdf

Real Security in Virtual Systems: A Proposed Model for a Comprehensive Approach to Securing Virtualized Environments http://iacis.org/iis/2008/S2008_1051.pdf

Cloud Security Challenges https://www.researchgate.net/profile/Gurudatt_Kulkarni/publication/239732061_Cloud_Security_Challenges/links/0046351c283daf1730000000.pdf

Cloud Security and Compliance: A Primer (SANS) https://www.sans.org/reading-room/whitepapers/analyst/cloud-security-compliance-primer-34910

Appendix A

Software System Environments (Background Context Information)

Stand-Alone Computing Environment

One of the simplest computing architecture environments for use and security is the stand-alone computer, without any networked connections. It is a self-contained environment, with one or more terminals for one or more concurrent local users. The boundaries are well established, and interactions with the applications, data, and hardware environment can be monitored both physically and logically. However, it has limited practical use in a network-connected world, where application processing and analysis require multiple, diverse data sources. In addition, all of the processing is on one computer system, which can slow the overall system as data sets grow larger and applications demand more memory and processor time.

Client- and Client/Server-Based

To off-load some of the data storage and processing requirements, developers realized that hosting applications and data on one or more separate computer systems would make the user's machine run faster. At first, a single server hosted many applications, each providing services to a smaller program running on the user's (client's) machine. This model expanded to the point where larger applications and data sets were given their own machine (server) to provide their software application service.

This client/server model was originally realized by connecting machines with (first) RS-232 and RS-488 data lines. Distances over 50 feet required amplifiers/repeaters to boost signals, due to signal strength loss in the cables. Manufacturers began developing their own unique data connection interface cards, but distances were still limited to a couple hundred feet of cable without repeaters.

The Advanced Research Projects Agency (ARPA), the forerunner of DARPA, worked with universities to develop a better solution to "network" computing resources together. Their effort, ARPANET, is the basis of the Internet that we have today. The concepts of network switches, hubs, and routers come from that research, in addition to the twisted-pair, coaxial, and RJ45 standard connections, and the first Requests for Comments (RFCs) that govern the Internet protocols we use today.

Distributed Computing Based

With the ability to network computers, the practice of using servers for processing- and data-storage-intensive applications flourished. System designers and developers "distributed" their applications across multiple servers, either in "server farms" at a facility with large temperature-, humidity-, and access-controlled rooms, or across multiple physical sites that were geographically dispersed.

As business continuity planning (BCP) and continuity of operations (COOP) considerations arose in business practices, there was an increasing desire for "hot" or "warm" backup systems that could be kept up to date with applications and data. These BCP/COOP sites could be switched into operation over Internet connections whenever needed, and business operations could continue with no downtime.

But there are other reasons to have distributed computing. Computer scientists and engineers realized that a "divide and conquer" approach to problem solving could speed up processing and also handle larger data sets; the approach lends itself to parallel processing. As an example, say you are asked to sort 100 random numbers. You could sort them all yourself, which might take the better part of an hour. Or, you could get 10 people together, give them 10 numbers each to sort quickly, and then take the results from each of the 10 people in the order of whichever person has the next number. You would be dividing up the data set, processing each piece, and then integrating each subset of results into one final, integrated set of sorted numbers. This divide-and-conquer approach would be much faster. In fact, one of the fastest sorting algorithms (QuickSort) takes this very same approach.
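The following is a minimal sketch in Python, for illustration only (it is not part of the lab), of the same divide-and-conquer idea: the data set is split into chunks, each chunk is sorted by a separate worker process, and the sorted pieces are then merged into one result.

    import random
    from heapq import merge
    from multiprocessing import Pool

    def sort_chunk(chunk):
        # Each "helper" sorts only its own small share of the numbers.
        return sorted(chunk)

    if __name__ == "__main__":
        numbers = [random.randint(0, 999) for _ in range(100)]
        workers = 10
        size = len(numbers) // workers
        chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]

        with Pool(workers) as pool:
            sorted_chunks = pool.map(sort_chunk, chunks)  # sort the pieces in parallel

        result = list(merge(*sorted_chunks))              # integrate the partial results
        assert result == sorted(numbers)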

So, distributed computing had other advantages: making use of multiple servers running the same applications on a subset of the data, to speed processing and/or to handle larger data set processing using divide-and-conquer techniques.

Another use of multiple servers running the same set of applications is one each of us relies on every day. When we query our favorite web search provider through its web server-based application, our connection to the service goes to a server that performs "load balancing." Load balancing divides the incoming user connections and requests among multiple servers in a facility, each executing the search engine applications for the web search provider and each with access to the same web search databases and identical application services. Here, the handling of large volumes of user requests is distributed to computing devices that run in parallel on different servers, which appear to the client user as one "service."
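As a toy sketch only (it reflects no particular vendor's implementation, and the server names are made up), round-robin load balancing can be pictured in a few lines of Python: requests are handed to a pool of identical back-end servers in turn, so the work is spread evenly while clients still see a single service.

    from itertools import cycle

    class RoundRobinBalancer:
        def __init__(self, backends):
            self._backends = cycle(backends)   # rotate through the server pool

        def route(self, request):
            backend = next(self._backends)     # pick the next server in the rotation
            return f"request {request!r} -> {backend}"

    balancer = RoundRobinBalancer(["search-01", "search-02", "search-03"])
    for query in ["cats", "dogs", "cloud security", "hadoop"]:
        print(balancer.route(query))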

Cloud-Based Processing

The next step in the evolution of distributed processing is to make the data processing infrastructure components (servers, applications, and networks) operate in parallel not only within a given company's server farm, but across resources distributed globally and even across service providers. Companies needing fast computing for very large data volumes enter into service-level agreement (SLA) contracts with Internet service providers (ISPs) that operate very large server farms distributed across a country, a continent, or the globe. These processing centers may be contracted out to vendors that work for the ISP.

The data processing using cloud resources depends upon the following criteria:

There are also three concepts of cloud resources and cloud architectures: private, public, and hybrid clouds.

(Note: if you hear in various technology news articles the term "government cloud," that is simply a private cloud architecture owned by one or more government organizations).

The Conceptual Reference Model

The diagram below presents an overview of the NIST cloud computing reference architecture, which identifies the major actors and their activities and functions in cloud computing. The diagram depicts a generic high-level architecture and is intended to facilitate an understanding of the requirements, uses, characteristics, and standards of cloud computing.

Source: NIST SP 500-292

The Cloud Conceptual Reference Model

The National Institute of Standards and Technology (NIST) gives this definition in the NIST Definition of Cloud Computing, Special Publication 800-145:

"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."

Service models are SaaS (software as a service), PaaS (platform as a service), and IaaS (infrastructure as a service). Architecture designs for cloud computing use different components, based in part on the services being consumed in the cloud. Each successive level, from SaaS to PaaS to IaaS, gives the user additional control over the underlying computing stack.

Source: NIST SP 500-292


Development Approaches

Traditional Waterfall Model

The traditional waterfall model is a software system creation process with six successive stages running from development through deployment: analysis, requirements specification, design, implementation, testing and integration, and operations and maintenance. The names and flow of these stages are depicted in the diagram below:

Spiral Model

Spiral design models have stages that reflect the waterfall model, except that there are multiple passes through those stages. In each pass, certain design and development criteria are prioritized for work in the process of creating the overall system. Spiral design models may use two spirals, or as many as are needed to create the system.

Agile/Extreme Programming Model

The agile design model is one category of extreme programming models. Agile is a popular approach today for systems engineering, software engineering, and hardware systems development efforts. It is characterized by having one or more teams work on prioritized components of the system to be developed. The prioritization may consider which system capabilities are needed soonest, which parts of the system implementation might be easiest to generate, and which components are deemed risky, are not commercially available, or have never been implemented before.

Agile is often chosen for smaller efforts that attempt to develop technologies, not readily available in the marketplace, for larger programs. Agile can also be used when rapid prototyping and development of a system is needed for trials in a field environment, as a proof of concept prior to building new systems based on its design.

Agile efforts are usually done in two-week to three-month "spins" or "increments." Some agile teams hold informal stand-up meetings each day (or at least once a week) to assess what team members tried, what worked, what needs a different approach, and how far along they are toward realizing a solution, or whether another solution needs to be considered. There is less formal requirements documentation, and there are fewer test plans and procedures, than in traditional or spiral approaches. Agile teams are more likely to capture design "issues" and needs using collaborative tools such as JIRA to keep track of requirements and progress.

Cloud-Based Model

Cloud-based development models take three approaches to developing and providing cloud-based services: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS).

References

Wikipedia. (n.d.). Software as a service. Retrieved from https://en.wikipedia.org/wiki/Software_as_a_service

Wikipedia. (n.d.). Platform as a service. Retrieved from https://en.wikipedia.org/wiki/Platform_as_a_service

Wikipedia. (n.d.). Infrastructure as a service. Retrieved from https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_.28IaaS.29

Cloud Computing Security

ENISA's Cloud Computing: Benefits, Risks and Recommendations for Information Security is a good report from the European Network and Information Security Agency (ENISA).

Other References of Interest in Cloud Computing Security:

This ends this part of the lab. Continue to the next part below.

Appendix B

Hands-On Cloud Environment Familiarization Exercise (Background Information)

Hadoop, distributed here by Hortonworks, is the basis of many cloud-based data processing environments today. Some familiarity with it and with how it works will be beneficial in your discussions of issues related to cloud-based security and awareness.

This lab introduces the Hortonworks Hadoop cloud environment. Hadoop is the foundation of many current cloud-based processing offerings, including Hortonworks, Cloudera, MapR, and Hadoop services on Azure and Amazon Web Services (AWS).

This lab should be conducted by two of your group members, so that your team can obtain "lessons learned" regarding the nature and use of cloud computing. This knowledge will be useful when your team considers the processing environment and security issues that come with the cloud-based model for providing computing solutions in industry.

While the tutorials have a number of steps, they are not difficult to follow. Working through them takes time, which is why only two team members should be dedicated to performing the exercise and giving the group feedback on their experiences. However, as time allows, every team member is encouraged to try out the tutorials.

If you choose to host the VM Sandbox on your own machine, you will need Oracle's VirtualBox, VMWare's VMPlayer, or VMWare Workstation, and you must understand how to configure the VirtualBox or VMWare tool to run the Sandbox VM. The Hadoop Sandbox VM requires at least 8 GB of RAM, and the tool hosting the VM (VirtualBox or VMWare) may require an additional 2 GB; the more RAM, the better the response and performance of the lab. Both the Hadoop Sandbox VM and Oracle VirtualBox are free to download and use. Instructions for downloading VirtualBox and installing the Hadoop VM are in Appendix C. Appendix D provides instructions on using VMWare.

If you have Oracle VirtualBox, VMWare's VMPlayer, or VMWare Workstation, you can also download the Sandbox to your personal computer and use it there. It can be found at: http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/?_bt=104841495198&_bk=hortonworks%20sandbox&_bm=e&_bn=g&gclid=COD9qLyf6c8CFdZZhgodejoK3g

In case you want to later place the Sandbox VM on your personal computer, the download URL is: http://hortonworks.com/products/hortonworks-sandbox/#install

"Your Goal and Objective, should you decide to undertake this cloud-based mission…"

The Sandbox tutorials are tried and tested by many. They will not "self-destruct in five minutes" (have fun learning and experimenting with the cloud environment).

Step-by-Step Instructions

We will walk through one of the Hortonworks tutorial exercises so that you can gain familiarity with the cloud environment and how it works. You can do this either by reviewing the tutorials or by completing them with the Hadoop Sandbox loaded on your own PC, as illustrated in Appendices C and D.

  1. Go to the "Hello World" Hortonworks tutorial area and review/complete the hands-on tutorial "Step 1: Learning the Ropes of the Hortonworks Sandbox." It can be found at http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/.

  2. Next review/complete the hands-on tutorial introduction, concepts and Lab 1 of "Step 2: Hadoop Tutorial – Getting Started with HDP." These can be found at http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/.

    Source: HortonWorks.

  3. The Hadoop Distributed File System is known as HDFS. Elements of the labs in Step 2 demonstrate how to move data in and out of the cloud environment (a sketch of typical commands appears after this list). Review/complete as many hands-on labs in Step 2 of the Hortonworks tutorial as you can to become familiar with the Hadoop cloud environment. Various applications have been developed for the Hadoop cloud environment. Some work on raw data sets (such as ETL tools, Spark, Apache NiFi, and Zeppelin), and others are SQL or NoSQL database applications, such as HBase, Hive, and Pig. The tutorial will use some of these applications.

  4. To learn more about security in Hadoop, review some of the information under the security tutorials. If time allows, try to review/complete some of the tutorials. The following are some suggested topics:

These can be found at http://hortonworks.com/hadoop-tutorial/securing-data-lake-auditing-user-access-using-hdp-security/.

Source: HortonWorks.
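For those completing the tutorials hands-on, the following is a rough sketch (not part of the Hortonworks tutorial itself) of the kind of HDFS commands used to move a file into and out of the cloud environment, driven from Python for convenience. The file name and HDFS directory shown are made-up examples.

    import subprocess

    def hdfs(*args):
        # Run an "hdfs dfs" command in the sandbox and return its output.
        result = subprocess.run(["hdfs", "dfs", *args],
                                capture_output=True, text=True, check=True)
        return result.stdout

    hdfs("-mkdir", "-p", "/tmp/lab4")                 # create a working directory in HDFS
    hdfs("-put", "-f", "trucks.csv", "/tmp/lab4/")    # copy a local file into HDFS
    print(hdfs("-ls", "/tmp/lab4"))                   # list what is stored there
    hdfs("-get", "/tmp/lab4/trucks.csv", "copy.csv")  # copy it back out to local disk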

Collect your data and prepare your report.

Appendix C, "Setting Up Oracle VirtualBox and Hadoop Sandbox on a Personal Computer," contains material from Set Up the Mininet Network Simulator by Brian Linkletter, which is used under the Creative Commons Attribution-NonCommercial 4.0 International license.

Appendix C

Setting Up Oracle VirtualBox and Hadoop Sandbox on a Personal Computer

If you decide to perform the Hadoop Sandbox exercises hands-on, you need to download and configure both Oracle VirtualBox and the Hadoop Sandbox. First, download the Oracle VirtualBox application.

Step-by-Step Instructions

  1. Go to https://www.virtualbox.org/wiki/Downloads and select the appropriate version of the VirtualBox binaries for your personal computer. VirtualBox is free and runs on various systems. Download the install file, and then click on it to load VirtualBox.

  2. Follow the prompts to finish the install.

  3. Download the Hadoop Sandbox VM and store it on your local PC. The Hadoop Sandbox can be downloaded for free at: http://hortonworks.com/downloads/#sandbox. Be sure to select Hortonworks Sandbox on a VM, and select the HDP on Hortonworks Sandbox version for VirtualBox.

  4. Start VirtualBox on your PC by double-clicking on the VirtualBox icon. Next, create a version of the Hadoop Sandbox virtual machine that will run in VirtualBox by importing the Hadoop Sandbox virtual machine into the VirtualBox program.

  5. Start the VirtualBox manager application on your host system.

Source: Oracle

Figure 1. VirtualBox Manager

  1. Next, import the Hadoop virtual machine by using the VirtualBox menu command File → Import Appliance.

  2. In the next screen, click the "Open appliance" button.

  3. Navigate to the folder containing the HDP_2.5_virtualbox.ova or similar file and select it.

    Source: Oracle

    Figure 2. Import Virtual Appliance Screen

  4. Then, click the "Continue" button to get to the Appliance Settings screen. Use the default settings, but you can change the virtual machine's name if you wish; changing the name from vm to Hadoop is recommended. Click on the "Import" button.

    Source: Oracle

    Figure 3. VM Settings

  5. After a few minutes, you will see the Hadoop VM you imported in the VirtualBox window.

  6. Now create a "host-only" network interface in VirtualBox. This creates a loopback interface on the host computer that can be used to connect the virtual machine to the host computer (or to other virtual machines).

  7. Open the VirtualBox preferences panel by using the VirtualBox menu command VirtualBox → Preferences.

    Source: Oracle

    Figure 4. VirtualBox Network Section for Host-Only Networks

  8. Click on the "Network" icon in the Preferences panel. Then, click on the small green "plus" sign on the right side of the window to add a new network adapter. An adapter named vboxnet0 will be created. The default settings should be acceptable.

    Source: Oracle

    Figure 5. Setting Host-Only Network for VM

  9. Check the settings by clicking on the small "screwdriver" icon on the right side of the window to edit the adapter's configuration. Make a note of the IP address. In this case, the default IP address used by VirtualBox for the first host-only adapter is 192.168.56.1/24.

    Source: Oracle

    Figure 6. VM Adapter Settings

  10. The DHCP server is enabled on the interface, and the Lower Address Bound is 192.168.56.101/24, so the virtual interface connected to the host-only network on the virtual machine will be assigned that IP address.

    Source: Oracle

    Figure 7. VM DHCP Server Settings

  11. For future use, note the following information:

  12. Now, add a network adapter to the Hadoop Sandbox virtual machine. In the VirtualBox Manager window, click on the Hadoop virtual machine and then click on the "Settings" icon at the top of the window. Click on the "Network" icon in the settings panel that appears. The virtual machine already has one interface defined; on the "Adapter 1" tab, you will see an interface set up as NAT.

    Source: Oracle

    Figure 8. Network Adapter 1 Settings

  13. The NAT interface allows the virtual machine to connect to the Internet. But to use Hadoop, we still need a way for the virtual machine to connect directly to the host computer, so we need to add another virtual adapter and connect it to the "host-only network" interface we created earlier.

  14. Click on the "Adapter 2" tab and, in the "Attached to:" field, select "Host-only network." This allows other programs running on your host computer to connect to the VM using SSH (a connection sketch follows this list). Since only one host-only network currently exists, VirtualBox will automatically select the vboxnet0 host-only network.

    Source: Oracle

    Figure 8. Setting Network Adapter 2 for Host-Only Settings

  15. Click the "OK" button. The network settings are now configured for the Hadoop Sandbox virtual machine. You may change some of the other settings if you want to, but the default values for all other settings will work well.

  16. Now start the Hadoop VM. In the VirtualBox Manager, select the Hadoop virtual machine and then click the "Start" button.

    Source: Oracle

    Figure 9. VirtualBox Manager

  17. The VM will boot up and present you with a login prompt.
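As a minimal sketch only (it assumes the paramiko SSH library is installed on the host, and the credentials shown are placeholders to be replaced with whatever the sandbox documentation specifies), the host-only network set up above can be exercised from Python like this:

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

    # 192.168.56.101 is the DHCP-assigned host-only address noted earlier.
    client.connect("192.168.56.101", username="root", password="CHANGE_ME")

    # Run a simple command inside the VM to confirm the connection works.
    stdin, stdout, stderr = client.exec_command("hostname && hdfs version")
    print(stdout.read().decode())
    client.close()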

Appendix D

Instructions for using VMWare and the Hadoop Tutorial

These are alternative instructions for completing the lab hands-on using VMWare on your own computer. The following are the steps for installing VMWare and then loading the Hadoop Sandbox VMs.

Step-by-Step Instructions

  1. Download the VM Player from the website: http://www.vmware.com/products/player/playerpro-evaluation.html

    Source: VMWare.

  2. Click on the Windows version Download Now link, save the file (if using Firefox, it will be saved into your Downloads folder). Click on the download to begin installation.

    Source: VMWare.

  3. Select Next…

    Source: VMWare.

  4. Accept and select Next again…

    Source: VMWare.

  5. Select Next again…

    Source: VMWare.

  6. Select Next again…

    Source: VMWare.

  7. Select Next again…

    Source: VMWare.

  8. Allow it to attempt the upgrade, so that the installation can proceed.

    Source: VMWare.

  9. If you have an earlier version, it will uninstall that, and then install the latest version.

    Source: VMWare.

  10. Click Finish to exit the installation. The free version does not require a license.

    Source: VMWare.

  11. You should see a VMWare Workstation icon on your desktop. Click on it to run it.

    Source: VMWare.

  12. Enter a valid e-mail address. Then, you will be able to select Continue to finish installation of the free version. The VM Player will run. If you had a prior installation, you will see any existing VMs that are already installed.

    Source: VMWare.

  13. To run an existing VM, in the left pane, just double-click on it.

    Note: When you are in a VM, to move the mouse cursor back to the host OS, press CTRL and ALT together. To move the cursor back inside the VM window, double-click inside the VM window.

  14. There are two primary ways to install new VMs: 1) from an ISO image file (usually used to install an operating system such as Linux or MS-Windows; you can also install from an install CD/DVD disk); and 2) from a previously created VM. For method one, click on Create a New Virtual Machine. For method two, click on Open a Virtual Machine. Once created, the VM will show up in the list of installed VMs in the left pane.

Two Examples for Loading the Hadoop Sandbox VMs

Example 1: Hortonworks Hadoop VM Sandbox Installation

Here is an example of the installation of the Hortonworks Hadoop VM Sandbox, as downloaded from the source webpage http://hortonworks.com/products/sandbox/. When downloaded, the latest Hadoop Cloud Sandbox is a file called HDP_2.5_vmware.ova. We will use VM Player to Open a Virtual Machine, then browse to the location of this file that was downloaded from the sandbox website.

  1. Click Open.

    Source: HortonWorks

  2. Then click Import on the following screen.

  3. The VM will install into your VM Player.

    Source: VMWare

  4. Once installed, highlight (select) the Hadoop (HDP) VM in the left pane, and then click on the green triangle to Play virtual machine. Notice that it tells you this VM will require 8GB of physical memory on your machine.

    Source: VMWare

  5. If you see a software updates message, you can click on Remind Me Later (we do not need this to continue).

This Hadoop VM uses CentOS Linux 7 as its operating system. It will load when the VM starts, and it may take some time.

Source: VMWare

Example 2: MapR Hadoop VM Sandbox Installation

This example will show you how to install the MapR version of the Hadoop sandbox. The source for the MapR Sandbox is located at https://www.mapr.com/products/mapr-sandbox-hadoop/download (download the version for VMWare). When downloaded, the latest Hadoop Cloud Sandbox is a file called MapR-Sandbox-For-Hadoop-5.2.0-vmware.ova. We will use VM Player to Open a Virtual Machine, then browse to the location of this file that was downloaded from the sandbox website.

  1. Click Open.

    Source: HortonWorks

  2. Then click Import on the following screen:

  3. The VM will install into your VM Player.

    Source: VMWare.

  4. Once installed, highlight (select) the MapR Hadoop VM in the left pane, and then click on the green triangle to Play virtual machine. Notice that it tells you this VM will require 6GB of physical memory on your machine.

  5. If you see a software updates message, you can click on Remind Me Later (we do not need this to continue).

  6. This MapR Hadoop VM uses CentOS Linux 6.7 as its operating system. It will load when the VM starts; it may take some time.

Source: VMWare.