Team Assignment Rules:
A team assignment is required for every student as an important component of graduate education.
Team memberships are decided by the instructor objectively, according to the alphabetical order of students' last names. The instructor may adjust team membership to account for dropped or nonparticipating students.
No personal preferences from any student will be accepted by the instructor in deciding the team memberships.
Each team will have four or five people, in that order of preference. The instructor may use a different team size only when necessary. The initial team-size difference between any two teams should be no more than one.
Each team will select a team leader to coordinate the progress and outline the milestones within two weeks after receiving the team assignment.
The individual grade on the team assignment may vary based on student peer reviews and instructor evaluation of actual individual participation and contribution.
Complete the Hadoop cloud sandbox tutorials.
Use manuals and general guidance to generate an experimental results report.
Learn cloud computing needs and uses.
Identify security issues and access control concerns to protect systems, data and users.
Apply security to software application acquisition, design and development effort models.
Compare and contrast various development environments.
Complete cloud computing exercises.
Prepare and submit a report on findings.
Competencies: software development life cycle (SDLC)
As you perform this lab, you will reinforce the concepts you learned in Project 4, especially those regarding the SDLC and the software development environment of cloud systems. You will experiment with the Hadoop Sandbox to learn about cloud environments and security.
In this lab, you will learn how a cloud-based environment works and how the SDLC and design concepts apply in that environment. To accomplish this, a lab subteam of two people will learn the Hadoop Sandbox tools by completing several tutorials. The remaining team members will research information on the SDLC and development methods for preparing the final lab report.
There are two options for the lab subteam to choose from. Option 1 is to review the Hadoop Sandbox tutorials as described in Appendix B and complete the lab report. Option 2 is to download and run the Hadoop Sandbox VM on a local PC. The Hadoop Sandbox can be freely downloaded and run on a VM product such as Oracle VirtualBox or VMWare. Instructions for doing this can be found in Appendices B and C.
Warning: The UMUC virtual lab support team cannot provide any computer assistance when using your own PC for this lab.
All team members should familiarize themselves with the resources provided in the Lab Resources and Appendix A sections of this document. The open-source links there will help you understand the tools you will use in this lab. When finished with the cloud VM sandbox, record lessons learned about what the cloud environment implies for security concerns and issues in cloud-distributed environments (private, public, and hybrid clouds, and service-level agreements with providers).
Two team members will learn the Hadoop Sandbox tools by completing several tutorials; they will form the lab subteam. The remaining team members will form an SDLC research subteam and begin researching traditional waterfall, spiral, and agile/extreme programming approaches, along with the security considerations that apply to standalone, client-server, and distributed processing solutions.
Lab Subteam Report
The lab subteam will complete the Hadoop Sandbox tutorials in Appendix B. The students will work through the exercises to get acquainted with cloud-based computing environments. Prepare a lab report of your results, which will be a part of your entire project report.
Be sure to read all appendices before starting the lab since parts of it will be relevant to each subteam. Appendix A describes background information on software system environments and traditional development paradigms. Appendix B has basic instructions for completing the lab. Appendix C has instructions on downloading and installing Oracle VirtualBox and Hadoop Sandbox VM on a PC. Appendix D does the same for use of VMWare.
The lab consists of two parts located in the appendices. Complete each part and collect information to support building your report.
SDLC Subteam Research Report
The SDLC subteam should prepare this portion of your team report; however, the whole team should contribute through editing and constructive feedback. You should assess (compare and contrast) security issues in the life cycle of system solutions created in different software system environments. The life cycle consists of the concept phase, design phase, requirements phase, development implementation phase, initial operational capability (IOC) phase, IOC test phase, final operational capability (FOC) phase, deployment phase, user acceptance test (UAT) phase, and operational/maintenance/enhancements phase.
You should cover the following software system environments in your comparison:
client-server model computing
distributed computing model
cloud computing model (the focus should be primarily on this model, as it is the one currently used in industry).
In your discussions of security issues in the software system life cycle, you should also address the use of different development environments. Which environments make it easier to address security issues, and which are more problematic? (Provide citations and rationale in your discussion.) The development paradigms of interest include:
traditional waterfall model
agile/extreme programming model
In this portion of your report, address several key security-related questions. Use tables or spreadsheets to summarize your data, but also discuss the information and findings in your analysis of the results. Answer the following questions in your report:
Cloud provider assurances:
How could a cloud-based solution maintain a proper authentication system for its clients?
How might a cloud-based solution ensure that one client's data is kept confidential and protected from other clients who also have access to the same data center?
What types of assurance would a client expect that the security of the software components and utilities provided by the cloud-based solution provider will be consistently maintained, particularly if the underlying distributed systems are owned by, or leased from, other organizations?
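One common mechanism behind cloud client authentication is shared-secret request signing, in the spirit of (but much simpler than) schemes such as AWS Signature Version 4. The sketch below is illustrative only; the function names, the signed fields, and the key values are assumptions, not any provider's actual API.

```python
import hashlib
import hmac

# Simplified sketch of shared-secret request signing. The provider issues each
# tenant a secret key; the client signs each request, and the provider
# recomputes the signature to authenticate the caller. All names illustrative.

def sign_request(secret_key, method, path, timestamp):
    """Client side: derive a signature the provider can independently verify."""
    message = "\n".join([method, path, timestamp]).encode()
    return hmac.new(secret_key.encode(), message, hashlib.sha256).hexdigest()

def verify_request(secret_key, method, path, timestamp, signature):
    """Provider side: recompute and compare in constant time."""
    expected = sign_request(secret_key, method, path, timestamp)
    return hmac.compare_digest(expected, signature)

sig = sign_request("tenant-secret", "GET", "/images/42", "2017-01-01T00:00:00Z")
assert verify_request("tenant-secret", "GET", "/images/42",
                      "2017-01-01T00:00:00Z", sig)
assert not verify_request("wrong-secret", "GET", "/images/42",
                          "2017-01-01T00:00:00Z", sig)
```

Including a timestamp in the signed message is what lets the provider reject replayed requests, one of the assurances a client would ask about.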
Cloud provider confidentiality:
Suppose a corporation called Medical Imaging (MI) processes and stores medical imaging data in the cloud. How might MI keep other cloud subscribers from accessing MI's data?
How could the Medical Imaging corporation manage the images split across multiple third-party ISPs?
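One way to reason about images split across multiple third-party providers is secret sharing: give each provider a share that is individually meaningless, so no single provider can reconstruct the data. The two-share XOR scheme below is a minimal sketch of that idea, not a production technique (real systems would use erasure coding or encryption with managed keys).

```python
import os

# Illustrative two-share XOR splitting: provider A stores a random pad,
# provider B stores the data XORed with that pad. Neither share alone
# reveals anything about the plaintext; both are needed to reconstruct.

def split(data):
    pad = os.urandom(len(data))                        # share for provider A
    share_b = bytes(d ^ p for d, p in zip(data, pad))  # share for provider B
    return pad, share_b

def reconstruct(share_a, share_b):
    return bytes(a ^ b for a, b in zip(share_a, share_b))

image = b"DICOM image bytes..."
share_a, share_b = split(image)
assert reconstruct(share_a, share_b) == image
```

The trade-off to discuss in your report: this protects confidentiality against any single provider, but availability now depends on both providers being reachable.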
Cloud provider security policy:
How might another corporation verify that the cloud-based ISP provides the Medical Imaging company a comparable level of security? What are some common assurances and features? What suggestions on cloud-based solutions appear in the NIST Special Publication 800 and 1800 series guidance? http://csrc.nist.gov/publications/PubsSPs.html
How could the security policy defined by the cloud-based ISP provider be maintained and ensured at the application level? What types of agreements would be needed? What types of software, systems, and security testing would you require?
Your objective for the report is to compare and contrast the development environments for the traditional waterfall approach, spiral approach, agile/extreme programming approach, and cloud-based environment approaches for data processing implementations. Of particular interest will be your identification of computing security and access control concerns to protect systems, data, and users. Consider at what stages of system planning, design, implementation, testing, deployment, and use you would want cybersecurity personnel involved in system creation. What would be the focus of the cybersecurity personnel in each stage of the acquisition and creation of the system (in each of the development environment approaches)? How might you do security testing in a distributed computing system, especially for cloud-based systems, where the locations and owners of the distributed resources may be uncertain?
Reflect on your use of computer systems from personal, business, or school experiences. Compare those systems to what you might have been considering and wondering about the Hadoop cloud-based system when you were using it. Which do you consider more secure, and why? What might you do to enhance and better assure customers of security in distributed and cloud-based systems?
The tutorials should be available from www.hortonworks.com/tutorials/
Lab Reference Information
Recommended Presentations for Your Review
Here are some recommended presentations for your review:
Recommended Reading for Your Review
Here are some recommended readings for your review and help with completing the lab exercise:
Security Guidance for Critical Areas of Focus in Cloud Computing (Version 3.0) https://downloads.cloudsecurityalliance.org/initiatives/guidance/csaguide.v3.0.pdf
NIST Cloud Computing Resource Center https://www.nist.gov/itl/cloud-computing
NIST SP 800-146: NIST Cloud Computing Synopsis and Recommendations http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-146.pdf
NIST SP 800-125: NIST Guide to Security for Full Virtualization Technologies http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-125.pdf
Real Security in Virtual Systems: A Proposed Model for a Comprehensive Approach to Securing Virtualized Environments http://iacis.org/iis/2008/S2008_1051.pdf
Cloud Security Challenges https://www.researchgate.net/profile/Gurudatt_Kulkarni/publication/239732061_Cloud_Security_Challenges/links/0046351c283daf1730000000.pdf
Cloud Security and Compliance: A Primer (SANS) https://www.sans.org/reading-room/whitepapers/analyst/cloud-security-compliance-primer-34910
Software System Environments (Background Context Information)
Stand-Alone Computing Environment
One of the simplest computing architecture environments for use and security is the stand-alone computer, without any networked connections. It is a self-contained environment, with one or more terminals for one or more concurrent local users. The boundaries are well established, both physically and logically, and interactions with the applications, data, and hardware environment can be monitored. However, it has limited practical use in a network-connected world with multiple, diverse data sources required for application processing and analysis. In addition, all the processing is on one computer system, which can slow the overall system with increasingly large data sets and demanding memory and processor use by applications.
Client- and Client/Server-Based
To off-load some of the data storage and processing requirements, developers realized that hosting the applications and data on one or more separate computer systems would make the user's machine run faster. At first, a single server hosted many applications, each providing services to a smaller program on the user's (client's) machine. This model expanded so that larger applications and data sets were given their own machine (server) to provide their software application service.
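The client/server interaction described above can be sketched in a few lines of Python: a "server" hosts a service (here, a trivial uppercasing function standing in for a real application) and a "client" sends it work over a socket. This is a minimal local sketch, not a real application protocol.

```python
import socket
import threading

# Minimal client/server exchange on the local machine: the server hosts an
# "application service" and the client sends it a request over a socket.

def serve_once(sock):
    conn, _ = sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())   # the hosted "application service"

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=serve_once, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"hello server")
reply = client.recv(1024)
client.close()
t.join()
server.close()

assert reply == b"HELLO SERVER"
```

Everything the stand-alone model did on one machine is now split across a network boundary, which is exactly where the new security concerns (authentication, eavesdropping, access control) enter.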
This client-server model was originally realized by connecting machines with (at first) RS-232 and IEEE-488 data lines. Distances over 50 feet required amplifiers/repeaters to boost signals because of signal strength loss in the cables. Manufacturers began developing their own proprietary data connection interface cards, but distances were still limited to a couple hundred feet of cable without repeaters.
The Advanced Research Projects Agency (ARPA), the forerunner of DARPA, worked with universities to develop a better solution for "networking" computing resources together. Their effort, ARPANET, is the basis of the Internet we have today. The concepts of network switches, hubs, and routers come from that research, as do twisted-pair and coaxial cabling, the RJ45 standard connector, and the first Requests for Comments (RFCs) that govern the Internet protocols we use today.
Distributed Computing Based
With the ability to network computers, the practice of using servers for processing- and data-storage-intensive applications flourished. System designers and developers "distributed" their applications on multiple servers, either in "server farms" at a facility with large temperature-, humidity-, and access-controlled rooms, or across multiple geographically dispersed physical sites.
As business continuity planning (BCP) and continuity of operations (COOP) considerations arose in business practice, there was an increased desire to have "hot" or "warm" backup systems kept up to date with applications and data. These BCP/COOP sites could be switched into operation over Internet connections whenever needed, and business operations could continue with no downtime.
But there are other reasons to have distributed computing. Computer scientists and engineers realized that a "divide and conquer" approach to problem solving could speed up processing and also tackle larger data sets. The approach is based on parallel processing. As an example, let's say you have 100 random numbers that you are asked to sort. You could sort them all by yourself, and it might take you the better part of the next hour to do. Or, you could get 10 people together, give them each 10 numbers, have them each sort their numbers quickly, and then take the results from the 10 people in the order of whichever person has the next number. The result would be dividing up the data set, processing each subset, then integrating the subsets of results into one final, sorted result set. This divide-and-conquer approach would be much faster. In fact, the classic merge sort algorithm takes this very same approach.
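The 100-numbers example can be written directly in Python: split the data among "workers," sort each chunk independently (in a real distributed system these sorts would run in parallel on different servers), then merge the sorted chunks into one result.

```python
import heapq
import random

# Divide and conquer, as in the 100-numbers example: split the data among
# ten "workers," sort each chunk independently, then merge the results.

numbers = random.sample(range(1000), 100)

chunks = [numbers[i:i + 10] for i in range(0, 100, 10)]  # 10 workers, 10 each
sorted_chunks = [sorted(chunk) for chunk in chunks]      # done "in parallel"
result = list(heapq.merge(*sorted_chunks))               # integrate results

assert result == sorted(numbers)
```

`heapq.merge` plays the role of the person collecting the next number from whichever worker holds it; in a cluster, frameworks like Hadoop MapReduce perform this split-process-merge pattern across many machines.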
So, distributed computing had other advantages: making use of multiple servers running the same applications on a subset of the data, to speed processing and/or to handle larger data set processing using divide-and-conquer techniques.
Another use of multiple servers with the same set of applications is one each of us relies on every day. When we query our favorite web search provider through their web server-based application, our connection to the service goes to a server that does "load balancing." This load balancing divides the incoming user connections and requests among multiple servers in a facility, each executing the search engine applications for the web search provider, and each with access to the same web search databases and identical application services. Here, the handling of large volumes of user requests is distributed to computing devices that run in parallel on different servers, which appear as one "service" to the client user.
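The simplest load-balancing strategy, round-robin, can be sketched in a couple of lines: requests are handed to identical backend servers in turn, while clients only ever see the single front-end "service." Server names here are made up for illustration.

```python
import itertools

# Toy round-robin load balancer: incoming requests are assigned to identical
# backend servers in rotation. Server names are illustrative.
servers = ["search-01", "search-02", "search-03"]
next_server = itertools.cycle(servers)

assignments = [(request_id, next(next_server)) for request_id in range(7)]

assert [s for _, s in assignments[:4]] == \
    ["search-01", "search-02", "search-03", "search-01"]
```

Production load balancers use richer strategies (least connections, health checks, session affinity), but the security-relevant point is the same: the balancer is a single trusted choke point in front of many identical backends.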
The next step in the evolution of distributed processing is to place data processing infrastructure components (servers, applications, and networks) not only in parallel within a given company's server farm, but distributed globally, even across service providers. Companies needing fast computing for very large data volumes will enter into service-level agreement (SLA) contracts with Internet service providers (ISPs), which have very large server farms distributed throughout a country, a continent, or the globe. These processing centers may be contracted out to vendors that work for the ISP.
The data processing using cloud resources depends upon the following criteria:
How fast and how many processors are used.
The speed of any disk caching and physical disk media.
The speed of network connections between cloud resources and the Internet.
The regulation and load balancing of processing requests across available computing resources.
The confidentiality, availability, and integrity of cloud computing resources for both applications and data handling.
Any restrictions placed on the cloud processing based upon customer requirements to have data encrypted and/or processed only in certain geographic areas.
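The last criterion above, restricting processing to certain geographic areas, amounts to a scheduling filter: only resources in permitted regions are eligible for a customer's workload. The sketch below illustrates the idea; the resource and region names are invented for the example.

```python
# Sketch of enforcing a customer's geographic-processing restriction:
# only schedule work on cloud resources in regions the SLA permits.
# All identifiers below are illustrative.
resources = [
    {"id": "node-1", "region": "us-east"},
    {"id": "node-2", "region": "eu-west"},
    {"id": "node-3", "region": "ap-south"},
]
allowed_regions = {"us-east", "eu-west"}   # taken from the customer's SLA

eligible = [r["id"] for r in resources if r["region"] in allowed_regions]
assert eligible == ["node-1", "node-2"]
```

In practice the hard problem is verification, not filtering: the customer must trust (or audit) the provider's claim about where each resource physically resides.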
There are also three concepts of cloud resources and cloud architectures:
The private cloud – contained within the boundaries of a corporation or individual residence.
The public cloud – shared resources available to the public to contain and/ or process data (for storage, think of iTunes iCloud, DropBox, and Microsoft Live Drive).
The hybrid cloud – which uses a mix of public and private clouds to provide a solution for the user.
(Note: if you hear the term "government cloud" in technology news articles, that is simply a private cloud architecture owned by one or more government organizations.)
The Conceptual Reference Model
The diagram below presents an overview of the NIST cloud computing reference architecture, which identifies the major actors and their activities and functions in cloud computing. The diagram depicts a generic high-level architecture and is intended to facilitate understanding of the requirements, uses, characteristics, and standards of cloud computing.
Source: NIST SP 500-292
The Cloud Conceptual Reference Model
The National Institute of Standards and Technology (NIST) gives this definition in the NIST Definition of Cloud Computing, Special Publication 800-145:
"Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models."
Service models are SaaS (software as a service), PaaS (platform as a service), and IaaS (infrastructure as a service). Architecture designs for cloud computing use different components, based in part on the services being used in the cloud. Each successive level, from SaaS to PaaS to IaaS, gives the user additional control over the underlying computing stack.
Source: NIST SP 500-292
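The difference in control across the three service models can be sketched as a responsibility split: which layers the provider manages versus which the subscriber controls. This is a deliberate simplification; the layer names and exact boundaries below are illustrative, not NIST's formal taxonomy.

```python
# Rough responsibility split across the three service models: the layers the
# provider manages versus those the subscriber controls. Simplified sketch.
layers = ["application", "runtime", "operating system", "hardware/network"]

managed_by_provider = {
    "SaaS": {"application", "runtime", "operating system", "hardware/network"},
    "PaaS": {"runtime", "operating system", "hardware/network"},
    "IaaS": {"hardware/network"},
}

def subscriber_controls(model):
    """Layers left to the subscriber: more layers means more control."""
    return [l for l in layers if l not in managed_by_provider[model]]

assert subscriber_controls("SaaS") == []
assert subscriber_controls("PaaS") == ["application"]
assert subscriber_controls("IaaS") == ["application", "runtime",
                                       "operating system"]
```

For the security analysis in your report, each layer the subscriber controls is also a layer the subscriber must secure.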
Traditional Waterfall Model
The traditional waterfall model is a software system creation process with six successive stages running from development through deployment: analysis, requirements specification, design, implementation, testing and integration, and operations and maintenance. The names and flow of these stages are depicted in this diagram:
Spiral Model
Spiral design models have stages that mirror the waterfall model, except that there are multiple passes through those stages. In each pass, certain design and development criteria are prioritized in the process of creating the overall system. Spiral design models may use two spirals, or as many as are needed to create the system.
Agile/Extreme Programming Model
The agile design model belongs to the category of extreme programming models. Agile is a popular approach today for systems engineering, software engineering, and hardware systems development efforts. It is characterized by having one or more teams work on prioritized components of a system to be developed. The prioritization may weigh which capabilities are needed soonest, which parts of the system might be easiest to implement, and which components are deemed risky, are not commercially available, or have never been attempted before.
Agile is often the method chosen for smaller efforts that attempt to develop technologies, not readily available in the marketplace, for larger programs. Agile can also be used when rapid prototyping and development of a system is needed for field trials, as a proof of concept prior to building new systems based on its design.
Agile efforts are usually done in two-week to three-month "sprints" or "increments." Some agile teams hold informal stand-up meetings each day (or at least once a week) to assess what team members tried, what worked, what needs a different approach, and how far along they are toward a solution, or whether another solution should be considered. There is less formal requirements documentation, and there are fewer test plans and procedures, than in traditional or spiral approaches. Agile teams are more likely to capture design "issues" and needs using collaborative tools such as JIRA to track requirements and progress.
Cloud-based development models take three approaches to developing and providing cloud-based services:
Software (Applications) as a Service (SaaS) -- sometimes called "software on demand," is a licensing and delivery model in which centrally hosted software is licensed to subscribers. SaaS is typically accessed by users using a thin client (a computer built for remote access to a server) via a web browser. SaaS is commonly used in businesses for office and messaging software, payroll processing software, DBMS software, management software, CAD software, development software, gamification, virtualization, accounting, collaboration, customer relationship management (CRM), management information systems (MIS), enterprise resource planning (ERP), invoicing, human resource management (HRM), talent acquisition, content management (CM), antivirus software, and service desk management. SaaS has been incorporated into the strategy of nearly all leading enterprise software companies (Wikipedia, SaaS, n.d.).
Platform (Computing Servers) as a Service (PaaS) -- provides a cloud computing platform that allows clients to develop, run, and manage applications without having to build the infrastructure to host and develop them. PaaS provides the networks, servers, storage, OS, "middleware" (e.g., Java runtime, .NET runtime, integration services), database, and other services to host the consumer's application (Wikipedia, PaaS, n.d.). The VM machines created for UMUC labs are an example of PaaS.
Infrastructure (Computing Resources, Networks) as a Service (IaaS) -- IaaS-cloud providers supply these resources on-demand from their large pools of equipment installed in data centers (Wikipedia, IaaS, n.d.).
Wikipedia. (n.d.) Software as a service. Retrieved from https://en.wikipedia.org/wiki/Software_as_a_service
Wikipedia. (n.d.) Platform as a service. Retrieved from https://en.wikipedia.org/wiki/Platform_as_a_service
Wikipedia. (n.d.). Infrastructure as a service. Retrieved from https://en.wikipedia.org/wiki/Cloud_computing#Infrastructure_as_a_service_.28IaaS.29
Cloud Computing Security
ENISA's Cloud Computing: Benefits, Risks and Recommendations for Information Security is a good report from the European Network Information Security Agency.
Other References of Interest in Cloud Computing Security:
NIST Cloud Computing Reference Architecture
Crafting and Implementing a Policy to Reduce Cyber Risks
Defense for Distributed Denial of Service Attacks in Cloud Computing
7 Security Measures to Protect Your Servers
This ends this part of the lab. Continue to the next part below.
Hands-On Cloud Environment Familiarization Exercise (Background Information)
Apache Hadoop, distributed by Hortonworks as the Hortonworks Data Platform (HDP), underlies many cloud-based data processing environments today. Some familiarity with it and how it works will be beneficial in your discussions of issues related to cloud-based security and awareness.
This lab introduces the Hortonworks Hadoop cloud environment. Hadoop is the foundation of many current cloud-based processing systems, including Hortonworks, Cloudera, MapR, and Hadoop services on Azure and Amazon Web Services (AWS).
This lab should be conducted by two of your group members, so that your team can obtain "lessons learned" regarding the nature and use of cloud computing. This knowledge will be useful when your team considers the processing environment and security issues that exist with the cloud-based model in providing computing solutions in industry.
While the tutorials have a number of steps, they are not too difficult to follow. Working through them takes time, which is why only two team members should be dedicated to performing the exercise and providing the group feedback on their experiences. However, as time allows, every team member is encouraged to try the tutorials.
If you choose to host the VM Sandbox on your own machine, you will need Oracle VirtualBox, VMWare's VMPlayer, or VMWare Workstation, and you must understand how to configure the VirtualBox or VMWare tool to run the Sandbox VM. The Hadoop Sandbox VM requires at least 8 GB of RAM, and the tool hosting the VM (VirtualBox or VMWare) may require an additional 2 GB; the more RAM, the better the response and performance of the lab. Both the Hadoop Sandbox VM and Oracle VirtualBox are free to download and use. Instructions for downloading VirtualBox and installing the Hadoop VM are in Appendix C. Appendix D provides instructions on using VMWare.
If you have Oracle VirtualBox, VMWare's VMPlayer or VM Workstation, you might also download the Sandbox to your personal computer to use it also. It can be found at: http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/?_bt=104841495198&_bk=hortonworks%20sandbox&_bm=e&_bn=g&gclid=COD9qLyf6c8CFdZZhgodejoK3g
In case you want to later place the Sandbox VM on your personal computer, the download URL is: http://hortonworks.com/products/hortonworks-sandbox/#install
"Your Goal and Objective, should you decide to undertake this cloud-based mission…"
The Sandbox tutorials are tried and tested by many. They will not "self-destruct in five minutes" (have fun learning and experimenting with the cloud environment).
We will walk through one of the Hortonworks tutorial exercises so that you can gain familiarity with the cloud environment and how it works. You can either review the tutorials, or complete them with the Hadoop Sandbox loaded on your own PC as illustrated in Appendix C and Appendix D.
Go to the "Hello World" Hortonworks tutorial area and review/complete the hands-on tutorial "Step 1: Learning the Ropes of the Hortonworks Sandbox." It can be found at http://hortonworks.com/hadoop-tutorial/learning-the-ropes-of-the-hortonworks-sandbox/.
Next review/complete the hands-on tutorial introduction, concepts and Lab 1 of "Step 2: Hadoop Tutorial – Getting Started with HDP." These can be found at http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/.
The Hadoop Distributed File System is known as HDFS. There are elements of the labs in Step 2 that demonstrate how to move data in and out of the cloud environment. Review/complete as many hands-on labs in Step 2 of the Hortonworks tutorial as you can to become familiar with the Hadoop cloud environment. There are various applications developed for the Hadoop cloud environment. Some work on raw data sets (such as ETL tools, Spark, Apache NiFi, and Zeppelin), and others are SQL or NoSQL database applications, such as HBase, Hive, and Pig. The tutorial will use some of these applications.
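Besides the web UI used in the tutorials, HDFS exposes a REST interface called WebHDFS for moving data in and out. The sketch below only builds the request URLs so you can see the shape of the API; the host name and port are assumptions for a locally running sandbox, and the paths are illustrative.

```python
from urllib.parse import urlencode

# Building WebHDFS REST URLs for basic HDFS file operations. The host and
# port below are assumptions for a local Hortonworks Sandbox; adjust them
# to match your own VM. Paths are illustrative.
BASE = "http://sandbox.hortonworks.com:50070/webhdfs/v1"

def webhdfs_url(path, op, **params):
    """Compose a WebHDFS request URL for the given HDFS path and operation."""
    query = urlencode({"op": op, **params})
    return f"{BASE}{path}?{query}"

# List a directory, then create a file. Against a running sandbox, these
# URLs would be issued with an HTTP client such as urllib.request.
assert webhdfs_url("/user/maria_dev", "LISTSTATUS") == \
    "http://sandbox.hortonworks.com:50070/webhdfs/v1/user/maria_dev?op=LISTSTATUS"
assert "op=CREATE" in webhdfs_url("/tmp/data.csv", "CREATE", overwrite="true")
```

Note for your security discussion: by default such a REST endpoint identifies the caller only by a user-name query parameter unless Kerberos or a gateway such as Knox is configured, which is exactly what the security tutorials below address.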
To learn more about security in Hadoop, review some of the information under the security tutorials. If time allows, try to review/complete some of the tutorials. The following are some suggested topics:
Securing your Data Lake Resource & auditing User Access with HDP Advanced Security
Securing HDFS, Hive and HBase with Knox and Ranger
Tag based policies with Apache Ranger and Apache Atlas
Securing your Hadoop Infrastructure with Apache Knox
Securing JDBC and ODBC Clients Access to HiveServer2 using Apache Knox
Fine-Grained Permissions for HDFS Files in Hadoop using HDFS ACLs
These can be found at http://hortonworks.com/hadoop-tutorial/securing-data-lake-auditing-user-access-using-hdp-security/.
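The last tutorial topic, HDFS ACLs, can be previewed with a simplified model of how an ACL is evaluated: the owner entry is checked first, then named-user entries, then group entries, then "other." Real HDFS additionally applies a permission mask to non-owner entries; that detail, and all the names below, are omitted or invented for illustration.

```python
# Simplified model of HDFS-style ACL evaluation: owner entry first, then
# named users, then groups, then "other." Real HDFS also applies a mask
# to the non-owner entries; that is omitted here. Names are illustrative.

def permitted(acl, user, groups, action):
    if user == acl["owner"]["name"]:
        return action in acl["owner"]["perms"]
    if user in acl["named_users"]:
        return action in acl["named_users"][user]
    for g in groups:
        if g in acl["named_groups"]:
            return action in acl["named_groups"][g]
    return action in acl["other"]

acl = {
    "owner": {"name": "hive", "perms": {"read", "write"}},
    "named_users": {"maria_dev": {"read"}},
    "named_groups": {"analysts": {"read"}},
    "other": set(),
}

assert permitted(acl, "hive", [], "write")            # owner entry
assert permitted(acl, "maria_dev", [], "read")        # named-user entry
assert not permitted(acl, "maria_dev", [], "write")
assert not permitted(acl, "guest", [], "read")        # falls through to other
```

The fine-grained-permissions tutorial shows the real commands (`hdfs dfs -setfacl`, `-getfacl`) that create exactly these kinds of entries.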
Collect your data and prepare your report.
Appendix C, "Setting Up Oracle VirtualBox and Hadoop Sandbox on a Personal Computer," contains material from Set Up the Mininet Network Simulator by Brian Linkletter, which is used under the Creative Commons Attribution-NonCommercial 4.0 International license.
Setting Up Oracle VirtualBox and Hadoop Sandbox on a Personal Computer
If you decide to perform the Hadoop Sandbox exercises hands-on, you need to download and configure Oracle VirtualBox and then the Hadoop Sandbox. First, download the Oracle VirtualBox VM application.
Go to https://www.virtualbox.org/wiki/Downloads and select the appropriate version of VirtualBox binaries software for your personal computer. VirtualBox is free and can run on various systems. Download and then click on the install file to load VirtualBox.
Follow the prompts to finish the install.
Download the Hadoop Sandbox VM and store it on your local PC. The Hadoop Sandbox can be downloaded for free at: http://hortonworks.com/downloads/#sandbox. Be sure to select Hortonworks Sandbox on a VM, and choose the HDP on Hortonworks Sandbox version for VirtualBox.
Start VirtualBox on your PC by double-clicking on the VirtualBox icon. Next, create a version of the Hadoop Sandbox virtual machine that will run in VirtualBox by importing the Hadoop Sandbox virtual machine into the VirtualBox program.
Start the VirtualBox manager application on your host system.
Figure 1. VirtualBox Manager
Next, import the Hadoop virtual machine by using the VirtualBox menu command:
File → Import Appliance
In the next screen, click the "Open appliance" button.
Navigate to the folder containing the HDP_2.5_virtualbox.ova or similar file and select it.
Figure 2. Import Virtual Appliance Screen
Then, click the "Continue" button to get to the Appliance Settings screen. Use the default settings, but you can change the virtual machine's name, if you wish. I recommend changing the name from vm to Hadoop. Click on the "Import" button.
Figure 3. VM Settings
After a few minutes, you will see the Hadoop VM you imported in the VirtualBox window.
Now you must create a "host only" network interface in VirtualBox. This creates a loopback interface on the host computer that can be used to connect the virtual machine to the host computer (or to other virtual machines).
Open the VirtualBox preferences panel. Use the VirtualBox menu command:
VirtualBox → Preferences.
Figure 4. VirtualBox Network Section for Host-Only Networks
Click on the "Network" icon in the Preferences panel. Then, click on the small green "plus" sign on the right side of the window to add a new network adapter. An adapter called vboxnet0 will be created. The default settings should be acceptable.
Figure 5. Setting Host-Only Network for VM
Check the settings by clicking on the small "screwdriver" icon on the right side of the window to edit the adapter's configuration. Make a note of the IP address. In this case, the default IP address used by VirtualBox for the first host-only adapter is 192.168.56.1/24.
Figure 6. VM adapter settings
The DHCP server is enabled on the interface, and the Lower Address Bound is 192.168.56.101/24. So we know that the virtual interface connecting the virtual machine to the host-only network will be assigned that IP address.
Figure 7. VM DHCP Server Settings
For future use, note the following information:
Virtual Machine's virtual interface IP address on host-only network: 192.168.56.101/24
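The addresses noted above can be sanity-checked with Python's `ipaddress` module: the host's adapter (192.168.56.1) and the VM's DHCP-assigned address (192.168.56.101) sit on the same /24 network, which is why the host and VM can reach each other directly over the host-only interface.

```python
import ipaddress

# Verify that the host-only adapter address and the VM's DHCP-assigned
# address are on the same /24 network (VirtualBox defaults).
net = ipaddress.ip_network("192.168.56.0/24")
host_adapter = ipaddress.ip_address("192.168.56.1")
vm_address = ipaddress.ip_address("192.168.56.101")

assert host_adapter in net
assert vm_address in net
```

If either address fell outside the subnet, SSH and browser connections from the host to the sandbox would fail without any routing configured.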
Now, add a network adapter to the Hadoop Sandbox virtual machine. In the VirtualBox Manager window, click on the Hadoop virtual machine and then click on the "Settings" icon at the top of the window. Click on the "Network" icon in the settings panel that appears. The virtual machine already has one interface defined. On the "Adapter 1" tab, we see an interface set up as a NAT.
Figure 8. Network Adapter 1 Settings
This will allow the virtual machine to connect to the Internet. But to use Hadoop, we still need a way for the virtual machine to connect directly to the host computer. So, we need to add another virtual adapter and connect it to the "host-only network" interface we created earlier.
Click on the "Adapter 2" tab and, in the "Attached to:" field, select "Host-only network."This allows other programs running on your host computer to connect to the VM using SSH. Since only one host-only network is currently created, VirtualBox will automatically select the vboxnet0 host-only network.
Figure 9. Setting Network Adapter 2 for Host-Only Settings
Click the "OK" button. Now the network settings are configured for the Hadoop Sandbox virtual machine. You may change some of the other settings if you want to but the default values for all other settings will work well.
Now start the Hadoop VM. In the VirtualBox Manager, select the Hadoop virtual machine and then click the "Start" button.
Figure 10. VirtualBox Manager
The VM will boot up and present you with a login prompt.
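Once the VM is up, you can verify the host-only link from the host and open an SSH session. This is a sketch using the DHCP address noted earlier; the login user name varies by sandbox image, so check your sandbox's documentation before connecting.

```shell
#!/bin/sh
# Sketch: start the sandbox and reach it over the host-only network (run on the host).
# "Hadoop" is a placeholder VM name; the user name depends on the sandbox image.
VM_IP="192.168.56.101"   # first DHCP lease on vboxnet0 (see the note above)
if command -v VBoxManage >/dev/null 2>&1; then
  VBoxManage startvm "Hadoop" --type headless   # start without a console window
  ping -c 3 "$VM_IP"                            # confirm the host-only link is up
  ssh "root@$VM_IP"                             # substitute your sandbox's login user
else
  echo "VBoxManage not found; install VirtualBox first."
fi
```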
Instructions for using VMWare and the Hadoop Tutorial
These are alternative instructions for completing the hands-on portion of the lab using VMWare on your own computer. The following are the steps for installing VMWare and then the Hadoop tutorial sandboxes.
Download the VM Player from the website: http://www.vmware.com/products/player/playerpro-evaluation.html
Click on the Windows version Download Now link and save the file (if using Firefox, it will be saved to your Downloads folder). Click on the downloaded file to begin installation.
Accept the license agreement, then click Next through the remaining installer screens, keeping the defaults.
If an earlier version of VMWare is installed, allow the installer to upgrade it; it will uninstall the old version and then install the latest one.
Click Finish to exit the installation. The free version does not require a license.
You should see a VMWare Workstation icon on your desktop. Click on it to run it.
Enter a valid e-mail address; you will then be able to select Continue to finish installing the free version. The VM Player will run. If you had a prior installation, you will see any existing VMs that are already installed.
To run an existing VM, double-click on it in the left pane.
Note: When you are in a VM, to move the mouse cursor back to the host OS, press CTRL and ALT together. To move the cursor back inside the VM window, double-click inside the VM window.
There are two primary ways to install new VMs: 1) from an ISO image file (usually done to install an operating system such as Linux or MS Windows; you can also install from a CD/DVD install disk); and 2) from a previously created VM. For the first method, click on Create a New Virtual Machine. For the second, click on Open a Virtual Machine. Once created, the VM will appear in the list of installed VMs in the left pane.
Two Examples for Loading the Hadoop Sandbox VMs
Example 1: Hortonworks Hadoop VM Sandbox Installation
Here is an example of the installation of the Hortonworks Hadoop VM Sandbox, as downloaded from the source webpage http://hortonworks.com/products/sandbox/. At the time of writing, the downloaded Hadoop Cloud Sandbox is a file called HDP_2.5_vmware.ova. In VM Player, use Open a Virtual Machine and browse to the location of the file downloaded from the sandbox website.
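As an alternative to the GUI import, VMware's ovftool utility (shipped with several VMware products) can deploy an .ova appliance from the command line. A sketch, assuming the downloaded file sits in the current directory and ovftool is installed; the guard makes it safe to run anywhere.

```shell
#!/bin/sh
# Sketch: deploy the downloaded sandbox .ova without the GUI import dialog.
# Assumes VMware's ovftool is installed; guarded so it runs harmlessly elsewhere.
OVA_FILE="HDP_2.5_vmware.ova"
if command -v ovftool >/dev/null 2>&1; then
  ovftool "$OVA_FILE" .   # unpack the appliance into the current directory
else
  echo "ovftool not found; use the VM Player GUI import instead."
fi
```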
Then click Import on the following screen.
The VM will install into your VM Player.
Once installed, highlight (select) the Hadoop (HDP) VM in the left pane, and then click on the green triangle to Play virtual machine. Notice it tells you this VM will require 8GB of physical memory on your machine.
If you see this software updates message, you can click on Remind Me Later (we do not need this to continue).
This Hadoop VM uses CentOS Linux 7 as its operating system. It will load when the VM starts, and it may take some time.
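To confirm which OS release the sandbox is running once it boots, log in at the console and check the CentOS release file. A small sketch (meant to run inside the guest; guarded so it is harmless elsewhere):

```shell
#!/bin/sh
# Sketch: confirm the guest OS release (run inside the sandbox VM).
RELEASE_FILE="/etc/centos-release"
if [ -r "$RELEASE_FILE" ]; then
  cat "$RELEASE_FILE"   # prints the CentOS release string for this guest
else
  echo "No $RELEASE_FILE here; run this inside the CentOS guest."
fi
```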
Example 2: MapR Hadoop VM Sandbox Installation
This example will show you how to install the MapR version of the Hadoop sandbox. The MapR Sandbox is available from the webpage https://www.mapr.com/products/mapr-sandbox-hadoop/download (choose the download for a VMWare VM). At the time of writing, the downloaded Hadoop Cloud Sandbox is a file called MapR-Sandbox-For-Hadoop-5.2.0-vmware.ova. In VM Player, use Open a Virtual Machine and browse to the location of the file downloaded from the sandbox website.
Then click Import on the following screen:
The VM will install into your VM Player.
Once installed, highlight (select) the MapR Sandbox VM in the left pane, and then click on the green triangle to Play virtual machine. Notice it tells you this VM will require 6GB of physical memory on your machine.
If you see this software updates message, you can click on Remind Me Later (we do not need this to continue).
This MapR Hadoop VM uses CentOS Linux 6.7 as its operating system. It will load when the VM starts and may take some time.