Global Architecture and Technology Enablement Practice

Hadoop with Kerberos – Architecture Considerations

Document Type: Best Practice

Note: The content of this paper refers exclusively to the second maintenance release (M2) of SAS 9.4.

Contact Information

Name: Stuart Rogers
Title: Principal Technical Architect
Phone Number: +44 (0) 1628 490613
E-mail address: [email protected]

Name: Tom Keefer
Title: Principal Solutions Architect
Phone Number: +1 (919) 531-0850
E-mail address: [email protected]

Table of Contents

1 Introduction
  1.1 Purpose of the Paper
  1.2 Architecture Overview
2 Hadoop Security
  2.1 Kerberos and Hadoop Authentication Flow
3 Architecture Considerations
  3.1 SAS and Kerberos
  3.2 User Repositories
  3.3 Kerberos Distribution
  3.4 Operating System Integration with Kerberos
  3.5 Kerberos Topology
    3.5.1 SAS in the Corporate Realm
    3.5.2 SAS in the Hadoop Realm
  3.6 Encryption Strength and Java
4 Example Authentication Flows: Single Realm
  4.1 SAS DATA Step to Secure Hadoop
  4.2 SAS Enterprise Guide to Secure Hadoop
  4.3 SAS High-Performance Analytics
5 Questions That Must Be Addressed
  5.1 SAS Software Components
  5.2 Users
  5.3 Hadoop Nodes and SAS Nodes
6 References
7 Recommended Reading
8 Credits and Acknowledgements

1 Introduction

Note: The content of this paper refers exclusively to the second maintenance release (M2) of SAS 9.4.

1.1 Purpose of the Paper

This paper addresses the architecture considerations for setting up secure Hadoop environments with SAS products and solutions. Secure Hadoop refers to a deployment of Hadoop in environments where Kerberos has been enabled to provide strong authentication.

This paper includes the questions that you must address early in the design of your target environment. Responses to these questions will direct the deployment and configuration of the SAS products and solutions. The details of SAS deployment are outside the scope of this document; they are covered in the Deployment Considerations document.

Using Kerberos with Hadoop does not necessarily mean that Kerberos will be used to authenticate users into the SAS part of the environment. The Kerberos authentication takes place between SAS and Hadoop. (You can use Kerberos between the client and SAS to provide end-to-end Kerberos authentication, but this, too, is outside the scope of this document.)

In the secure Hadoop environment, SAS interacts in a number of ways. First, SAS code can be written to use SAS/ACCESS to Hadoop. This can make use of the LIBNAME statement or the PROC HADOOP statement. The LIBNAME statement can connect directly to HDFS, to Hive, or to HiveServer2. This SAS code can be processed interactively or in batch, or it can be distributed with SAS Grid Manager.

SAS In-Memory solutions can leverage a SAS High-Performance Analytics Environment and connect to the secure Hadoop environment. The SAS High-Performance Analytics nodes can connect in parallel to the secure Hadoop environment to process data. This connection can again be made directly to HDFS, via Hive, or via HiveServer2.
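The LIBNAME connection described above can be sketched as follows. This is a minimal, hypothetical example: the server name, port, schema, table, and Hive service principal are placeholders, and the exact LIBNAME options available depend on your release of SAS/ACCESS to Hadoop. The session must already hold a valid Kerberos TGT in its ticket cache.

```sas
/* Hypothetical sketch: connect to a Kerberos-secured HiveServer2.
   Server, port, schema, and hive_principal values are placeholders. */
libname hdp hadoop server="hiveserver2.example.com" port=10000
        schema=default
        hive_principal="hive/[email protected]";

/* Simple query against a placeholder table to confirm the connection */
proc sql;
   select count(*) as nrows from hdp.some_table;
quit;
```

Because SAS does not request tickets itself, no user name or password appears here; authentication is carried entirely by the Service Ticket that the operating system obtains on behalf of the SAS process.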

The first section of the paper provides a high-level overview of a secure Hadoop environment. The following sections address architecture considerations.

1.2 Architecture Overview

- SAS does not directly process Kerberos tickets. It relies on the underlying operating system and APIs.
- The operating system of SAS hosts must be integrated into the Kerberos realm structure of the secure Hadoop environment.
- A user repository that is valid across all SAS and Hadoop hosts is recommended rather than the use of local accounts.
- SAS does not directly interact with Kerberos. Microsoft Active Directory, MIT Kerberos, or Heimdal Kerberos can be used.
- The SAS process, either Java or C, must have access to the user's Ticket-Granting Ticket (TGT) via the Kerberos credentials cache.
- The SAS Java process needs the addition of the Unlimited Strength Encryption Policy files to work with 256-bit AES encryption.

2 Hadoop Security

Hadoop security is an evolving field, with most major Hadoop distributors developing competing projects. Some examples of such projects are Cloudera Sentry and the Hortonworks Knox Gateway. A common feature of these security projects is that they are based on having Kerberos enabled for the Hadoop environment.

The non-secure configuration relies on client-side libraries to send the client-side credentials, as determined from the client-side operating system, as part of the protocol. While not secure, this configuration is sufficient for many deployments that rely on physical security. Authorization checks through ACLs and file permissions are still performed against the client-supplied user ID.

After Kerberos is configured, Kerberos authentication is used to validate the client-side credentials. This means that the client must request a Service Ticket valid for the Hadoop environment and submit this Service Ticket as part of the client connection. Kerberos provides strong authentication in which tickets are exchanged between client and server. Validation is provided by a trusted third party in the form of the Kerberos Key Distribution Center.

To create a new Kerberos Key Distribution Center specifically for the Hadoop environment, follow the standard instructions in the Cloudera or Hortonworks documentation. See the following figure.

The Kerberos Key Distribution Center is used to authenticate both users and server processes. For example, the Cloudera 4.5 management tools include all the required scripts that are needed to configure Cloudera to use Kerberos. When you want Cloudera to use Kerberos, run these scripts after you register an administrator principal. This process can be completed in minutes after the Kerberos Key Distribution Center has been installed and configured.
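As a concrete illustration, a minimal MIT-style krb5.conf for a dedicated Hadoop realm might look like the following. All realm and host names here are hypothetical placeholders; every host in the environment (Hadoop nodes and SAS hosts alike) needs an equivalent, consistent configuration.

```
[libdefaults]
    default_realm = HADOOP.EXAMPLE.COM

[realms]
    HADOOP.EXAMPLE.COM = {
        kdc = kdc.hadoop.example.com
        admin_server = kdc.hadoop.example.com
    }

[domain_realm]
    .hadoop.example.com = HADOOP.EXAMPLE.COM
    hadoop.example.com = HADOOP.EXAMPLE.COM
```

The domain_realm mapping matters for Hadoop: service principals are derived from host names, so hosts must map to the correct realm.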

2.1 Kerberos and Hadoop Authentication Flow

The process flow for Kerberos and Hadoop authentication is shown in the diagram below. The first step, where the end user obtains a Ticket-Granting Ticket (TGT), does not necessarily occur immediately before the second step, where the Service Tickets are requested. Different mechanisms can be used to obtain the TGT. Some users run a kinit command after accessing the machine running the Hadoop clients. Others integrate the Kerberos configuration into the host operating system setup. In this case, the act of logging on to the machine that runs the Hadoop clients generates the TGT.

After the user has a Ticket-Granting Ticket, the client application accessing Hadoop services initiates a request for the Service Ticket (ST) that corresponds to the Hadoop service the user is accessing. The ST is then sent as part of the connection to the Hadoop service. The corresponding Hadoop service must then authenticate the user by decrypting the ST using the Service Key it exchanged with the Kerberos Key Distribution Center. If this decryption is successful, the end user is authenticated to the Hadoop service.
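On a host with the MIT Kerberos client tools, the first step of this flow looks like the following illustrative session. The realm, user name, cache location, and timestamps are hypothetical; the klist output format varies slightly between Kerberos distributions.

```
$ kinit [email protected]
Password for [email protected]:
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: [email protected]

Valid starting     Expires            Service principal
06/01/14 09:00:00  06/01/14 19:00:00  krbtgt/[email protected]
```

When the client application subsequently connects to a Hadoop service, the matching Service Ticket (for example, one for a hive/... service principal) is requested automatically and appears in the cache alongside the TGT; no further password entry is required.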

3 Architecture Considerations

The architecture for a secure Hadoop environment will include various SAS software products and solutions. At the time of writing, the products and solutions covered are as follows:

- SAS/ACCESS to Hadoop
- SAS High-Performance Analytics
- SAS Visual Analytics and SAS Visual Statistics

3.1 SAS and Kerberos

SAS does not manage Kerberos ticket caches, nor does it directly request Kerberos tickets. This is an important factor when you are considering how SAS will interact with a secure Hadoop environment. Some software vendors maintain their own ticket cache and deal with requesting Kerberos tickets directly. SAS does not do this. It relies on the underlying operating system and APIs to manage the Kerberos ticket caches and requests. By definition, there can be a delay between the initial authentication process with the Kerberos Key Distribution Center (KDC) and any subsequent request for a Service Ticket (ST). The initial Ticket-Granting Ticket (TGT) must be put somewhere, so it is put in the ticket cache. In Windows environments, this is a memory location. On most UNIX operating systems, this will be a file. Alternative configurations are possible with Windows to switch to using a file-based ticket cache.

If the SAS process cannot access the ticket cache, then the process cannot use the TGT to request an ST. There are two types of SAS processes that need access to the ticket cache. The first is launched by SAS Foundation when processing a Hadoop LIBNAME statement. The second is launched by a SAS High-Performance Analytics Environment when an In-Memory solution attempts to access Hadoop.
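One simple way to confirm that a SAS Foundation session can actually see the ticket cache is to run klist from within SAS itself. This is an illustrative, UNIX-only sketch; it assumes the klist command is on the PATH of the SAS process.

```sas
/* Illustrative check (UNIX): print the Kerberos ticket cache
   visible to this SAS session. If no TGT is listed in the output,
   a Hadoop LIBNAME statement will fail to authenticate. */
filename kl pipe 'klist 2>&1';

data _null_;
   infile kl;
   input;
   put _infile_;
run;
```

Running this under the same account and launch mechanism that will run the Hadoop workload (for example, a workspace server session) verifies both halves of the requirement: that a TGT exists and that the SAS process can read it.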
Both of these processes must be able to access the ticket cache. The following sections detail the architecture considerations for initializing these Kerberos ticket caches via the request for a TGT and then making them available to the SAS process.

3.2 User Repositories

In a secure Hadoop environment, the strong authentication provided by Kerberos means that processes will run as individual users across the Hadoop environment. Local user accounts can be used, but maintaining these accounts across a large number of hosts increases the chance for error. Therefore, it is recommended that you use a user repository to provide a central store for user details for the environment. This can be either an isolated user repository specifically for the Hadoop environment or the general corporate user repository. Knowing what type of user repository is being used is important for the configuration of the operating system across the environment.

The user repository can be LDAP or Active Directory. The benefit of using Active Directory is that it includes all of the Kerberos Key Distribution Center infrastructure. If you use an LDAP repository, you will have to use a separate implementation of the Kerberos Key Distribution Center. One drawback to using Active Directory is that the domain database does not normally store the required POSIX user attributes. These attributes are required for all users of the secure Hadoop environment because those users will be running operating system processes on the secure Hadoop environment. Microsoft provides details of mechanisms for storing the POSIX attributes in Active Directory.

3.3 Kerberos Distribution

You have three main options when it comes to the distribution of Kerberos used in the environment. The first option, if Active Directory is used as the user repository, is to use the Microsoft implementation of Kerberos, which is fully integrated into Active Directory. Alternatively, if an LDAP repository is used, either the MIT or Heimdal distribution of Kerberos can be used. SAS is agnostic to the distribution of Kerberos.

3.4 Operating System Integration with Kerberos

As stated above, SAS does not directly interact with the Kerberos Key Distribution Center (KDC) or initiate ticket requests. SAS operates through the standard GSSAPI and operating system calls. Therefore, a key prerequisite is for the operating system to be correctly integrated with your chosen user repository and Kerberos distribution. There are many different ways this can be accomplished, and SAS does not require that any specific mechanism be used. The only requirements are that a Ticket-Granting Ticket (TGT) is generated as part of the user's session initialization and that this TGT is made available via the ticket cache.

All hosts that run SAS Foundation for SAS/ACCESS to Hadoop processing must be integrated with Kerberos. If you have SAS Grid Manager licensed, all grid nodes accessing the secure Hadoop environment must be integrated with Kerberos. For SAS High-Performance Analytics Environments, all the nodes in the environment must be integrated with Kerberos, and the SSH intercommunication must use Kerberos rather than SSH keys. In addition, in the SAS High-Performance Analytics Environment, the SAS Foundation hosts must also be integrated with Kerberos because they will initially run the Hadoop LIBNAME statement.
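One common way to meet these requirements on Linux hosts is SSSD with a Kerberos authentication provider, which obtains a TGT at logon and writes it to a file-based cache that child processes such as SAS sessions inherit. The fragment below is a hypothetical sketch only; the realm, server names, and choice of identity provider must match your environment, and other integration tools (for example, pam_krb5 or vendor-specific join utilities) are equally valid.

```
[domain/hadoop.example.com]
    id_provider = ldap
    auth_provider = krb5
    ldap_uri = ldap://ldap.example.com
    krb5_realm = HADOOP.EXAMPLE.COM
    krb5_server = kdc.hadoop.example.com
    krb5_ccachedir = /tmp
```

With a setup along these lines, simply logging on to the host satisfies both requirements stated above: the TGT is generated during session initialization, and it lands in a cache location that the SAS process can read.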

3.5 Kerberos Topology

The key consideration for the integration of the operating systems with the Kerberos deployment for the secure Hadoop environment is where the different components are located. You can place the servers into different domains, and those domains might or might not reflect the Kerberos