<html><head><meta http-equiv="X-UA-Compatible" content="IE=edge" /><link rel="shortcut icon" href="../icons/favicon.ico" /><style type="text/css">.OH_CodeSnippetContainerTabLeftActive, .OH_CodeSnippetContainerTabLeft,.OH_CodeSnippetContainerTabLeftDisabled { }.OH_CodeSnippetContainerTabRightActive, .OH_CodeSnippetContainerTabRight,.OH_CodeSnippetContainerTabRightDisabled { }.OH_footer { }</style><link rel="stylesheet" type="text/css" href="../styles/branding.css" /><link rel="stylesheet" type="text/css" href="../styles/branding-en-US.css" /><script type="text/javascript" src="../scripts/branding.js"> </script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Running a DryadLINQ job on HDInsight</title><meta name="Language" content="en-us" /><meta name="Microsoft.Help.Id" content="3596a79f-0714-43b0-b49a-ea9eeccb7326" /><meta name="Description" content="The process for running a DryadLINQ application on HDInsight 3.0 is a bit complicated. This is because HDInsight does not expose all of the &quot;raw&quot; Hadoop 2.2 protocols to clients outside the cluster." /><meta name="Microsoft.Help.ContentType" content="How To" /><meta name="BrandingAware" content="true" /></head><body onload="OnLoad('cs')"><input type="hidden" id="userDataCache" class="userDataStyle" /><div class="OH_outerDiv"><div class="OH_outerContent"><table class="TitleTable"><tr><td class="OH_tdTitleColumn">Running a DryadLINQ job on HDInsight</td><td class="OH_tdRunningTitleColumn">DryadLINQ documentation</td></tr></table><div id="mainSection"><div id="mainBody"><span class="introStyle"></span><div class="introduction"><p>The process for running a DryadLINQ application on HDInsight 3.0 is a bit complicated. This is because
HDInsight does not expose all of the "raw" Hadoop 2.2 protocols to clients outside the cluster. In particular,
the only way to launch a job on a cluster is through the <a href="http://people.apache.org/~thejas/templeton_doc_latest/index.html" target="_blank">Templeton</a> REST APIs, which are wrapped by the <a href="http://hadoopsdk.codeplex.com/" title="Optional alternate text" target="_blank">Microsoft .NET SDK for Hadoop</a>. Templeton does not currently support native YARN applications like DryadLINQ, so
the only jobs that may be launched from outside the cluster are Hadoop 1 jobs (MapReduce, Hive, Pig, and so on). DryadLINQ therefore starts its YARN application indirectly, through a Hadoop Streaming job, as the steps below describe.
</p></div><h3 class="procedureSubHeading">What happens when your client program runs a job</h3><div class="subSection"><ol><li><p>The client DryadLINQ program determines all of the resources that will be needed in the job. It
checks to see if they are already present on the cluster (using a hash of the binary) and uploads any that
are not present. They are uploaded to the default cluster storage account, so that Hadoop 2.2 services like
YARN will be able to read them using wasb. (See <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/" title="Optional alternate text" target="_blank">Using Azure Blob storage with HDInsight</a> for an explanation of how wasb/hdfs interacts with Azure blob storage.)</p></li><li><p>The client serializes a description of the DryadLINQ YARN application into an XML file. This file contains
a list of the resources that the DryadLINQ Application Master needs in order to run, and a command line for the
application master. (See <a href="http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/" target="_blank">YARN concepts</a> for an explanation of application masters.) This XML file is uploaded to the cluster's
default container as <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;.xml.&lt;hash&gt;</em>.</p></li><li><p>The client calls the .NET Hadoop SDK to run a Hadoop Streaming job using the above XML file as input.</p></li><li><p>The .NET SDK calls the Templeton REST API on your cluster.</p></li><li><p>The Templeton REST server launches a MapReduce job called <span class="command">TempletonControllerJob</span> on
your cluster.</p></li><li><p>The controller job launches a second MapReduce job called <span class="command">streamjob&lt;someNumber&gt;.jar</span>
on your cluster.</p></li><li><p>The streaming job reads the XML serialized above, and launches the DryadLINQ YARN application master, which
then actually runs your program. The title of the DryadLINQ application is <span class="command">DryadLINQ.App</span> by
default, but you can set it to something more friendly using the <span class="code">JobFriendlyName</span> property
of the <span class="code">DryadLinqContext</span>.</p></li><li><p>The streaming job writes the YARN application ID for the DryadLINQ application back to the cluster's default
container as <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;/part.00000</em>.</p></li><li><p>The DryadLINQ application writes heartbeat, logging and status information into a container called
<em>dryad-jobs/&lt;yarn-application-id&gt;</em> in the cluster's default storage account.</p></li><li><p>The client code reads the application ID from <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;/part.00000</em>
and then monitors <em>dryad-jobs/&lt;yarn-application-id&gt;</em> to get updates on the progress of the job.
This is also where the job browser gets its information about the job.</p></li></ol><p>If you <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/" target="_blank">Enable Remote Desktop on your HDInsight cluster</a>, and click on the <span class="command">Hadoop YARN Status</span> shortcut link on the desktop, you can see all these
jobs running.</p><p>Unfortunately, because of the current configuration of HDInsight clusters, all DryadLINQ logs are archived immediately
when the application exits, and you will get a "Failed redirect for container" error if you try to navigate to the logs of
a completed application. We have tried to make errors in user application code visible in the
<a href="91822db3-8a00-4307-ad8a-595c94f449b0.htm">DryadLINQ Job Browser</a>, to avoid the need to consult
the logs. If you do need to consult the logs, connect to your HDInsight cluster using Remote Desktop, click the
<span class="command">Hadoop Command Line</span> link on the desktop, and then run a command similar to
<span class="command">yarn logs -applicationId &lt;APPLICATION_ID&gt; -appOwner &lt;CLUSTER_USER_NAME&gt;</span> where you replace
&lt;APPLICATION_ID&gt; and &lt;CLUSTER_USER_NAME&gt; with values appropriate to your job and cluster configuration.
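</p><p>As a rough illustration of the names involved, the following Python sketch assembles the blob paths from the steps above and the <span class="command">yarn logs</span> command line. The helper functions and the sample user name, job GUID and application ID are hypothetical and are not part of DryadLINQ or the .NET SDK for Hadoop; only the path layouts and the command flags are taken from this page.</p>
<pre>
```python
# Hypothetical helpers, not part of DryadLINQ or the .NET SDK for Hadoop.
# They only assemble the blob names and the log command described on this page.


def application_id_blob(user_name, job_guid):
    """Blob in the default container where the streaming job writes the YARN application ID."""
    return f"user/{user_name}/staging/{job_guid}/part.00000"


def status_container(application_id):
    """Container in the default storage account that holds heartbeat, logging and status data."""
    return f"dryad-jobs/{application_id}"


def yarn_logs_command(application_id, cluster_user_name):
    """Argument list for the 'yarn logs' command run from the Hadoop Command Line."""
    return ["yarn", "logs", "-applicationId", application_id, "-appOwner", cluster_user_name]


if __name__ == "__main__":
    # Placeholder values; substitute the ones for your own job and cluster.
    app_id = "application_0000000000000_0001"
    print(application_id_blob("alice", "00000000-0000-0000-0000-000000000000"))
    print(status_container(app_id))
    print(" ".join(yarn_logs_command(app_id, "alice")))
```
</pre>
<p>Joining the returned argument list with spaces gives a command line that can be pasted into the Hadoop Command Line window.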
</p><p><span class="media"><img alt="Dryad on Azure Architecture" src="../media/Dryad on Azure Architecture.png" /></span></p></div></div></div></div></div><div id="OH_footer" class="OH_footer" /></body></html>