<html><head><meta http-equiv="X-UA-Compatible" content="IE=edge" /><link rel="shortcut icon" href="../icons/favicon.ico" /><style type="text/css">.OH_CodeSnippetContainerTabLeftActive, .OH_CodeSnippetContainerTabLeft,.OH_CodeSnippetContainerTabLeftDisabled { }.OH_CodeSnippetContainerTabRightActive, .OH_CodeSnippetContainerTabRight,.OH_CodeSnippetContainerTabRightDisabled { }.OH_footer { }</style><link rel="stylesheet" type="text/css" href="../styles/branding.css" /><link rel="stylesheet" type="text/css" href="../styles/branding-en-US.css" /><script type="text/javascript" src="../scripts/branding.js"> </script><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Running a DryadLINQ job on HDInsight</title><meta name="Language" content="en-us" /><meta name="Microsoft.Help.Id" content="3596a79f-0714-43b0-b49a-ea9eeccb7326" /><meta name="Description" content="The process for running a DryadLINQ application on HDInsight 3.0 is a bit complicated. This is because HDInsight does not expose all of the &quot;raw&quot; Hadoop 2.2 protocols to clients outside the cluster." /><meta name="Microsoft.Help.ContentType" content="How To" /><meta name="BrandingAware" content="true" /></head><body onload="OnLoad('cs')"><input type="hidden" id="userDataCache" class="userDataStyle" /><div class="OH_outerDiv"><div class="OH_outerContent"><table class="TitleTable"><tr><td class="OH_tdTitleColumn">Running a DryadLINQ job on HDInsight</td><td class="OH_tdRunningTitleColumn">DryadLINQ documentation</td></tr></table><div id="mainSection"><div id="mainBody"><span class="introStyle"></span><div class="introduction"><p>The process for running a DryadLINQ application on HDInsight 3.0 is a bit complicated. This is because
HDInsight does not expose all of the "raw" Hadoop 2.2 protocols to clients outside the cluster. In particular,
the only way to launch a job on a cluster is through the <a href="http://people.apache.org/~thejas/templeton_doc_latest/index.html" target="_blank">Templeton</a> REST APIs, which are wrapped by the <a href="http://hadoopsdk.codeplex.com/" title="Microsoft .NET SDK for Hadoop" target="_blank">Microsoft .NET SDK for Hadoop</a>. Templeton does not currently support native YARN applications like DryadLINQ, so
the only jobs that can be launched from outside the cluster are Hadoop 1 jobs (MapReduce, Hive, Pig, and so on).
</p></div><h3 class="procedureSubHeading">What happens when your client program runs a job</h3><div class="subSection"><ol><li><p>The client DryadLINQ program determines all of the resources that will be needed in the job. It
checks to see if they are already present on the cluster (using a hash of the binary) and uploads any that
are not present. They are uploaded to the default cluster storage account, so that Hadoop 2.2 services like
YARN will be able to read them using wasb. (See <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/" title="Using Azure Blob storage with HDInsight" target="_blank">Using Azure Blob storage with HDInsight</a> for an explanation of how wasb and HDFS interact with Azure Blob storage.)</p></li><li><p>The client serializes a description of the DryadLINQ YARN application into an XML file. This file contains
a list of the resources that the DryadLINQ Application Master needs in order to run, and a command line for the
application master. (See <a href="http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/" target="_blank">YARN concepts</a> for an explanation of application masters.) This XML file is uploaded to the cluster's
default container as <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;.xml.&lt;hash&gt;</em>.</p></li><li><p>The client calls the .NET Hadoop SDK to run a Hadoop Streaming job using the above XML file as input.</p></li><li><p>The .NET SDK calls the Templeton REST API on your cluster.</p></li><li><p>The Templeton REST server launches a MapReduce job called <span class="command">TempletonControllerJob</span> on
your cluster.</p></li><li><p>The controller job launches a second MapReduce job called <span class="command">streamjob&lt;someNumber&gt;.jar</span>
on your cluster.</p></li><li><p>The streaming job reads the XML serialized above, and launches the DryadLINQ YARN application master, which
then actually runs your program. The title of the DryadLINQ application is <span class="command">DryadLINQ.App</span> by
default, but you can set it to something more friendly using the <span class="code">JobFriendlyName</span> property
of the <span class="code">DryadLinqContext</span>.</p></li><li><p>The streaming job writes the YARN application ID for the DryadLINQ application back to the cluster's default
container as <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;/part.00000</em>.</p></li><li><p>The DryadLINQ application writes heartbeat, logging, and status information into a container called
<em>dryad-jobs/&lt;yarn-application-id&gt;</em> in the cluster's default storage account.</p></li><li><p>The client code reads the application ID from <em>user/&lt;yourUserName&gt;/staging/&lt;jobGuid&gt;/part.00000</em>
and then monitors <em>dryad-jobs/&lt;yarn-application-id&gt;</em> to get updates on the progress of the job.
This is also where the job browser gets its information about the job.</p></li></ol><p>If you <a href="http://azure.microsoft.com/en-us/documentation/articles/hdinsight-administer-use-management-portal/" target="_blank">enable Remote Desktop on your HDInsight cluster</a> and click the <span class="command">Hadoop YARN Status</span> shortcut on the desktop, you can see all of these
jobs running.</p><p>Because of the current configuration of HDInsight clusters, all DryadLINQ logs are archived as soon as
the application exits, and you will get a "Failed redirect for container" error if you try to navigate to the logs of
a completed application. We have tried to report errors in user application code back so that they are visible in the
<a href="91822db3-8a00-4307-ad8a-595c94f449b0.htm">DryadLINQ Job Browser</a>, to avoid the need to consult
the logs. If you do need to consult the logs, connect to your HDInsight cluster using Remote Desktop, click the
<span class="command">Hadoop Command Line</span> shortcut on the desktop, and run a command similar to
<span class="command">yarn logs -applicationId &lt;APPLICATION_ID&gt; -appOwner &lt;CLUSTER_USER_NAME&gt;</span>, replacing
&lt;APPLICATION_ID&gt; and &lt;CLUSTER_USER_NAME&gt; with values appropriate to your job and cluster configuration.
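</p><p>The blob names used in the numbered steps above, and the log command, can be summarized in a short sketch. This is only an illustration written in Python for brevity. The helper names and all sample values (user name, job GUID, hash, application ID) are hypothetical and are not part of DryadLINQ or the .NET SDK for Hadoop.</p>
<pre>
```python
# Illustrative sketch only: these helpers are hypothetical and are not part of
# DryadLINQ or the .NET SDK for Hadoop. They restate the naming conventions
# described in the steps above.

def description_blob(user_name, job_guid, content_hash):
    """Blob that holds the serialized XML description of the YARN application."""
    return f"user/{user_name}/staging/{job_guid}.xml.{content_hash}"


def application_id_blob(user_name, job_guid):
    """Blob to which the streaming job writes the YARN application ID."""
    return f"user/{user_name}/staging/{job_guid}/part.00000"


def status_location(application_id):
    """Location of heartbeat, logging, and status information for a job."""
    return f"dryad-jobs/{application_id}"


def yarn_logs_command(application_id, cluster_user_name):
    """Argument list for the log retrieval command run on the cluster."""
    return ["yarn", "logs",
            "-applicationId", application_id,
            "-appOwner", cluster_user_name]


if __name__ == "__main__":
    # All values below are made-up placeholders.
    user, guid, digest = "alice", "0f8fad5b-d9cb-469f-a165-70867728950e", "abc123"
    app_id = "application_1400000000000_0001"
    print(description_blob(user, guid, digest))
    print(application_id_blob(user, guid))
    print(status_location(app_id))
    print(" ".join(yarn_logs_command(app_id, user)))
```
</pre>
<p>The blob names are relative to the cluster's default container, and the joined argument list is the command you would type at the Hadoop Command Line prompt on the cluster.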
</p><p><span class="media"><img alt="Dryad on Azure Architecture" src="../media/Dryad on Azure Architecture.png" /></span></p></div></div></div></div></div><div id="OH_footer" class="OH_footer" /></body></html>