TensorFlow node placement (device assignment of nodes) algorithm: simple_placer

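The device assignment algorithm currently used in TensorFlow v0.9 is the simple placer algorithm, which is simpler to implement than the cost model algorithm described in the white paper. The simple placer prefers the /gpu:0 device and does not support multi-GPU assignment.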

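The cost model mentioned in the white paper can balance device assignment against device resource cost and data transfer cost. It is partially implemented in v0.9 but not yet available for use; see core/graph/costmodel.cc.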
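To make the trade-off between compute cost and data transfer cost concrete, here is a minimal sketch of a greedy earliest-finish placement. It only illustrates the idea and is not the code in core/graph/costmodel.cc. CostNode, GreedyPlace, and all cost values are hypothetical, and the sketch assumes the nodes are listed in topological order.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical node description; none of these fields come from TensorFlow.
struct CostNode {
  std::vector<int> inputs;           // indices of producer nodes
  std::vector<double> compute_cost;  // assumed run time on each device
  double output_transfer_cost;       // assumed cost of moving the output to another device
};

// Places each node on the device where it would finish earliest, counting
// both the device's busy time and the transfer cost of remote inputs.
std::vector<int> GreedyPlace(const std::vector<CostNode>& nodes, int num_devices) {
  const int n = static_cast<int>(nodes.size());
  std::vector<int> placement(n, -1);
  std::vector<double> finish(n, 0.0);
  std::vector<double> device_free(num_devices, 0.0);
  for (int i = 0; i < n; ++i) {
    double best = std::numeric_limits<double>::infinity();
    int best_dev = -1;
    for (int d = 0; d < num_devices; ++d) {
      double ready = device_free[d];
      for (int in : nodes[i].inputs) {
        // An input on another device arrives later by its transfer cost.
        double arrive = finish[in] +
                        (placement[in] == d ? 0.0 : nodes[in].output_transfer_cost);
        ready = std::max(ready, arrive);
      }
      double done = ready + nodes[i].compute_cost[d];
      if (done < best) {  // ties keep the lower device index
        best = done;
        best_dev = d;
      }
    }
    placement[i] = best_dev;
    finish[i] = best;
    device_free[best_dev] = best;
  }
  return placement;
}
```

With a large enough transfer cost, a consumer stays on its producer's device even when another device would compute it faster. The simple placer makes no such trade-off.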

The simple_placer implementation lives in core/common_runtime/simple_placer.cc, which contains the core device-assignment functionality.

A fragment of the test file core/common_runtime/simple_placer_test.cc follows:

 ////////////////////////////////////////////////////////////////////////////////

 //

 // A SimplePlacerTest method has three phases:

 //

 // 1. Build a TensorFlow graph, with no (or partial) device assignments.

 // 2. Attempt to compute a placement using the SimplePlacer.

 // 3. EITHER: test that the constraints implied by the graph are respected;

 //    or that an appropriate error was reported.

 //

 ////////////////////////////////////////////////////////////////////////////////

 class SimplePlacerTest : public ::testing::Test {

  protected:

   SimplePlacerTest() {

     // Build a set of 10 GPU and 10 CPU devices.

     // NOTE: this->local_devices_ owns the device objects;

     // this->devices_ contains borrowed pointers to the device

     // objects.

     for (int i = 0; i < 10; ++i) {    // adds 10 fake CPU devices and 10 fake GPU devices

       local_devices_.emplace_back(FakeDevice::MakeCPU(

           strings::StrCat("/job:a/replica:0/task:0/cpu:", i)));

       devices_.AddDevice(local_devices_.back().get());

       // Insert the GPUs in reverse order.

       local_devices_.emplace_back(FakeDevice::MakeGPU(

           strings::StrCat("/job:a/replica:0/task:0/gpu:", 9 - i)));

       devices_.AddDevice(local_devices_.back().get());

     }

   }

   ...

 }

 ...

 // Test that a graph with no constraints will successfully assign nodes to the

 // "best available" device (i.e. prefer GPU over CPU).

 TEST_F(SimplePlacerTest, TestNoConstraints) {

   Graph g(OpRegistry::Global());

   {  // Scope for temporary variables used to construct g; the graph structure is built with GraphDefBuilder.

     GraphDefBuilder b(GraphDefBuilder::kFailImmediately);

     Node* input = ops::SourceOp("TestInput", b.opts().WithName("in"));

     ops::UnaryOp("TestRelu", ops::NodeOut(input, 0), b.opts().WithName("n1"));

     ops::UnaryOp("TestRelu", ops::NodeOut(input, 0), b.opts().WithName("n2"));

     TF_EXPECT_OK(BuildGraph(b, &g));   // BuildGraph writes the graph held by the GraphDefBuilder into the Graph

   }

   TF_EXPECT_OK(Place(&g));   // Place assigns the nodes of the graph to devices from the device list

   EXPECT_DEVICE_TYPE(g, "in", DEVICE_CPU);   // Expected: "in" on CPU, n1 and n2 on GPU, so GPU has higher priority than CPU

   EXPECT_DEVICE_TYPE(g, "n1", DEVICE_GPU);

   EXPECT_DEVICE_TYPE(g, "n2", DEVICE_GPU);

 }
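The preference checked by TestNoConstraints (GPU over CPU when a node supports both) can be pictured as sorting the candidate devices and taking the first one. The following is a minimal sketch under that assumption. ToyDevice, TypePriority, and PickDevice are hypothetical names, not TensorFlow APIs, and the lexicographic name ordering is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a device; not the TensorFlow Device class.
struct ToyDevice {
  std::string name;  // e.g. "/gpu:0"
  std::string type;  // "GPU" or "CPU"
};

// Assumed priority order: GPU before every other device type.
int TypePriority(const std::string& type) { return type == "GPU" ? 0 : 1; }

// Keeps only devices whose type the node supports, sorts them by
// (type priority, name), and returns the first name ("" if none match).
std::string PickDevice(const std::vector<ToyDevice>& all,
                       const std::vector<std::string>& supported_types) {
  std::vector<ToyDevice> candidates;
  for (const ToyDevice& d : all) {
    if (std::find(supported_types.begin(), supported_types.end(), d.type) !=
        supported_types.end()) {
      candidates.push_back(d);
    }
  }
  std::sort(candidates.begin(), candidates.end(),
            [](const ToyDevice& a, const ToyDevice& b) {
              const int pa = TypePriority(a.type);
              const int pb = TypePriority(b.type);
              if (pa != pb) return pa < pb;
              return a.name < b.name;  // deterministic tie-break
            });
  return candidates.empty() ? std::string() : candidates.front().name;
}
```

In this toy model a node that supports both device types lands on the first GPU, while a node that supports only CPU lands on a CPU. That matches the expectations in the test, presumably because the "in" node's op has only a CPU kernel.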

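In the test above, BuildGraph writes the graph structure defined in the GraphDefBuilder object into the Graph, and Place assigns the nodes of the graph to devices from the device list. The core of the device assignment algorithm is in SimplePlacer::Run.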

  // Builds the given graph, and (if successful) indexes the node

   // names for use in placement, and later lookup.

   Status BuildGraph(const GraphDefBuilder& builder, Graph* out_graph) {

     TF_RETURN_IF_ERROR(builder.ToGraph(out_graph));

     nodes_by_name_.clear();

     for (Node* node : out_graph->nodes()) {

       nodes_by_name_[node->name()] = node->id();

     }

     return Status::OK();

   }

   // Invokes the SimplePlacer on "graph". If no DeviceSet is specified, the

   // placement will use the default DeviceSet (of 10 CPU and 10 GPU devices).

   //

   // REQUIRES: "*graph" was produced by the most recent call to BuildGraph.

   Status Place(Graph* graph, DeviceSet* devices, SessionOptions* options) {

     SimplePlacer placer(graph, devices, options);

     return placer.Run();

   }

SimplePlacer::Run() is defined in core/common_runtime/simple_placer.cc. Its implementation has four steps:

Steps 1 and 2: iterate over the nodes of the graph and add each one to the ColocationGraph object (source and sink nodes excluded).

 // 1. First add all of the nodes. Note that steps (1) and (2)

 // requires two passes over the nodes because the graph (and hence

 // the constraints) may not be acyclic.  (So the graph may contain cycles?)

 for (Node* node : graph_->nodes()) {

     // Skip the source and sink nodes.

     if (!node->IsOp()) { continue; }

     status = colocation_graph.AddNode(*node);

     if (!status.ok()) return AttachDef(status, node->def());

   }

 // 2. Enumerate the constraint edges, and use them to update the disjoint node set.         // disjoint set (union-find): a tree-based data structure for disjoint sets of nodes

 ...

 ColocationGraph maintains the connected components of a colocation constraint graph, and uses this information to assign a satisfying device placement to the nodes of the graph.

 The implementation uses the union-find algorithm to maintain the connected components efficiently and incrementally as edges (implied by ColocationGraph::ColocateNodes() invocations) are added.

 Reference: the Wikipedia article on the disjoint-set (union-find) data structure.
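For readers unfamiliar with union-find, the following is a minimal sketch of the data structure over integer node ids. It is a generic textbook version with path halving and union by rank, not the ColocationGraph implementation, and ToyDisjointSet is a hypothetical name.

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Toy union-find over node ids 0..n-1.
class ToyDisjointSet {
 public:
  explicit ToyDisjointSet(int n) : parent_(n), rank_(n, 0) {
    std::iota(parent_.begin(), parent_.end(), 0);  // each node starts as its own root
  }

  // Returns the representative (root) of the set containing x.
  int Find(int x) {
    while (parent_[x] != x) {
      parent_[x] = parent_[parent_[x]];  // path halving keeps trees shallow
      x = parent_[x];
    }
    return x;
  }

  // Merges the sets containing a and b (one colocation constraint edge).
  void Union(int a, int b) {
    int ra = Find(a);
    int rb = Find(b);
    if (ra == rb) return;
    if (rank_[ra] < rank_[rb]) std::swap(ra, rb);
    parent_[rb] = ra;  // attach the shallower tree under the deeper one
    if (rank_[ra] == rank_[rb]) ++rank_[ra];
  }

  bool SameSet(int a, int b) { return Find(a) == Find(b); }

 private:
  std::vector<int> parent_;
  std::vector<int> rank_;
};
```

Each Union call corresponds to one colocation constraint edge. Nodes that end up in the same set must share a device, so a placer only needs to find one device that satisfies every member of the set.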

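Step 3: as the figure and code below show, the source and sink nodes are placed on the CPU, and nodes that already have an assigned device are not reassigned. Two heuristics guide the assignment; see Heuristic A and Heuristic B.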

   // 3. For each node, assign a device based on the constraints in the disjoint node set.

   std::vector<Device*> devices;

   std::vector<Node*> second_pass;

   for (Node* node : graph_->nodes()) {

     // Skip the source and sink nodes.

     if (!node->IsOp()) {

       continue;

     }

     // Skip nodes that already have an assigned name.

     if (!node->assigned_device_name().empty()) {

       continue;

     }

     // Heuristic A: prefer to place "generators" with their only

     // consumers.

     //

     // If this is a node with no inputs and a single (non-ref)

     // consumer, we save this for a second pass, so that the

     // consumer's placement is chosen.

     if (IsGeneratorNode(node)) {    // generator node: no input, one output, not a reference-type node

       second_pass.push_back(node);

       continue;

     }

     status = colocation_graph.GetDevicesForNode(node, &devices);

     ...

     // Returns the first device in sorted devices list so we will always

     // choose the same device.

     //

     // TODO(vrv): Factor this assignment out into a pluggable

     // algorithm, so that SimplePlacer is responsible for enforcing

     // preconditions and we can experiment with other algorithms when

     // given a choice of devices. Once we have a better idea of the

     // types of heuristics we want to use and the information needed

     // to perform good placement we can add an interface for this.

     string assigned_device = devices[0]->name();

     // Heuristic B: If the node only operates on metadata, not data,

     // then it is desirable to place that metadata node with its

     // input.

     if (IsMetadataNode(node)) {

       // Make sure that the input device type is in the list of supported

       // device types for this node.

       const Node* input = (*node->in_edges().begin())->src();

       // TODO(vrv): if the input is empty, consider postponing this

       // node's assignment to the second pass, so that we handle the

       // case where a metadata node's input comes from a backedge

       // of a loop.

       const string& input_device_name = input->assigned_device_name();

       if (CanAssignToDevice(input_device_name, devices)) {

         assigned_device = input_device_name;

       }

     }

     AssignAndLog(assigned_device, node);   // Assigns assigned_device to the node. Generator nodes matching Heuristic A are not assigned in step 3; that happens in step 4.

   }

 bool IsGeneratorNode(const Node* node) {

   return node->num_inputs() == 0 && node->num_outputs() == 1 && node->out_edges().size() == 1 && !IsRefType(node->output_type(0));

 }

 bool IsMetadataNode(const Node* node) {

   const string& node_type = node->type_string();

   return (node_type == "Size" || node_type == "Shape" || node_type == "Rank");

 }

Step 4: assign devices to the generator nodes that step 3 deferred.

// 4. Perform a second pass assignment for those nodes explicitly skipped during the first pass.

...
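Since the step 4 code is elided above, the sketch below shows how the two passes could fit together on a toy graph. It is an illustration under stated assumptions, not the SimplePlacer code. ToyNode, IsMetadata, CanAssign, Consumers, and PlaceTwoPass are hypothetical, nodes are assumed to be listed in topological order, each node's candidate list is assumed to be non-empty and already sorted by preference, and the reference-type check of Heuristic A is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Toy node; every field is an assumption made for this illustration.
struct ToyNode {
  std::string op;                       // e.g. "Const", "Relu", "Shape"
  std::vector<int> inputs;              // indices of producer nodes
  std::vector<std::string> candidates;  // supported devices, best first
  std::string assigned;                 // empty until placed
};

bool IsMetadata(const ToyNode& n) {
  return n.op == "Size" || n.op == "Shape" || n.op == "Rank";
}

bool CanAssign(const std::string& device, const ToyNode& n) {
  return std::find(n.candidates.begin(), n.candidates.end(), device) !=
         n.candidates.end();
}

// Returns the indices of all nodes that read the output of node `id`.
std::vector<int> Consumers(const std::vector<ToyNode>& g, int id) {
  std::vector<int> out;
  for (int i = 0; i < static_cast<int>(g.size()); ++i) {
    const std::vector<int>& in = g[i].inputs;
    if (std::find(in.begin(), in.end(), id) != in.end()) out.push_back(i);
  }
  return out;
}

void PlaceTwoPass(std::vector<ToyNode>& g) {
  std::vector<int> second_pass;
  for (int i = 0; i < static_cast<int>(g.size()); ++i) {
    ToyNode& n = g[i];
    assert(!n.candidates.empty());
    if (!n.assigned.empty()) continue;  // keep pre-assigned devices
    // Heuristic A: defer a generator (no inputs, exactly one consumer).
    if (n.inputs.empty() && Consumers(g, i).size() == 1) {
      second_pass.push_back(i);
      continue;
    }
    std::string device = n.candidates.front();
    // Heuristic B: a metadata op follows its first input when it can.
    if (IsMetadata(n) && !n.inputs.empty()) {
      const std::string& input_device = g[n.inputs[0]].assigned;
      if (CanAssign(input_device, n)) device = input_device;
    }
    n.assigned = device;
  }
  // Second pass: each generator takes its consumer's device if supported.
  for (int i : second_pass) {
    ToyNode& n = g[i];
    std::string device = n.candidates.front();
    const std::string& consumer_device = g[Consumers(g, i)[0]].assigned;
    if (CanAssign(consumer_device, n)) device = consumer_device;
    n.assigned = device;
  }
}
```

Deferring generators lets a constant end up on the same device as the op that reads it, which avoids a cross-device copy the default choice could otherwise introduce.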

Partial references:

http://bettercstomorrow.com/2016/07/14/distributed-tensorflow-internal-architecture-summary/

http://bettercstomorrow.com/2016/07/06/distributed-tensorflow-internal-architecture-6/ (in Korean)

"TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems"
