<?xml version="1.0" encoding="utf-8"?>
<?xml-model href="rfc7991bis.rnc"?>  

<!DOCTYPE rfc [
  <!ENTITY nbsp    "&#160;">
  <!ENTITY zwsp   "&#8203;">
  <!ENTITY nbhy   "&#8209;">
  <!ENTITY wj     "&#8288;">
]>

<rfc
  xmlns:xi="http://www.w3.org/2001/XInclude"
  category="std"
  docName="draft-song-cain-header-00"
  ipr="trust200902"
  submissionType="IETF"
  xml:lang="en"
  version="3">


  <front>
    <title abbrev="CAIN Header Compression">Network Header Compression for Converged AI Network</title>
 
    <seriesInfo name="Internet-Draft" value="draft-song-cain-header-00"/>
   
    <author fullname="Haoyu Song" initials="H." role="editor" surname="Song">
      <organization>Futurewei Technologies</organization>
      <address>
        <postal>
          <country>US</country>
        </postal>        
        <email>hsong@futurewei.com</email>  
      </address>
    </author>
	
	<author fullname="Keyi Zhu" initials="K." surname="Zhu">
      <organization>Huawei Technologies</organization>
      <address>
		<postal>
          <country>CN</country>
        </postal>  
        <email>zhukeyi@huawei.com</email>
      </address>
    </author>

	<author fullname="Jian Song" initials="J." surname="Song">
      <organization>China Mobile</organization>
      <address>
		<postal>
          <country>CN</country>
        </postal>  
        <email>songjianyjy@chinamobile.com</email>
      </address>
    </author>

    <area>General</area>
    <workgroup>Internet Engineering Task Force</workgroup>
    
    

    <abstract>
      <t>We envision that the scale-up, scale-out, and scale-across networks for AI computing will eventually converge.
	     This draft describes a scheme for L3 packet header compression in converged AI networks,
	     where IPv6 is assumed to be the L3 protocol and a unified fabric supports all kinds of traffic.
         The header size can be reduced to 8 octets for packets transferred within a single super-node, an 80% overhead saving.
		 The document discusses the motivation, requirements,
		 benefits, and feasibility in addition to the header format proposal.</t>
    </abstract>
 
  </front>

  <middle>
    
    <section>
      <name>Introduction</name>
	  
      <t>The AI scale-up network is shifting from proprietary solutions to standard Ethernet, driven by
	  several forces including breaking vendor lock-in, cost structure, and operational simplicity.
	  Although in the mainstream the scale-up and scale-out networks remain physically and semantically separated, there is no fundamental barrier
	  preventing the two from being bridged together (i.e., allowing direct packet forwarding between the two domains) or from
	  sharing physical interfaces (i.e., mixing the traffic). The boundary is becoming blurry. Recent research <xref target="hot25"/> has proposed that,
	  to support more flexible routing and load balancing, it is preferable to unify the scale-up domain and the scale-out domain.
	  There are industry practices on the horizon as well. For example, Intel's Gaudi 3 <xref target="gaudi"/> provides only 24 unified RoCEv2 ports,
	  removing the separation of the two domains altogether; Huawei's UB-Mesh <xref target="ub"/> uses a unified bus to provide hierarchical
	  interconnections extendable to multiple levels without distinguishing the two domains. </t>
	  
	  <t>Meanwhile, the scale-across network is becoming the third pillar of AI infrastructure, extending the scale-out network across multiple AI data centers.
	  AI infrastructure is undergoing a paradigm shift from super-node as a computer, to datacenter as a computer, to multi-datacenter as a computer.
	  In the converged AI network, packets can move between any two AI accelerator nodes regardless of their locations. It is desirable to have a common L3 protocol
	  for unified routing and forwarding functions within and among the domains. </t>

	  <t>On the other hand, the accelerator affinity in the conventional scale-up domain allows data transactions with more efficient memory semantics (i.e.,
	  the nodes in the same domain can share a unified memory space), while the scale-out domain typically resorts to message semantics for data movement (e.g., RDMA). The two
	  domains can use very different protocol stacks. For example, the scale-up domain may use L2 switching only while the scale-out domain requires L3 routing;
	  even with a unified Ethernet-based L2, the L4 transport protocols diverge again. To unify the two domains, and to further extend to the scale-across
	  domain in the future, we need to introduce a unified L3 network protocol on top of the already unified Ethernet-based L2 link protocol, with
	  the coexistence of potentially multiple L4+ protocols. This is critical for enabling a unified AI fabric
	  with the benefits of an open ecosystem, low cost, and simplified operation. </t>
		
	  <t>While IPv6 provides enough scalability and extensibility to support the converged AI network, its header overhead is too large for certain
	  communication scenarios. For example, memory-semantic traffic (i.e., LD/ST) usually has minimum-sized payloads;
	  a large number of packets for signaling (e.g., ACK, CNP, barrier, trimmed packets) and for the network control/management plane are also small.
	  The base header of IPv6 is 40 bytes, and when extension headers are needed (e.g., SRv6), the size is even greater. The L3 header therefore poses a significant overhead for such packets.
	  Given that bandwidth in AI networks is always a precious resource and a performance bottleneck,
	  it is critical to reduce the network header overhead while maintaining the benefits of scalability and extensibility.
	  Therefore, we need an effective header compression scheme that is suitable for the converged AI network and retains compatibility
	  with standard IPv6 in the scale-across domain, which shares the public WAN.</t>
	
	  <t>This document describes the Converged AI Network (CAIN) L3 header format. It is an IPv6 header compression scheme based on the
	  Short Hierarchical IP Address (SHIP) <xref target="I-D.song-ship-edge"/>. Within an AI DCN, it supports multiple hierarchical levels.
	  The simplest two-level form distinguishes the scale-up and scale-out domains. It can also support more levels as described in UB-Mesh <xref target="ub"/>,
	  and other hierarchical topologies (e.g., rack, pod, super-pod, etc.). To support scale-across, at the DCN gateway the CAIN header is translated into the standard IPv6 header format for WAN compatibility. </t>
	
      <section>
        <name>Requirements Language</name>
        <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
          "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
          RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
          interpreted as described in BCP 14 <xref target="RFC2119"/>
          <xref target="RFC8174"/> when, and only when, they appear in
          all capitals, as shown here.</t>
      </section>

    </section>
    
    <section>
      <name>Related Work</name>
	  
      <t>The related works and their limitations are summarized as follows.</t>
      
      <ol>
        <li>AFH: Broadcom's scale-up Ethernet framework specifies a compact AI Fabric Header (AFH) <xref target="afh"/>. However,
			it encodes the node address information in the MAC header and works only in the L2 scale-up domain, making it unsuitable as the CAIN header.
		</li>
		<li>SUNH: The Internet draft <xref target="I-D.herbert-sunh" /> proposes an L3-based scale-up network header that supports L3 routing.
			However, it is designed with a fixed address size and for the scale-up network only, so its flexibility and extensibility are limited.
		</li>
		<li>IPHC and SCHC: IPv6 header compression schemes have been specified for particular low-power IoT networks
			such as 6LoWPAN <xref target="RFC6282"/> and LPWAN <xref target="RFC8724"/>. These networks feature low data rates and are insensitive to latency.
			However, due to the low-power constraint, they are extremely sensitive to bandwidth efficiency.
			Therefore, they adopt context-based compression schemes which, while needing extra storage and computation,
			can reduce the header overhead to the utmost extent. In contrast, AI networks require high bandwidth, low latency, and
			low processing complexity, which renders these schemes unsuitable.
		</li>
      </ol>
    </section>   
    
	<section>
	  <name>CAIN Header Format</name>
	  
	  <t>The proposed CAIN Header format is as follows.</t>
	  
 <artwork><![CDATA[
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Traffic Class |HopLim |              Flow Label               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Next Header   | SAL   | DAL   |  SA + DA (variable length)    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork>
	
	  
	  <t>The Traffic Class, Flow Label, and Next Header fields are inherited from IPv6 without any change.
		The Hop Limit field is reduced to 4 bits to support up to 15 hops, which is sufficient because the number
		of hops in an AI network is typically small (e.g., a 3-layer Clos network has at most 5 hops). </t>
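      <t>As an illustration, the minimal 8-byte header (two 1-byte addresses) can be packed and parsed as follows. This is a sketch, not part of the specification: the bit layout (Traffic Class 8 bits, Hop Limit 4, Flow Label 20, Next Header 8, SAL/DAL 4 each) is taken from the figure above, and the function names are ours.</t>
      <sourcecode type="python"><![CDATA[
```python
import struct

def pack_cain_min(tc, hop_lim, flow_label, next_hdr, sa, da):
    """Pack a minimal 8-byte CAIN header (SAL=1, DAL=1: 1-byte addresses)."""
    assert 0 <= hop_lim <= 15 and 0 <= flow_label < (1 << 20)
    word1 = (tc << 24) | (hop_lim << 20) | flow_label
    sal, dal = 1, 1                       # "0001": 8-bit address length
    return struct.pack("!IBBBB", word1, next_hdr, (sal << 4) | dal, sa, da)

def parse_cain_min(buf):
    word1, next_hdr, lens, sa, da = struct.unpack("!IBBBB", buf)
    return {"tc": word1 >> 24, "hop_lim": (word1 >> 20) & 0xF,
            "flow_label": word1 & 0xFFFFF, "next_hdr": next_hdr,
            "sal": lens >> 4, "dal": lens & 0xF, "sa": sa, "da": da}

hdr = pack_cain_min(tc=0, hop_lim=5, flow_label=0xABCDE, next_hdr=17, sa=7, da=42)
assert len(hdr) == 8                      # the whole L3 header fits in 8 bytes
assert parse_cain_min(hdr)["flow_label"] == 0xABCDE
```
]]></sourcecode>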
		
	  <t>
         In the CAIN header, no Version field is included; the protocol is identified
          by the EtherType value at the L2 layer. No Payload Length
          field is included; the payload length is derived from the
          L2 frame length minus the CAIN header length. The header
          length is deterministically computed as:
        </t>

 <artwork type="ascii-art"><![CDATA[
  header_length = ceil4(6 + SAL_bytes + DAL_bytes)

  where ceil4(x) = (x + 3) AND NOT(3)
        SAL_bytes = (SAL == 0) ? 16 : SAL
        DAL_bytes = (DAL == 0) ? 16 : DAL
]]></artwork>

	  <t>The 4-bit SAL and DAL fields indicate the lengths of the source address (SA) and the destination address (DA) in 8-bit steps.
		 For example, "0001" stands for 8 bits and "0010" for 16 bits. As a special case, "0000" stands for 128 bits, which means
		 the corresponding address is a full 128-bit IPv6 address.
		 Such an address allocation scheme allows the lowest-level scale-up network to have up to 256 accelerator nodes, well aligned
		 with current and future network scales. In that case, the CAIN header is only 8 bytes.
		 (Note: a non-linear code-to-length mapping table can be specified to provide a more flexible address length hierarchy. TBD.)</t>
	  
		
	  <t>The routing, forwarding, and other control plane provisions based on the CAIN header are described in <xref target="I-D.song-ship-edge"/>.
	    When accelerator nodes
	    in the same scale-up network communicate, they always use the shortest addresses to keep the header overhead at a minimum.
		When a packet crosses a level boundary, the boundary router is responsible for augmenting the addresses in the packet with a prefix, or pruning a prefix from them.
		At any location, the packet only carries the minimum address bits needed for unique source and destination identification.
		Specifically, if a node sends a packet to another data center, at the data center boundary the packet is translated into
		a standard IPv6 packet without any information loss. Such a design matches the traffic pattern well: the header overhead is small when the packet size is small.</t>
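	  <t>As an illustrative sketch (the byte-string representation and helper names are ours, not part of the specification), the augment/prune operation at a level boundary can be modeled as:</t>
	  <sourcecode type="python"><![CDATA[
```python
def augment(addr: bytes, prefix: bytes) -> bytes:
    """Leaving a level: the boundary router prepends the level prefix
    so the address remains unique in the wider scope."""
    return prefix + addr

def prune(addr: bytes, prefix: bytes) -> bytes:
    """Entering a level: the shared prefix becomes implicit and is
    stripped, keeping only the minimum address bits."""
    assert addr.startswith(prefix)
    return addr[len(prefix):]

# A node with 1-byte address 0x2A inside pod 0x07: crossing the pod
# boundary grows the address from 1 byte to 2 bytes, and vice versa.
wide = augment(b"\x2a", b"\x07")
assert wide == b"\x07\x2a"
assert prune(wide, b"\x07") == b"\x2a"
```
]]></sourcecode>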
	  
	  <section anchor="app-ldst" numbered="true">
        <name>CAIN Traffic and Header Overhead</name>

        <t>
          In CAIN fabrics where Ethernet carries
          both scale-up (load/store memory semantics) and scale-out
          (RDMA message semantics) traffic, the CAIN header provides
          significant bandwidth efficiency gains for fine-grained
          memory access operations.
        </t>

        <t>
          Load/store operations access data at cache-line granularity
          (typically 64 bytes). With a standard IPv6 + UDP + BTH
          (RoCEv2) header stack of 60 bytes, the protocol overhead
          for a 64-byte payload is approximately 48% (60 of 124 bytes
          on the wire). The CAIN header with SAL=1 and DAL=1
          (intra-rack scale-up domain) reduces the header to 8 bytes;
          carried directly over CAIN, the same payload incurs an
          overhead of approximately 11% (8 of 72 bytes) -- a reduction
          factor of roughly 4x.
        </t>
      </section>

      <section anchor="app-hierarchy" numbered="true">
        <name>Hierarchy Mapping to Network Topology</name>

        <t>
          The SHIP hierarchy maps naturally to the physical topology
          of CAINs:
        </t>

        <artwork type="ascii-art"><![CDATA[
+------------+-------------+----------+--------+-----------------+
| SHIP Level | Fabric Tier | Address  | Typical| Dominant        |
|            |             | Length   | Scale  | Traffic Type    |
+------------+-------------+----------+--------+-----------------+
| L2 (leaf)  | Intra-node  | 1 byte   | 8-72   | LD/ST (memory   |
|            | scale-up    |          | GPUs   | semantics)      |
+------------+-------------+----------+--------+-----------------+
| L1 (mid)   | Intra-pod   | 2-3 byte | 100s-  | Mixed LD/ST     |
|            |             |          | 1000s  | and RDMA        |
+------------+-------------+----------+--------+-----------------+
| L0 (root)  | Cross-pod   | 4+ byte  | 10K+   | RDMA (message   |
|            | scale-out   |          | GPUs   | semantics)      |
+------------+-------------+----------+--------+-----------------+
| External   | Internet    | 16 byte  | global | IPv6            |
+------------+-------------+----------+--------+-----------------+
]]></artwork>

        <t>
          This mapping has a desirable property: the traffic type
          most sensitive to header overhead (LD/ST with small
          payloads) operates in the lowest hierarchy level where
          addresses are shortest. As traffic traverses higher levels
          of the hierarchy, payload sizes increase (RDMA bulk
          transfers for gradient synchronization), and the relative
          overhead of longer addresses diminishes.
        </t>
		
		<t>
          The following table illustrates the total header size for
          representative deployment scenarios. The baseline for
          comparison is the 40-byte IPv6 fixed header.
        </t>

        <artwork type="ascii-art"><![CDATA[
+---------------------+-----+-----+-------+--------+----------+
| Scenario            | SAL | DAL | Raw   | Padded | Savings  |
|                     |     |     | (B)   | (B)    | vs IPv6  |
+---------------------+-----+-----+-------+--------+----------+
| Intra-rack LD/ST    |  1  |  1  |   8   |    8   |   80%    |
| Intra-pod           |  2  |  2  |  10   |   12   |   70%    |
| Cross-pod           |  3  |  3  |  12   |   12   |   70%    |
| Cross-cluster       |  4  |  4  |  14   |   16   |   60%    |
| Edge-to-IPv6 (SA=4) |  4  |  0  |  26   |   28   |   30%    |
| Full IPv6 (both)    |  0  |  0  |  38   |   40   |    0%    |
+---------------------+-----+-----+-------+--------+----------+
]]></artwork>
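        <t>The table entries follow directly from the padding rule header_length = ceil4(6 + SAL_bytes + DAL_bytes); the sketch below (helper names ours) reproduces them:</t>
        <sourcecode type="python"><![CDATA[
```python
def header_sizes(sal, dal):
    """Return (raw, padded) CAIN header length for given SAL/DAL codes."""
    nbytes = lambda c: 16 if c == 0 else c    # 4-bit code -> address bytes
    raw = 6 + nbytes(sal) + nbytes(dal)
    return raw, (raw + 3) & ~3

# (SAL, DAL) codes for the scenarios in the table above
for name, sal, dal in [("Intra-rack LD/ST", 1, 1), ("Intra-pod", 2, 2),
                       ("Cross-pod", 3, 3), ("Cross-cluster", 4, 4),
                       ("Edge-to-IPv6", 4, 0), ("Full IPv6", 0, 0)]:
    raw, padded = header_sizes(sal, dal)
    savings = 100 * (40 - padded) // 40       # vs the 40-byte IPv6 header
    print(f"{name}: raw={raw}B padded={padded}B savings={savings}%")
```
]]></sourcecode>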
		
		
      </section>
	
	
	</section>
	
	<section anchor="imp">
		<name>Implementation Considerations</name>
		<t> CAIN header-based packet forwarding requires new functions on L3 switches.
			The cost analysis given in Appendix A shows that the hardware cost is low, the throughput and latency performance are on par with a traditional L3 switch,
			and the benefit is high. Specifically, the power and memory efficiency are even better than on a conventional L3 switch due to the simplified table lookups.
		</t>
	</section>
	
    <section anchor="IANA">
      <name>IANA Considerations</name>
      <t>This memo includes no request to IANA.</t>
    </section>
    
    <section anchor="Security">
      <name>Security Considerations</name>
      <t>TBD</t>
    </section>
    
  </middle>

  <back>
  
    <references>
      <name>References</name>
      <references>
        <name>Normative References</name>
        
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.2119.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8174.xml"/>
        
      </references>
 
      <references>
        <name>Informative References</name>
       
		<?rfc include='reference.I-D.herbert-sunh'?>
		<?rfc include='reference.I-D.song-ship-edge'?>
		<xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.6282.xml"/>
        <xi:include href="https://bib.ietf.org/public/rfc/bibxml/reference.RFC.8724.xml"/>
		
		<reference anchor="hot25" target="https://dl.acm.org/doi/epdf/10.1145/3772356.3772415">
		<front>
			<title>Your network doesn't end at the NIC: A case for unifying the inter-host and intra-host networks in (AI) datacenters</title>
			<author initials="R." surname="Joshi et al."/>
			<date year="2025"/>
		</front>
			<refcontent>24th ACM Workshop on Hot Topics in Networks</refcontent>
		</reference>


		<reference anchor="ub" target="https://www.computer.org/csdl/magazine/mi/2025/05/11150738/29JWPYIYbIc">
		<front>
			<title>UB-Mesh: A Hierarchically Localized nD-FullMesh Data Center Network Architecture</title>
			<author initials="H." surname="Liao et al."/>
			<date year="2025"/>
		</front>
			<refcontent>IEEE Micro</refcontent>
		</reference>
		
        <reference anchor="afh" target="https://docs.broadcom.com/doc/scale-up-ethernet-framework">
        <front>
            <title>Scale-Up Ethernet Framework Specification</title>
            <author>
              <organization>Broadcom</organization>
            </author>
            <date year="2025"/>
        </front>
        </reference>       
       
	    <reference anchor="gaudi" target="https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html">
        <front>
            <title>Intel Gaudi 3 AI Accelerator White Paper</title>
            <author>
              <organization>Intel</organization>
            </author>
            <date year="2025"/>
        </front>
        </reference>     
	   
      </references>
    </references>



	<section anchor="appendix-hw-cost" numbered="true" toc="include">
      <name>Hardware Cost Analysis</name>

    <section anchor="appendix-pipeline" numbered="true" toc="include">
      <name>LGR Hardware Processing Pipeline</name>

      <t>
        This appendix describes a reference hardware pipeline
        architecture for a level gateway router (the LGR in <xref target="I-D.song-ship-edge"/>) processing
        the CAIN header.
        The pipeline achieves line-rate forwarding with address
        augmentation and pruning in 5-6 clock cycles, comparable
        to standard IPv6 L3 switch pipelines.
      </t>


        <artwork type="ascii-art"><![CDATA[
  +-----------+   +-------------------+   +-----------+
  |  Stage 1  |-->|     Stage 2       |-->|  Stage 3  |
  |   Parse   |   | Extract + Resolve |   |  Lookup   |
  | (1 cycle) |   |    (1 cycle)      |   | (1-2 cyc) |
  +-----------+   +-------------------+   +-----------+
                                               |
  +-----------+   +-------------------+        |
  |  Stage 5  |<--|     Stage 4       |<-------+
  |   Emit    |   |   Header Edit     |
  | (1 cycle) |   |    (1 cycle)      |
  +-----------+   +-------------------+

  Total: 5-6 cycles at 1 GHz core clock = 5-6 ns latency
]]></artwork>
    
	</section>

    <section anchor="hw-comparison" numbered="true">
        <name>Comparison with Standard IPv6 Pipeline</name>

        <t>
          The following table compares the SHIP LGR pipeline with a
          standard IPv6 L3 switch pipeline across key implementation
          parameters.
        </t>

        <artwork type="ascii-art"><![CDATA[
+------------------------+--------------------+-------------------+
| Parameter              | Standard IPv6      | SHIP LGR          |
|                        | L3 Switch          | (4B-aligned)      |
+------------------------+--------------------+-------------------+
| Parse stages           | 1 cycle            | 1 cycle           |
| Direction/classify     | 1 cycle            | 1 cycle           |
| Forwarding lookup      | 1-2 cycles         | 1-2 cycles        |
| Header edit            | 1 cycle            | 1 cycle           |
| Emit                   | 1 cycle            | 1 cycle           |
+------------------------+--------------------+-------------------+
| Total pipeline depth   | 5-6 cycles         | 5-6 cycles        |
+------------------------+--------------------+-------------------+
| Lookup key width       | 128-bit (fixed)    | 8-128 bit (var)   |
| Lookup engine          | TCAM (LPM)         | SRAM (hash)       |
| Lookup power (relative)| ~10x               | ~1x               |
+------------------------+--------------------+-------------------+
]]></artwork>

        <t>
          The SHIP LGR pipeline depth is the same as that of the
          standard IPv6 pipeline. The forwarding lookup, however, is
          substantially more power-efficient because it uses SRAM-based
          hash tables instead of TCAM-based longest-prefix matching.
          In the most common intra-level forwarding case (SAL == DAL),
          the lookup key is only 1-4 bytes rather than the full 128-bit
          IPv6 address, further reducing the hash computation cost and
          SRAM access energy.
        </t>
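        <t>A minimal sketch of the intra-level exact-match lookup (the FIB contents and function names are hypothetical):</t>
        <sourcecode type="python"><![CDATA[
```python
# Hypothetical 1-byte-address FIB; a dict stands in for the SRAM hash table.
fib = {b"\x2a": "port3", b"\x07": "port1"}

def forward(dal_code, da: bytes):
    """Intra-level forwarding (SAL == DAL): the destination address itself
    is the exact-match key, so no longest-prefix match is required."""
    key_width = 16 if dal_code == 0 else dal_code
    assert len(da) == key_width
    return fib.get(da)

assert forward(1, b"\x2a") == "port3"
assert forward(1, b"\xff") is None       # miss: no such node at this level
```
]]></sourcecode>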
      </section>
	  
	  <section anchor="app-latency" numbered="true">
        <name>Latency Considerations</name>

        <t>
          The 5-6 ns LGR pipeline latency is within the same order
          of magnitude as current Ethernet switch ASICs. For
          intra-level forwarding (the common case for LD/ST traffic),
          no address modification is performed, and the pipeline
          reduces to a simple hash-lookup-and-forward path.
        </t>

        <t>
          LGR address augmentation and pruning add no additional
          latency beyond the base pipeline, as these operations
          execute within the existing header edit stage. The
          latency impact is felt only at hierarchy boundaries
          (LGR hops), which coincide with the topology boundaries
          where additional switch hops would exist regardless of
          the addressing scheme.
        </t>
      </section>
	  
    </section>

	
 </back>
</rfc>