SGI 10-Gigabit Ethernet Network Adapter User Manual

Browse online or download the networking user manual for the SGI 10-Gigabit Ethernet Network Adapter: the "InfiniBand and 10-Gigabit Ethernet for Dummies" user guide.

Designing Cloud and Grid Computing Systems
with InfiniBand and High-Speed Ethernet

A Tutorial at CCGrid '11

by

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: panda@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~panda

Sayantan Sur
The Ohio State University
E-mail: surs@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~surs

Table of Contents

Page 1 - A Tutorial at CCGrid '11

Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet. Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: panda@cse. …

Page 2

Hadoop Architecture • Underlying Hadoop Distributed File System (HDFS) • Fault tolerance by replicating data blocks • NameNode: stores information on dat…

Page 3 - Computing Systems

CCGrid '11. OpenFabrics Stack with Unified Verbs Interface: Verbs Interface (libibverbs), Mellanox (libmthca), QLogic (libipathverbs), IBM (libehca), Chelsio (li…

Page 4 - Cluster Computing Environment

• For IBoE and RoCE, the upper-level stacks remain completely unchanged • Within the hardware: – Transport and network layers remain completely unchange…

Page 5 - (http://www.top500.org)

CCGrid '11. OpenFabrics Software Stack: SA (Subnet Administrator), MAD (Management Datagram), SMA (Subnet Manager Agent), PMA (Performance Manager Agent), IPoIB (IP o…

Page 6 - Grid Computing Environment

CCGrid '11. InfiniBand in the Top500: the percentage share of InfiniBand is steadily increasing.

Page 7

Chart: number of Top500 systems by interconnect family (Gigabit Ethernet 45%, InfiniBand 43%, Proprietary 6%, remaining legend entries — Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, Cu… — at small shares).

Page 8 - Compute cluster

CCGrid '11. InfiniBand System Efficiency in the Top500 List (chart: efficiency (%) vs. Top500 system rank, 0-500). …

Page 9 - Cloud Computing Environments

• 214 IB clusters (42.8%) in the Nov '10 Top500 list (http://www.top500.org) • Installations in the Top 30 (13 systems): CCGrid '11. Large-scale Infi…

Page 10 - Hadoop Architecture

• HSE compute systems with ranking in the Nov 2010 Top500 list – 8,856-core installation at Purdue with ConnectX-EN 10GigE (#126) – 7,944-core installat…

Page 11 - Memcached Architecture

• HSE has most of its popularity in enterprise computing and other non-scientific markets, including wide-area networking • Example enterprise computing…

Page 12

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 13 - • Software components

Memcached Architecture • Distributed caching layer – Allows aggregating spare memory from multiple nodes – General purpose • Typically used to cache data…

Page 14 - • Ex: TCP/IP, UDP/IP

Modern Interconnects and Protocols (layer diagram: application; interfaces — Sockets, Verbs; protocol implementations — kernel-space TCP/IP, hardware-offloaded TCP/IP; Ethernet driver…

Page 15 - – Not scalable:

• Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • Inf…

Page 16 - Myrinet (1993 -) 1 Gbit/sec

CCGrid '11. Low-level Latency Measurements (chart: small-message latency (us) vs. message size (bytes) for VPI-IB, Native IB, VPI-Eth, and RoCE). …

Page 17

CCGrid '11. Low-level Uni-directional Bandwidth Measurements (chart: uni-directional bandwidth vs. message size for VPI-IB, Native IB, VPI-Eth, and RoCE). …

Page 18

• Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • Inf…

Page 19 - IB Trade Association

• High-performance MPI library for IB and HSE – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2) – Used by more than 1,550 organizations in 60 countries – More tha…

Page 20

CCGrid '11. One-way Latency: MPI over IB (chart: small-message latency (us) vs. message size (bytes); reported latencies 1.96, 1.54, 1.60, and 2.17 us; MVAP…
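
The latency curves summarized here come from a ping-pong microbenchmark in the style of the OSU micro-benchmarks that accompany MVAPICH. A minimal C sketch of that measurement loop (illustrative only, not the tutorial's code):

    /* Minimal MPI ping-pong latency sketch (run with exactly 2 ranks).
     * One-way latency is reported as half the averaged round-trip time. */
    #include <mpi.h>
    #include <stdio.h>

    #define ITERS 1000
    #define MSG_SIZE 4              /* small-message case, as on the slide */

    int main(int argc, char **argv) {
        int rank;
        char buf[MSG_SIZE];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)              /* one-way = half the round trip */
            printf("one-way latency: %.2f us\n",
                   (t1 - t0) * 1e6 / (2.0 * ITERS));
        MPI_Finalize();
        return 0;
    }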

Page 21 - • I/O interface bottlenecks

CCGrid '11. Bandwidth: MPI over IB (chart: unidirectional bandwidth (MillionBytes/sec) vs. message size (bytes); reported peaks 2665.6, 3023.7, 1901.1, 1553…

Page 22

CCGrid '11. One-way Latency: MPI over iWARP (chart: latency vs. message size for Chelsio (TCP/IP), Chelsio (iWARP), Intel-NetEffect (TCP/IP), and Intel-NetEffect (iWARP)). Mess…

Page 23

CCGrid '11. Bandwidth: MPI over iWARP (chart: unidirectional bandwidth (MillionBytes/sec) vs. message size (bytes); reported peaks 839.8, 1169.7, 373.3, 1245.0…

Page 24 - (not shown)

• Good System Area Networks with excellent performance (low latency, high bandwidth, and low CPU utilization) for inter-processor communication (IPC) a…

Page 25

CCGrid '11. Convergent Technologies: MPI Latency (chart: small-message latency (us) vs. message size (bytes)). …

Page 26

CCGrid '11. Convergent Technologies: MPI Uni- and Bi-directional Bandwidth (chart for Native IB, VPI-IB, VPI-Eth, and RoCE; uni-directional…

Page 27 - • Myricom GM

• Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • Inf…

Page 28 - IB Hardware Acceleration

CCGrid '11. IPoIB vs. SDP Architectural Models (diagram: traditional model vs. possible SDP model; sockets application, sockets API, kernel TCP/I…

Page 29 - • Hardware Checksum Engines

CCGrid '11. SDP vs. IPoIB (IB QDR) (charts: bandwidth (MBps) and latency vs. message size for IPoIB-RC, IPoIB-UD, and SDP). …

Page 30 - TOE and iWARP Accelerators

• Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • Inf…

Page 31

• Option 1: Layer-1 optical networks – IB standard specifies link, network and transport layers – Can use any layer-1 (though the standard says copper a…

Page 32

Features • End-to-end guaranteed-bandwidth channels • Dynamic, in-advance reservation and provisioning of fractional/full lambdas • Secure control plane…

Page 33

• Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY • Idea is to make remote storage "appear" local • IB-WAN switch does frame conversion – IB standard allow…

Page 34 - 2003 (Gen1), 2007 (Gen2)

CCGrid '11. InfiniBand over SONET: Obsidian Longbows RDMA throughput measurements over USN (topology: Linux host at ORNL, 700 miles, Linux host, Chicago CDCI, Seattle CDCI, Su…

Page 35

• Hardware components – Processing cores and memory subsystem – I/O bus or links – Network adapters/switches • Software components – Communication stack • B…

Page 36 - IB, HSE and their Convergence

CCGrid '11. IB over 10GE LAN-PHY and WAN-PHY (topology: Linux host at ORNL, 700 miles, Linux host, Seattle CDCI, ORNL CDCI, Longbow IB/S; 3300 miles and 4300 miles; ORNL lo…

Page 37 - Traditional Ethernet

MPI over IB-WAN: Obsidian Routers (Cluster A and Cluster B over a WAN link between two Obsidian WAN routers; emulated delay vs. distance: 10 us ≈ 2 km, 100 us ≈ 20 km, 1000 us ≈ 200 km, 10000 us ≈ 2000 km)…

Page 38 - IB Overview

Communication Options in Grid • Multiple options exist to perform data transfer on the Grid • The Globus-XIO framework currently does not support IB natively • W…

Page 39 - Components: Channel Adapters

Globus-XIO Framework with ADTS Driver (diagram: Globus XIO driver #n; data connection management, persistent session management, buffer & file management, data transport i…

Page 40 - • Switches: intra-subnet

Performance of Memory-Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • ADTS…

Page 41 - – Not directly addressable

Performance of Disk-Based Data Transfer • Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files • Predic…

Page 42

Application-Level Performance (chart: bandwidth (MBps) for target applications CCSM and Ultra-Viz, ADTS vs. IPoIB) • Application performance for FTP get opera…

Page 43 - IB Communication Model

• Low-level Network Performance • Clusters with Message Passing Interface (MPI) • Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB) • Inf…

Page 44 - Queue Pair Model

A New Approach towards OFA in Cloud (diagram: current approach vs. OFA in cloud; application, accelerated sockets, 10 GigE or InfiniBand, verbs / hardware offload). Curr…

Page 45 - Memory Registration

Memcached Design Using Verbs • Server and client perform a negotiation protocol – Master thread assigns clients to the appropriate worker thread • Once a cli…

Page 46 - Memory Protection

• Ex: TCP/IP, UDP/IP • Generic architecture for all networks • Host processor handles almost all aspects of communication – Data buffering (copies on sen…

Page 47 - (Send/Receive Model)

Memcached Get Latency • Memcached get latency – 4 bytes – DDR: 6 us; QDR: 5 us – 4K bytes – DDR: 20 us; QDR: 12 us • Almost a factor-of-four improvement ove…
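
The get numbers above come from clients issuing repeated gets of a fixed-size value. A minimal libmemcached client sketch of such a loop (a hedged illustration with an assumed local server and hypothetical key; not the benchmark code behind the slide):

    /* Sketch: time repeated memcached gets (libmemcached C API). */
    #include <libmemcached/memcached.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        memcached_st *memc = memcached_create(NULL);
        memcached_server_add(memc, "127.0.0.1", 11211);  /* assumed server */

        const char *key = "k";                 /* hypothetical key */
        memcached_set(memc, key, strlen(key), "abcd", 4, (time_t)0, 0);

        enum { ITERS = 10000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            size_t len; uint32_t flags; memcached_return_t rc;
            char *val = memcached_get(memc, key, strlen(key),
                                      &len, &flags, &rc);
            free(val);                         /* value is malloc'd for us */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("avg get latency: %.2f us\n", us / ITERS);
        memcached_free(memc);
        return 0;
    }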

Page 48 - Hardware ACK

Memcached Get TPS • Memcached get transactions per second for 4 bytes – On IB DDR, about 600K/s for 16 clients – On IB QDR, 1.9M/s for 16 clients • Almost…

Page 49

Hadoop: Java Communication Benchmark • Sockets-level ping-pong bandwidth test • Java performance depends on usage of NIO (allocateDirect) • C and Java ve…

Page 50

Hadoop: DFS IO Write Performance • DFS IO, included in Hadoop, measures sequential-access throughput • We have two map tasks, each writing to a file of in…

Page 51 - Hardware Protocol Offload

Hadoop: RandomWriter Performance • Each map generates 1 GB of random binary data and writes to HDFS • SSD improves execution time by 50% with 1GigE for t…

Page 52 - • Switching and Multicast

Hadoop Sort Benchmark • Sort: baseline benchmark for Hadoop • Sort phase: I/O-bound; reduce phase: communication-bound • SSD improves performance by 28%…

Page 53 - Buffering and Flow Control

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 54 - Virtual Lanes

• Presented network architectures & trends for clusters, Grid, multi-tier datacenters, and cloud computing systems • Presented background and detail…

Page 55 - Service Levels and QoS

CCGrid '11. Funding Acknowledgments: funding support by … equipment support by … (sponsor logos).

Page 56 - Traffic Segregation Benefits

CCGrid '11. Personnel Acknowledgments. Current students: N. Dandapanthula (M.S.), R. Darbha (M.S.), V. Dhanraj (M.S.), J. Huang (Ph.D.), J. Jose (P…

Page 57 - Identifiers)

• Traditionally relied on bus-based technologies (last-mile bottleneck) – E.g., PCI, PCI-X – One bit per wire – Performance increase through: • Increasing…

Page 58 - Switch Complex

CCGrid '11. Web Pointers: http://www.cse.ohio-state.edu/~panda, http://www.cse.ohio-state.edu/~surs, http://nowlab.cse.ohio-state.edu, MVAPICH web page: http…

Page 59 - – 3D Torus (Sandia Red Sky)

• Network speeds saturated at around 1 Gbps – Features provided were limited – Commodity networks were not considered scalable enough for very large-scal…

Page 60 - More on Multipathing

• Industry networking standards • InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks • InfiniBand aimed at…

Page 61 - IB Multicast Example

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 62

• IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun) • Goal: To design a scalable and high…

Page 63 - IB Transport Services

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 64 - Reliability

• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step • Goal: To achieve a scalable and high-performanc…

Page 65 - Transport Layer Capabilities

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks. CCGrid '11. Tackling Communication Bottlenecks with IB and…

Page 66 - Data Segmentation

• Bit-serial differential signaling – Independent pairs of wires to transmit independent data (called a lane) – Scalable to any number of lanes – Easy to…

Page 67 - Transaction Ordering

CCGrid '11. Network Speed Acceleration with IB and HSE: Ethernet (1979 - ) 10 Mbit/sec; Fast Ethernet (1993 - ) 100 Mbit/sec; Gigabit Ethernet (1995 - ) 10…

Page 68 - Message-level Flow-Control

Chart: bandwidth per direction (Gbps), 2005-2011 roadmap; labels include 32G-IB-DDR, 48G-IB-DDR, 96G-IB-QDR, 48G-IB-QDR, 200G-IB-EDR, 112G-IB-FDR, 300G-IB-EDR, 1…

Page 69

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks. CCGrid '11. Tackling Communication Bottlenecks with IB and…

Page 70

• Intelligent network interface cards • Support entire protocol processing completely in hardware (hardware protocol offload engines) • Provide a rich c…

Page 71 - Concepts in IB Management

• Fast Messages (FM) – Developed by UIUC • Myricom GM – Proprietary protocol stack from Myricom • These network stacks set the trend for high-performance…

Page 72 - Subnet Manager

• Some IB models have multiple hardware accelerators – E.g., Mellanox IB adapters • Protocol offload engines – Completely implement ISO/OSI layers 2-4 (l…

Page 73

• Interrupt coalescing – Improves throughput but degrades latency • Jumbo frames – No latency impact; incompatible with existing switches • Hardware chec…

Page 74 - HSE Overview

CCGrid '11. Current and Next-Generation Applications and Computing Systems • Diverse range of applications – Processing and dataset characteristics…

Page 75 - Differences

• TCP Offload Engines (TOE) – Hardware acceleration for the entire TCP/IP stack – Initially patented by Tehuti Networks – Actually refers to the IC on th…

Page 76 - – Multi Stream Semantics

• Also known as "Datacenter Ethernet" or "Lossless Ethernet" – Combines a number of optional Ethernet standards into one umbrella as mandatory requirem…

Page 77

• Network speed bottlenecks • Protocol processing bottlenecks • I/O interface bottlenecks. CCGrid '11. Tackling Communication Bottlenecks with IB and…

Page 78

• InfiniBand initially intended to replace I/O bus technologies with networking-like technology – That is, bit-serial differential signaling – With enha…

Page 79

• Recent trends in I/O interfaces show that they are nearly matching head-to-head with network speeds (though they still lag a little bit). CCGrid '…

Page 80

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 81

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-spee…

Page 82

CCGrid '11. Comparing InfiniBand with the Traditional Networking Stack: application layer (MPI, PGAS, file systems); transport layer (OpenFabrics verbs); RC (rel…

Page 83 - Offloaded TCP

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection •…

Page 84

• Used by processing and I/O units to connect to the fabric • Consume & generate IB packets • Programmable DMA engines with protection features • May hav…

Page 85 - Myrinet Express (MX)

CCGrid '11. Cluster Computing Environment (diagram: compute cluster, LAN, frontend, meta-data manager, I/O server nodes, compute nodes, meta-data, da…

Page 86 - Datagram Bypass Layer (DBL)

• Relay packets from one link to another • Switches: intra-subnet • Routers: inter-subnet • May support multicast. CCGrid '11. Components: Switches and Ro…

Page 87 - • Solarflare approach:

• Network links – Copper, optical, printed-circuit wiring on a backplane – Not directly addressable • Traditional adapters built for copper cabling – Restr…

Page 88

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection •…

Page 89

CCGrid '11. IB Communication Model (figure: basic InfiniBand communication semantics).

Page 90 - Hardware

• Each QP has two queues – Send Queue (SQ) – Receive Queue (RQ) – Work requests are queued to the QP (WQEs: "Wookies") • QP to be linked to a Completion Que…
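
A minimal libibverbs sketch of this queue-pair model (illustrative only; it assumes the first device on the node and elides error handling): open a device, create a Completion Queue, and create an RC QP whose SQ and RQ both report into that CQ.

    /* Sketch: create the CQ and QP objects described above (libibverbs). */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void) {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);      /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = cq,           /* completions for the Send Queue */
            .recv_cq = cq,           /* completions for the Receive Queue */
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,   /* Reliable Connection transport */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        printf("QP number: 0x%x\n", qp->qp_num);
        /* Next: move the QP through INIT/RTR/RTS and exchange QP numbers
         * and LIDs with the peer before posting WQEs. */
        return 0;
    }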

Page 91 - IB Transport

1. Registration request • Send virtual address and length 2. Kernel handles virtual-to-physical mapping and pins region into physical memory • Process…

Page 92 - IB iWARP/HSE RoE RoCE

• To send or receive data, the l_key must be provided to the HCA • HCA verifies access to local memory • For RDMA, the initiator must have the r_key for the r…
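
In verbs, the registration steps and the two keys described on these two pages look roughly as follows (a hedged fragment continuing the QP sketch above; register_buffer is a hypothetical helper): ibv_reg_mr pins the pages and returns an ibv_mr whose lkey/rkey play the roles of l_key and r_key.

    /* Sketch: register a buffer with the HCA; the kernel pins its pages
     * and the returned keys authorize local and remote access. */
    #include <infiniband/verbs.h>
    #include <stdlib.h>

    enum { BUF_SIZE = 4096 };

    struct ibv_mr *register_buffer(struct ibv_pd *pd, void **buf_out) {
        void *buf = malloc(BUF_SIZE);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        /* mr->lkey: quoted in every local SGE (the l_key above).
         * mr->rkey: handed to the peer so its RDMA operations can
         * name this region (the r_key above). */
        *buf_out = buf;
        return mr;
    }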

Page 93

CCGrid '11. Communication in the Channel Semantics (Send/Receive Model) (figure: InfiniBand devices and memories; CQ; QP send/recv queues; memory segment; send…
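
In code, the channel semantics of this figure reduce to the receiver pre-posting a receive WQE and the sender posting a send WQE that the HCA matches against it. A hedged fragment continuing the sketches above (buf, mr, qp, and cq as created earlier; needs <stdint.h> for uintptr_t):

    /* Sketch: channel (send/receive) semantics. */
    struct ibv_sge sge = { .addr = (uintptr_t)buf,
                           .length = BUF_SIZE, .lkey = mr->lkey };

    /* Receiver: post a receive WQE before the send arrives. */
    struct ibv_recv_wr rwr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_rwr;
    ibv_post_recv(qp, &rwr, &bad_rwr);

    /* Sender: post a send WQE; the HCA pairs it with a posted receive. */
    struct ibv_send_wr swr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                               .opcode = IBV_WR_SEND,
                               .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad_swr;
    ibv_post_send(qp, &swr, &bad_swr);

    /* Either side: reap the completion from the CQ. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;   /* poll until one completion arrives */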

Page 94 - IB Hardware Products

CCGrid '11. Communication in the Memory Semantics (RDMA Model) (figure: InfiniBand devices and memories; CQ; QP send/recv queues; memory segment; send WQE cont…
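
The memory-semantics (RDMA) version differs only in the work request: the initiator names the target buffer directly, using the remote virtual address and r_key obtained out of band. Another hedged fragment (remote_addr and remote_rkey are assumed to have been exchanged during connection setup):

    /* Sketch: one-sided RDMA write; no receive WQE is consumed and the
     * target CPU is not involved in the transfer. */
    struct ibv_send_wr wr = { .wr_id = 3, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_RDMA_WRITE,
                              .send_flags = IBV_SEND_SIGNALED };
    wr.wr.rdma.remote_addr = remote_addr;   /* peer virtual address */
    wr.wr.rdma.rkey        = remote_rkey;   /* peer r_key */
    struct ibv_send_wr *bad_wr;
    ibv_post_send(qp, &wr, &bad_wr);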

Page 95 - Tyan Thunder S2935 Board

CCGrid '11. Communication in the Memory Semantics (Atomics) (figure: InfiniBand devices and memories; CQ; QP send/recv queues; memory segment; send WQE contain…

Page 96 - IB Hardware Products (contd.)

CCGrid '11. Trends for Computing Clusters in the Top500 List (http://www.top500.org): Nov. 1996: 0/500 (0%); Nov. 2001: 43/500 (8.6%); Nov. 2006: 361…

Page 97 - – Nortel Networks

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection •…

Page 98 - • Support for VPI and RoCE

CCGrid '11. Hardware Protocol Offload: complete hardware implementations exist.

Page 99 - – OFED 1.6 is underway

• Buffering and Flow Control • Virtual Lanes, Service Levels and QoS • Switching and Multicast. CCGrid '11. Link/Network Layer Capabilities.

Page 100 - (libibverbs)

• IB provides three levels of communication throttling/control mechanisms – Link-level flow control (link-layer feature) – Message-level flow control (t…

Page 101 - • Within the hardware:

• Multiple virtual links within the same physical link – Between 2 and 16 • Separate buffers and flow control – Avoids head-of-line blocking • VL15: reserved…

Page 102 - OpenFabrics Software Stack

• Service Level (SL): – Packets may operate at one of 16 different SLs – Meaning not defined by IB • SL-to-VL mapping: – SL determines which VL on the nex…

Page 103 - InfiniBand in the Top500

• InfiniBand virtual lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link • Providing the benefits of i…

Page 104 - SP Switch

• Each port has one or more associated LIDs (Local Identifiers) – Switches look up which port to forward a packet to based on its destination LID (DLID…

Page 105

• Basic unit of switching is a crossbar – Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars • Switches available in the ma…

Page 106 - CCGrid '11

• Someone has to set up the forwarding tables and give every port an LID – the "Subnet Manager" does this work • Different routing algorithms give different…

Page 107 - • Integrated Systems

CCGrid '11. Grid Computing Environment (diagram: compute cluster, LAN, frontend, meta-data manager, I/O server nodes, compute nodes, data…

Page 108 - Other HSE Installations

• Similar to basic switching, except… – the sender can utilize multiple LIDs associated with the same destination port • Packets sent to one DLID take a fix…

Page 109 - Presentation Overview

CCGrid '11. IB Multicast Example (figure).

Page 110 - InfiniBand

CCGrid '11. Hardware Protocol Offload: complete hardware implementations exist.

Page 111 - Case Studies

• Each transport service can have zero or more QPs associated with it – E.g., you can have four QPs based on RC and one QP based on UD. CCGrid '11. IB…

Page 112 - Message Size (bytes)

CCGrid '11. Trade-offs in Different Transport Types (table: attributes of Reliable Connection, Reliable Datagram, eXtended Reliable Connection, Unreliable Connection, and Unrel…

Page 113 - Bandwidth (MBps)

• Data Segmentation • Transaction Ordering • Message-level Flow Control • Static Rate Control and Auto-negotiation. CCGrid '11. Transport Layer Capabili…

Page 114

• IB transport layer provides a message-level communication granularity, not byte-level (unlike TCP) • Application can hand over a large message – Netwo…
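
To make the message-level granularity concrete (a hedged illustration assuming a 2 KB IB MTU, one common setting): an application posts a single 1 MiB send WQE, and the HCA alone segments it into

    1 MiB / 2 KiB = 1,048,576 B / 2,048 B = 512 packets

which the receiving HCA reassembles before generating a single completion, so the host CPUs never do per-packet work. Under TCP, the same transfer is a byte stream segmented and reassembled in host software.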

Page 115 - MVAPICH/MVAPICH2 Software

• IB follows a strong transaction ordering for RC • Sender network adapter transmits messages in the order in which WQEs were posted • Each QP utilizes…

Page 116 - One-way Latency: MPI over IB

• Also called end-to-end flow control – Does not depend on the number of network hops • Separate from link-level flow control – Link-level flow contro…

Page 117 - Bandwidth: MPI over IB

• IB allows link rates to be statically changed – On a 4X link, we can set data to be sent at 1X – For heterogeneous links, the rate can be set to the lowes…

Page 118

CCGrid '11. Multi-Tier Datacenters and Enterprise Computing (diagram: enterprise multi-tier datacenter; tiers 1-3, routers/servers, switch, database server, appli…

Page 119 - Bandwidth: MPI over iWARP

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics • Communication Model • Memory registration and protection •…

Page 120

• Agents – Processes or hardware units running on each adapter, switch, router (everything on the network) – Provide capability to query and set paramet…

Page 121 - Convergent Technologies:

CCGrid '11. Subnet Manager (figure: active and inactive links; compute nodes, switch, subnet manager; multicast join and multicast setup messages; multica…

Page 122

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-spee…

Page 123 - InfiniBand CA

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic an…

Page 124 - SDP vs. IPoIB (IB QDR)

CCGrid '11. IB and HSE RDMA Models: Commonalities and Differences (table: IB vs. iWARP/HSE — Hardware Acceleration: Supported/Supported; RDMA: Supported/Supported; Atomi…

Page 125

• RDMA Protocol (RDMAP) – Feature-rich interface – Security management • Remote Direct Data Placement (RDDP) – Data placement and delivery – Multi-stream s…

Page 126 - IB on the WAN

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic an…

Page 127 - Features

• Place data as it arrives, whether in or out of order • If data is out of order, place it at the appropriate offset • Issues from the application's per…

Page 128 - "appear" local

• Part of the Ethernet standard, not iWARP – Network vendors use a separate interface to support it • Dynamic bandwidth allocation to flows based on int…

Page 129 - Sunnyvale

CCGrid '11. Integrated High-End Computing Environments (diagram: compute cluster, meta-data manager, I/O server nodes, meta-data, data, compute node…

Page 130 - 3300 miles 4300 miles

• Can allow for simple prioritization: – E.g., connection 1 performs better than connection 2 – 8 classes provided (a connection can be in any class) • S…

Page 131 - Cluster B

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic an…

Page 132 - Communication Options in Grid

• Regular Ethernet adapters and TOEs are fully compatible • Compatibility with iWARP required • Software iWARP emulates the functionality of iWARP on th…

Page 133 - Modern WAN

CCGrid '11. Different iWARP Implementations (diagram: regular Ethernet adapters; application, high-performance sockets, sockets, network adapter, TCP, IP, device driver, offl…

Page 134 - Data Transfer

• High-speed Ethernet Family – Internet Wide-Area RDMA Protocol (iWARP) • Architecture and Components • Features – Out-of-order data placement – Dynamic an…

Page 135 - IPoIB-64MB

• Proprietary communication layer developed by Myricom for their Myrinet adapters – Third-generation communication layer (after FM and GM) – Supports My…

Page 136 - Application Level Performance

• Another proprietary communication layer developed by Myricom – Compatible with regular UDP sockets (embraces and extends) – Idea is to bypass the kern…

Page 137

CCGrid '11. Solarflare Communications: OpenOnload Stack (diagram: typical HPC networking stack vs. typical commodity networking stack) • HPC networking stack provi…

Page 138

• InfiniBand – Architecture and Basic Hardware Components – Communication Model and Semantics – Novel Features – Subnet Management and Services • High-spee…

Page 139 - Memcached Design Using Verbs

• Single network firmware to support both IB and Ethernet • Autosensing of layer-2 protocol – Can be configured to automatically work with either IB or…

Page 140 - Memcached Get Latency

CCGrid '11. Cloud Computing Environments (diagram: LAN, physical machines with VMs, virtual FS, meta-data, I/O server, data, I/…

Page 141 - Memcached Get TPS

• Native convergence of IB network and transport layers with the Ethernet link layer • IB packets encapsulated in Ethernet frames • IB network layer already…

Page 142 - Bandwidth with C version

• Very similar to IB over Ethernet – Often used interchangeably with IBoE – Can be used to explicitly specify that the link layer is Converged (Enhanced) Etherne…

Page 143

CCGrid '11. IB and HSE: Feature Comparison (table: IB / iWARP-HSE / RoE / RoCE — Hardware Acceleration: Yes/Yes/Yes/Yes; RDMA: Yes/Yes/Yes/Yes; Congestion Control: Yes/Opti…

Page 144

• Introduction • Why InfiniBand and High-speed Ethernet? • Overview of IB, HSE, their Convergence and Features • IB and HSE HW/SW Products and Installati…

Page 145 - Hadoop Sort Benchmark

• Many IB vendors: Mellanox+Voltaire and QLogic – Aligned with many server vendors: Intel, IBM, SUN, Dell – And many integrators: Appro, Advanced Cluste…

Page 146

CCGrid '11. Tyan Thunder S2935 board (courtesy Tyan); similar boards from Supermicro with LOM features are also available.

Page 147 - Concluding Remarks

• Customized adapters to work with IB switches – Cray XD1 (formerly by OctigaBay), Cray CX1 • Switches: – 4X SDR and DDR (8-288 ports); 12X SDR (small si…

Page 148 - Funding Acknowledgments

• 10GE adapters: Intel, Myricom, Mellanox (ConnectX) • 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel) • 40GE adapters: Mellanox ConnectX2-…

Page 149 - Personnel Acknowledgments

• Mellanox ConnectX adapter • Supports IB and HSE convergence • Ports can be configured to support IB or HSE • Support for VPI and RoCE – 8 Gbps (SDR), 16…

Page 150 - Web Pointers

• Open-source organization (formerly OpenIB) – www.openfabrics.org • Incorporates both IB and iWARP in a unified manner – Support for Linux and Windows –…
