Troubleshooting the Elastic Network Adapter (ENA)
The Elastic Network Adapter (ENA) is designed to improve operating system health and reduce the chances of long-term disruption caused by unexpected hardware behavior or failures. The ENA architecture keeps device or driver failures as transparent to the system as possible. This topic provides troubleshooting information for ENA.
If you are unable to connect to your instance, start with the Troubleshooting Connectivity Issues section.
If you are able to connect to your instance, you can gather diagnostic information by using the failure detection and recovery mechanisms that are covered in the later sections of this topic.
Troubleshooting Connectivity Issues
If you lose connectivity while enabling enhanced networking, the ena module might be incompatible with your instance's currently running kernel. This can happen if you install the module for a specific kernel version (without dkms, or with an improperly configured dkms.conf file) and then your instance kernel is updated. If the instance kernel that is loaded at boot time does not have the ena module properly installed, your instance does not recognize the network adapter and becomes unreachable.
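As a rough way to check for the mismatch described above, the following sketch looks for an ena module file under the module directory of a given kernel version. The function name has_ena_module and the /lib/modules/<kernel-version>/ layout are assumptions (the layout is common on Linux distributions but is not stated in this topic). A kernel with ENA support built in rather than loaded as a module would not be detected this way.

```shell
# has_ena_module KERNEL_VERSION [MODULES_ROOT]
# Hypothetical helper: exit status 0 if a file named ena.ko (optionally
# compressed, for example ena.ko.xz) exists under MODULES_ROOT/KERNEL_VERSION,
# otherwise exit status 1.
# Assumption: MODULES_ROOT defaults to /lib/modules.
has_ena_module() {
    kver=$1
    root=${2:-/lib/modules}

    # No module directory at all for this kernel version.
    [ -d "$root/$kver" ] || return 1

    # Succeeds only if find printed at least one matching path.
    [ -n "$(find "$root/$kver" -name 'ena.ko*')" ]
}

# Example: check the running kernel.
# if has_ena_module "$(uname -r)"; then
#     echo "ena module found for $(uname -r)"
# else
#     echo "ena module missing for $(uname -r)"
# fi
```

Because the problem appears when the kernel loaded at boot differs from the one the module was built for, it may be more useful to pass the version of the newly installed kernel rather than the output of uname -r.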
If you enable enhanced networking for a PV instance or AMI, this can also make your instance unreachable.
If your instance becomes unreachable after enabling enhanced networking with ENA, you can disable the enaSupport attribute for your instance and it will fall back to the stock network adapter.
To disable enhanced networking with ENA (EBS-backed instances)
From your local computer, stop the instance using the Amazon EC2 console or the following command: stop-instances (AWS CLI). If your instance is managed by AWS OpsWorks, you should stop the instance in the AWS OpsWorks console so that the instance state remains in sync.
Important
If you are using an instance store-backed instance, you can't stop the instance. Instead, follow the procedure To disable enhanced networking with ENA (instance store-backed instances).
From your local computer, disable the enhanced networking attribute using the following command.
modify-instance-attribute (AWS CLI)
$ aws ec2 modify-instance-attribute --instance-id instance_id --no-ena-support
From your local computer, start the instance using the Amazon EC2 console or the following command: start-instances (AWS CLI). If your instance is managed by AWS OpsWorks, you should start the instance in the AWS OpsWorks console so that the instance state remains in sync.
(Optional) Connect to your instance and try reinstalling the ena module with your current kernel version by following the steps in Enabling Enhanced Networking with the Elastic Network Adapter (ENA) on Linux Instances in a VPC.
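The stop, modify, and start steps above could be chained in a small script, sketched below for an EBS-backed instance. The function name disable_ena and the AWS_CMD override are hypothetical, and the aws ec2 wait instance-stopped call is an addition not mentioned in the procedure; it is included on the assumption that the attribute should only be changed once the instance has fully stopped. The sketch does not cover instances managed by AWS OpsWorks, which should be stopped and started from the AWS OpsWorks console.

```shell
# disable_ena INSTANCE_ID
# Hypothetical helper that runs the three steps of the procedure in order.
# AWS_CMD is an illustrative override (for example AWS_CMD=echo) that lets
# you preview the commands; it defaults to the real AWS CLI.
disable_ena() {
    instance_id=$1
    aws_cmd=${AWS_CMD:-aws}

    # Step 1: stop the instance, then wait until it has fully stopped
    # (the wait is an addition to the documented steps).
    $aws_cmd ec2 stop-instances --instance-ids "$instance_id" || return 1
    $aws_cmd ec2 wait instance-stopped --instance-ids "$instance_id" || return 1

    # Step 2: disable the enhanced networking attribute.
    $aws_cmd ec2 modify-instance-attribute --instance-id "$instance_id" --no-ena-support || return 1

    # Step 3: start the instance again.
    $aws_cmd ec2 start-instances --instance-ids "$instance_id"
}

# Preview only: print the commands instead of running them.
# The instance ID below is a placeholder.
# AWS_CMD=echo disable_ena i-1234567890abcdef0
```

Stopping at the first failed step avoids changing the attribute on an instance that did not actually stop.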
To disable enhanced networking with ENA (instance store-backed instances)
If your instance is an instance store-backed instance, create a new AMI as described in Creating an Instance Store-Backed Linux AMI. Be sure to disable the enhanced networking enaSupport attribute when you register the AMI.
register-image (AWS CLI)
$ aws ec2 register-image --no-ena-support ...
Keep-Alive Mechanism
The ENA device posts keep-alive events at a fixed rate (usually once every second). The ENA driver implements a watchdog mechanism, which periodically checks for the presence of these keep-alive messages. If one or more messages are present, the watchdog is rearmed; otherwise, the driver concludes that the device experienced a failure and then does the following:
Dumps its current statistics to syslog
Resets the ENA device
Resets the ENA driver state
The above reset procedure may result in some traffic loss for a short period of time (TCP connections should be able to recover), but should not otherwise affect the user.
The ENA device may also indirectly request a device reset by not sending a keep-alive notification. For example, this can happen if the ENA device reaches an unknown state after loading an irrecoverable configuration.
Below is an example of the reset procedure:
[18509.800135] ena 0000:00:07.0 eth1: Keep alive watchdog timeout. // The watchdog process initiates a reset
[18509.815244] ena 0000:00:07.0 eth1: Trigger reset is on
[18509.825589] ena 0000:00:07.0 eth1: tx_timeout: 0 // The driver logs the current statistics
[18509.834253] ena 0000:00:07.0 eth1: io_suspend: 0
[18509.842674] ena 0000:00:07.0 eth1: io_resume: 0
[18509.850275] ena 0000:00:07.0 eth1: wd_expired: 1
[18509.857855] ena 0000:00:07.0 eth1: interface_up: 1
[18509.865415] ena 0000:00:07.0 eth1: interface_down: 0
[18509.873468] ena 0000:00:07.0 eth1: admin_q_pause: 0
[18509.881075] ena 0000:00:07.0 eth1: queue_0_tx_cnt: 0
[18509.888629] ena 0000:00:07.0 eth1: queue_0_tx_bytes: 0
[18509.895286] ena 0000:00:07.0 eth1: queue_0_tx_queue_stop: 0
.......
........
[18511.280972] ena 0000:00:07.0 eth1: free uncompleted tx skb qid 3 idx 0x7 // At the end of the down process, the driver discards incomplete packets.
[18511.420112] [ENA_COM: ena_com_validate_version] ena device version: 0.10 //The driver begins its up process
[18511.420119] [ENA_COM: ena_com_validate_version] ena controller version: 0.0.1 implementation version 1
[18511.420127] [ENA_COM: ena_com_admin_init] ena_defs : Version:[b9692e8] Build date [Wed Apr 6 09:54:21 IDT 2016]
[18512.252108] ena 0000:00:07.0: Device watchdog is Enabled
[18512.674877] ena 0000:00:07.0: irq 46 for MSI/MSI-X
[18512.674933] ena 0000:00:07.0: irq 47 for MSI/MSI-X
[18512.674990] ena 0000:00:07.0: irq 48 for MSI/MSI-X
[18512.675037] ena 0000:00:07.0: irq 49 for MSI/MSI-X
[18512.675085] ena 0000:00:07.0: irq 50 for MSI/MSI-X
[18512.675141] ena 0000:00:07.0: irq 51 for MSI/MSI-X
[18512.675188] ena 0000:00:07.0: irq 52 for MSI/MSI-X
[18512.675233] ena 0000:00:07.0: irq 53 for MSI/MSI-X
[18512.675279] ena 0000:00:07.0: irq 54 for MSI/MSI-X
[18512.772641] [ENA_COM: ena_com_set_hash_function] Feature 10 isn't supported
[18512.772647] [ENA_COM: ena_com_set_hash_ctrl] Feature 18 isn't supported
[18512.775945] ena 0000:00:07.0: Device reset completed successfully // The reset process is complete
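To see how often this reset procedure has run, you could scan the kernel log for the first and last messages of the sequence shown above. The following is a minimal sketch; the function name count_watchdog_resets is hypothetical, and the match strings are copied from the example log, so they may differ in other driver versions.

```shell
# count_watchdog_resets
# Hypothetical helper: reads kernel log text on stdin and prints how many
# keep-alive watchdog timeouts and how many completed device resets it saw.
# Assumption: the message text matches the example log in this topic.
count_watchdog_resets() {
    awk '
        /Keep alive watchdog timeout/         { timeouts++ }
        /Device reset completed successfully/ { resets++ }
        END { printf "timeouts=%d resets=%d\n", timeouts, resets }
    '
}

# Example:
# dmesg | count_watchdog_resets
```

A timeout count that is higher than the reset count may suggest a reset that did not complete, in which case the surrounding log lines are worth reading in full.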
Register Read Timeout
The ENA architecture makes limited use of memory-mapped I/O (MMIO) read operations. MMIO registers are accessed by the ENA device driver only during its initialization procedure.
If the driver logs (available in dmesg output) indicate failures of read operations, this may be caused by an incompatible or incorrectly compiled driver, a busy hardware device, or hardware failure.
Intermittent log entries that indicate failures on read operations should not be considered an issue; the driver retries the operations in this case. However, a sequence of log entries containing read failures indicates a driver or hardware problem.
Below is an example of driver log entries indicating a read operation failure due to a timeout:
[ 47.113698] [ENA_COM: ena_com_reg_bar_read32] reading reg failed for timeout. expected: req id[1] offset[88] actual: req id[57006] offset[0]
[ 47.333715] [ENA_COM: ena_com_reg_bar_read32] reading reg failed for timeout. expected: req id[2] offset[8] actual: req id[57007] offset[0]
[ 47.346221] [ENA_COM: ena_com_dev_reset] Reg read32 timeout occurred
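To judge whether read failures are intermittent or recurring, it can help to count them. The sketch below tallies the two message types from the example above; the function name summarize_reg_read_timeouts is hypothetical, and the match strings are taken from the example log, so other driver versions may word them differently. This topic does not define how many failures make up a problematic sequence, so the sketch only reports counts.

```shell
# summarize_reg_read_timeouts
# Hypothetical helper: reads kernel log text on stdin and prints the number
# of register read timeouts and of "Reg read32 timeout occurred" reports.
# Assumption: the message text matches the example log in this topic.
summarize_reg_read_timeouts() {
    awk '
        /reading reg failed for timeout/ { reads++ }
        /Reg read32 timeout occurred/    { reports++ }
        END { printf "read_timeouts=%d read32_timeout_reports=%d\n", reads, reports }
    '
}

# Example:
# dmesg | summarize_reg_read_timeouts
```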
Statistics
If you experience insufficient network performance or latency issues, you should retrieve the device statistics and examine them. These statistics can be obtained using ethtool, as shown below:
[ec2-user ~]$ ethtool -S ethN
NIC statistics:
tx_timeout: 0
io_suspend: 0
io_resume: 0
wd_expired: 0
interface_up: 1
interface_down: 0
admin_q_pause: 0
queue_0_tx_cnt: 4329
queue_0_tx_bytes: 1075749
queue_0_tx_queue_stop: 0
...
The command output parameters are described below:
tx_timeout: N
The number of times that the Netdev watchdog was activated.
io_suspend: N
Unsupported. This value should always be zero.
io_resume: N
Unsupported. This value should always be zero.
wd_expired: N
The number of times that the driver did not receive the keep-alive event in the preceding 3 seconds.
interface_up: N
The number of times that the ENA interface was brought up.
interface_down: N
The number of times that the ENA interface was brought down.
admin_q_pause: N
The admin queue is in an unstable state. This value should always be zero.
queue_N_tx_cnt: N
The number of transmitted packets for queue N.
queue_N_tx_bytes: N
The number of transmitted bytes for queue N.
queue_N_tx_queue_stop: N
The number of times that queue N was full and stopped.
queue_N_tx_queue_wakeup: N
The number of times that queue N resumed after being stopped.
queue_N_tx_dma_mapping_err: N
Direct memory access error count. If this value is not 0, it indicates low system resources.
queue_N_tx_napi_comp: N
The number of times the napi handler called napi_complete for queue N.
queue_N_tx_poll: N
The number of times the napi handler was scheduled for queue N.
queue_N_tx_doorbells: N
The number of transmission doorbells for queue N.
queue_N_tx_linearize: N
The number of times SKB linearization was attempted for queue N.
queue_N_tx_linearize_failed: N
The number of times SKB linearization failed for queue N.
queue_N_tx_prepare_ctx_err: N
The number of times ena_com_prepare_tx failed for queue N. This value should always be zero; if not, see the driver logs.
queue_N_tx_missing_tx_comp: N
The number of packets that were left uncompleted for queue N. This value should always be zero.
queue_N_tx_bad_req_id: N
Invalid req_id for queue N. Valid req_id values range from zero to queue_size minus 1.
queue_N_rx_cnt: N
The number of received packets for queue N.
queue_N_rx_bytes: N
The number of received bytes for queue N.
queue_N_rx_refil_partial: N
The number of times the driver did not succeed in refilling the empty portion of the rx queue with the buffers for queue N. If this value is not zero, it indicates low memory resources.
queue_N_rx_bad_csum: N
The number of times the rx queue had a bad checksum for queue N (only if rx checksum offload is supported).
queue_N_rx_page_alloc_fail: N
The number of times that page allocation failed for queue N. If this value is not zero, it indicates low memory resources.
queue_N_rx_skb_alloc_fail: N
The number of times that SKB allocation failed for queue N. If this value is not zero, it indicates low system resources.
queue_N_rx_dma_mapping_err: N
Direct memory access error count. If this value is not 0, it indicates low system resources.
queue_N_rx_bad_desc_num: N
Too many buffers per packet. If this value is not 0, it indicates usage of very small buffers.
queue_N_rx_small_copy_len_pkt: N
Optimization: For packets smaller than this threshold, which is set by sysfs, the packet is copied directly to the stack to avoid allocation of a new page.
ena_admin_q_aborted_cmd: N
The number of admin commands that were aborted. This usually happens during the auto-recovery procedure.
ena_admin_q_submitted_cmd: N
The number of admin queue doorbells.
ena_admin_q_completed_cmd: N
The number of admin queue completions.
ena_admin_q_out_of_space: N
The number of times that the driver tried to submit a new admin command, but the queue was full.
ena_admin_q_no_completion: N
The number of times that the driver did not get an admin completion for a command.
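Several of the counters above are described as values that should always be zero. As a quick first pass over the statistics, the following sketch prints only those counters when they are nonzero. The function name nonzero_should_be_zero is hypothetical, and the sketch assumes the name: value layout shown in the example ethtool output. It does not check the counters that indicate low memory or system resources.

```shell
# nonzero_should_be_zero
# Hypothetical helper: reads `ethtool -S` output on stdin and prints, as
# name=value, each counter that this topic says should always be zero
# (io_suspend, io_resume, admin_q_pause, queue_N_tx_prepare_ctx_err,
# queue_N_tx_missing_tx_comp) but that currently is not.
nonzero_should_be_zero() {
    awk -F': *' '
        # Strip the leading indentation that ethtool puts before each name.
        { gsub(/^[ \t]+/, "", $1) }

        $1 ~ /^(io_suspend|io_resume|admin_q_pause)$/ ||
        $1 ~ /^queue_[0-9]+_tx_(prepare_ctx_err|missing_tx_comp)$/ {
            if ($2 + 0 != 0) print $1 "=" $2
        }
    '
}

# Example (replace ethN with your interface name):
# ethtool -S ethN | nonzero_should_be_zero
```

No output means that none of the checked counters is nonzero.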
Driver Error Logs in syslog
The ENA driver writes log messages to syslog during system boot. You can examine these logs to look for errors if you are experiencing issues. Below is an example of information logged by the ENA driver in syslog during system boot, along with some annotations for select messages.
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 478.416939] [ENA_COM: ena_com_validate_version] ena device version: 0.10
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 478.420915] [ENA_COM: ena_com_validate_version] ena controller version: 0.0.1 implementation version 1
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.256831] ena 0000:00:03.0: Device watchdog is Enabled
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.672947] ena 0000:00:03.0: creating 8 io queues. queue size: 1024
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.680885] [ENA_COM: ena_com_init_interrupt_moderation] Feature 20 isn't supported // Interrupt moderation is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.691609] [ENA_COM: ena_com_get_feature_ex] Feature 10 isn't supported // RSS HASH function configuration is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.694583] [ENA_COM: ena_com_get_feature_ex] Feature 18 isn't supported //RSS HASH input source configuration is not supported by the device
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.697433] [ENA_COM: ena_com_set_host_attributes] Set host attribute isn't supported
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.701064] ena 0000:00:03.0 (unnamed net_device) (uninitialized): Cannot set host attributes
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 479.704917] ena 0000:00:03.0: Elastic Network Adapter (ENA) found at mem f3000000, mac addr 02:8a:3c:1e:13:b5 Queues 8
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 480.805037] EXT4-fs (xvda1): re-mounted. Opts: (null)
Jun 3 22:37:46 ip-172-31-2-186 kernel: [ 481.025842] NET: Registered protocol family 10
Which errors can I ignore?
The following warnings that may appear in your system's error logs can be ignored for the Elastic Network Adapter:
Set host attribute isn't supported
Host attributes are not supported for this device.
failed to alloc buffer for rx queue
This is a recoverable error, and it indicates that there may have been a memory pressure issue when the error was thrown.
Feature X isn't supported
The referenced feature is not supported by the Elastic Network Adapter. Possible values for X include:
10: RSS Hash function configuration is not supported for this device.
12: RSS Indirection table configuration is not supported for this device.
18: RSS Hash Input configuration is not supported for this device.
20: Interrupt moderation is not supported for this device.
Failed to config AENQ
The Elastic Network Adapter does not support AENQ configuration.
Trying to set unsupported AENQ events
This error indicates an attempt to set an AENQ events group that is not supported by the Elastic Network Adapter.
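When you review driver logs, it may help to hide the warnings listed above so that the remaining messages stand out. The following is a minimal sketch; the function name filter_ignorable_ena_warnings is hypothetical, and the patterns are copied from the list above, so they assume that the log text matches exactly.

```shell
# filter_ignorable_ena_warnings
# Hypothetical helper: reads log text on stdin and prints every line except
# those that match one of the warnings this topic lists as safe to ignore.
# Only feature numbers 10, 12, 18, and 20 are filtered; other
# "Feature X isn't supported" lines are kept.
filter_ignorable_ena_warnings() {
    grep -v -E \
        -e "Set host attribute isn't supported" \
        -e 'failed to alloc buffer for rx queue' \
        -e "Feature (10|12|18|20) isn't supported" \
        -e 'Failed to config AENQ' \
        -e 'Trying to set unsupported AENQ events'
}

# Example:
# dmesg | grep -i ena | filter_ignorable_ena_warnings
```

Note that grep -v returns a nonzero exit status when every input line was filtered out, which in this case simply means that nothing unexpected was logged.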