[xv6 source code snooping] network card driver

Posted by grglaz on Sun, 23 Jan 2022 22:59:28 +0100

preface

  • This article is about my implementation of MIT 6.S081 (2020) Lab 11: Networking;
  • If you find any mistakes in the content, please don't hold back on pointing them out.

preparation

How should I put it... this last experiment is... interesting... (ashamed)

The test of this experiment has two parts: first, ARP communication between the host and xv6 running in qemu; second, xv6 running in qemu sends a query to Google's DNS server and prints the reply when it arrives. Our task in the lab is to implement the device-driver side handling of Ethernet frames.

The network protocol stack in xv6 is very simple. From top to bottom there are the application layer, transport layer, network layer, and data link layer. In this lab, the socket layer supports only UDP; the network layer includes IP and ARP; and the data link layer concerns itself only with handing the payload of an Ethernet frame up to the network layer. Between layers there is a Packet Buffer that caches the packets passed up and down, so that the protocol entities in each layer can encapsulate and unpack their PDUs.
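The unit of these packet buffers is the mbuf. Paraphrased from xv6's kernel/net.h (2020 lab code), it looks roughly like this:

/* kernel/net.h (paraphrased) */

#define MBUF_SIZE 2048

struct mbuf {
  struct mbuf  *next;          // the next mbuf in the chain
  char         *head;          // the current start position of the buffer
  unsigned int len;            // the length of the buffer
  char         buf[MBUF_SIZE]; // the backing store
};

Each layer pushes or pulls its own header at the head pointer, so the same buffer travels through the whole stack without copying.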

The protocol entities in each layer are implemented as independent, concurrently callable components: the interrupt handler, the IP Processing Thread, and the network application. The network card converts the electrical or optical signal into a digital signal, assembles it into a packet, puts it into the card's built-in memory queue, and then raises an interrupt to the CPU. The interrupt handler copies the packet out of the card into a Packet Buffer in RAM. The IP Processing Thread is a kernel thread that runs continuously; it processes the packets in RAM and hands them up to the Socket Layer's Packet Buffer. Finally, the Socket Layer processes each packet and delivers it to the application's Packet Buffer. Not counting the card's built-in queue, the receive path alone therefore involves three Packet Buffers in RAM.
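As a concrete example of one layer handing a packet to the next, the link-to-network step in xv6's kernel/net.c looks roughly like this (paraphrased; mbufpullhdr strips the Ethernet header off the front of the mbuf):

/* kernel/net.c (paraphrased) */

void
net_rx(struct mbuf *m)
{
  struct eth *ethhdr;
  uint16 type;

  // strip the ethernet header off the front of the buffer
  ethhdr = mbufpullhdr(m, *ethhdr);
  if (!ethhdr) {
    mbuffree(m);
    return;
  }

  // dispatch on the ethertype field
  type = ntohs(ethhdr->type);
  if (type == ETHTYPE_IP)
    net_rx_ip(m);
  else if (type == ETHTYPE_ARP)
    net_rx_arp(m);
  else
    mbuffree(m);
}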

We saw this queue design before, in the disk interrupt handler, where it was introduced to decouple the interrupt handler from the disk. Of course, this design is very common and not unique to disk drivers. The network card's queue plays a similar role, but it carries two extra pieces of semantics (a conceptual sketch follows the list):

  1. If the receiving end sees a short burst of heavy traffic, the hardware network card processes packets much faster than the software IP Processing Thread; the queue acts as a buffer that absorbs this speed difference;
  2. If the network card stays busy for a while and packets pile up in the send queue, those packets can go out as soon as the card becomes idle, which improves the card's utilization.
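To make the decoupling concrete, here is a minimal, conceptual single-producer/single-consumer ring (not xv6 code; a real driver would also need locking or memory barriers):

// NRING, struct pkt, and struct ring are illustrative names.
// Indices only ever grow; the ring is full when head - tail == NRING.
#define NRING 16

struct pkt { char data[1500]; int len; };

struct ring {
  struct pkt *slots[NRING];
  unsigned head;   // next free slot, advanced by the producer (NIC side)
  unsigned tail;   // next packet to consume, advanced by the consumer
};

static int
ring_push(struct ring *r, struct pkt *p)
{
  if (r->head - r->tail == NRING)
    return -1;                 // full: producer must drop or wait
  r->slots[r->head % NRING] = p;
  r->head++;
  return 0;
}

static struct pkt *
ring_pop(struct ring *r)
{
  if (r->head == r->tail)
    return 0;                  // empty: nothing to consume
  struct pkt *p = r->slots[r->tail % NRING];
  r->tail++;
  return p;
}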

Long ago, before DMA, the network card would receive a packet, put it into its built-in queue, and raise an interrupt to the CPU; the CPU would then run the interrupt handler to copy the data out of the card's queue, byte by byte, to a designated location in RAM for the IP Processing Thread to read. But having the CPU access data in a peripheral is an expensive operation; after all, peripherals are not as "close" to the CPU as memory is.

The E1000 network card used in this lab is more advanced. Using DMA (via the DMA Engine inside the card), it copies received packets directly to designated locations in RAM and only then raises an interrupt to the CPU. After responding to the interrupt, the interrupt handler no longer touches the card at all; it accesses the packets in memory directly. These locations are memory addresses agreed upon with the card when the host initializes it. Concretely, we keep a 16-entry array of packet buffers in memory (each large enough for a full Ethernet frame) to hold the packet payloads, plus a second 16-entry circular array of descriptors that hold pointers to those buffers. This circular array is called a DMA ring. Both transmit and receive have one: the TX ring for transmission and the RX ring for reception.
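The descriptor format is fixed by the E1000 hardware. Paraphrased from xv6's kernel/e1000_dev.h (using xv6's fixed-width typedefs), each 16-byte descriptor looks like this:

/* kernel/e1000_dev.h (paraphrased) */

struct tx_desc {
  uint64 addr;    // physical address of the packet buffer
  uint16 length;  // number of bytes to send
  uint8 cso;
  uint8 cmd;      // e.g. E1000_TXD_CMD_RS | E1000_TXD_CMD_EOP
  uint8 status;   // E1000_TXD_STAT_DD set when the card is done
  uint8 css;
  uint16 special;
};

struct rx_desc {
  uint64 addr;    // physical address of the packet buffer
  uint16 length;  // length of data DMAed into the buffer
  uint16 csum;    // packet checksum
  uint8 status;   // E1000_RXD_STAT_DD set when a packet has landed
  uint8 errors;
  uint16 special;
};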

Consider a performance graph of a router (shown in the lecture but not reproduced here); we focus on the distribution of the solid discrete points. The router's job is to receive packets on an input network card and push them out on another, output network card (the same card can both input and output packets, and the two cards here have the same configuration). DMA is not used here, so when the CPU wants a received packet it must access the network card rather than RAM. As the input pps (packets per second) increases, the output curve first rises and then falls. So the question is: why rise first and then fall, rather than rise forever, or rise and then stay flat?

The CPU has only two jobs here: first, run the interrupt handler to service the receiving card's interrupts and copy packets into RAM; second, run a kernel thread that copies packets into the sending card's buffer.

The reason the curve stops rising after a certain point is not that the sending card's bandwidth is too small, but that CPU utilization has reached 100%: at the current CPU performance it simply cannot process more packets. That's not to say the sending card's bandwidth can never be the bottleneck, but in most cases the card's bandwidth is ample relative to the CPU's performance.

Because every incoming packet triggers an interrupt, interrupt handling here is a large CPU overhead. If the network card keeps receiving packets, the CPU spends all its time handling interrupts and has no time left to move packets to the other card, so the rate at which the other card sends packets drops. (This phenomenon is called Live Lock, by analogy with Dead Lock: a deadlock never goes away on its own, but a livelock can.) At that point, moving received packets to the other card matters more than servicing interrupts, so the CPU simply turns interrupts off and switches from interrupt mode to polling mode to process the packets the receiving card has accepted (actively breaking the livelock rather than passively waiting for it to disappear). DMA, where available, also effectively mitigates this livelock.
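A conceptual sketch of that interrupt-to-polling switch (hypothetical device API, not xv6 code; this is the idea behind schemes like Linux's NAPI):

// All names here (struct nic, nic_disable_rx_irq, schedule_poll,
// nic_next_rx_packet, ...) are illustrative, not a real driver API.

void
nic_rx_interrupt(struct nic *nic)
{
  nic_disable_rx_irq(nic);     // stop the interrupt storm
  schedule_poll(nic);          // defer the work to a kernel thread
}

void
nic_poll(struct nic *nic)
{
  struct pkt *p;
  int budget = 64;             // bound the work done per poll round

  while (budget-- > 0 && (p = nic_next_rx_packet(nic)) != 0)
    deliver_to_ip_layer(p);

  if (nic_rx_ring_empty(nic))
    nic_enable_rx_irq(nic);    // quiet again: go back to interrupts
  else
    schedule_poll(nic);        // still busy: keep polling
}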

Experimental part

The task of the lab is to implement the top half of the network card driver, one function each for reading and writing, similar to the UART covered in the article on interrupts. The I/O initialization code and the bottom-half interrupt handler are already written for us, so what remains is to implement int e1000_transmit(struct mbuf *m) and void e1000_recv(void). The hints in the lab instructions describe the read and write procedures in detail, and the implementation basically follows them. So although the lab is hard, there isn't much code to write. (If you skipped the hints and worked only from the hardware manual, it really would be difficult.)

transmit:

/* kernel/e1000.c */

int
e1000_transmit(struct mbuf *m)
{
  //
  // Your code here.
  //
  // the mbuf contains an ethernet frame; program it into
  // the TX descriptor ring so that the e1000 sends it. Stash
  // a pointer so that it can be freed after sending.
  //

  
  // First ask the E1000 for the TX ring index
  // at which it's expecting the next packet,
  // by reading the E1000_TDT control register.
  acquire(&e1000_lock);
  uint64 tx_ring_index = regs[E1000_TDT];

  // Then check if the ring is overflowing.
  // If E1000_TXD_STAT_DD is not set in the descriptor
  // indexed by E1000_TDT, the E1000 hasn't finished
  // the corresponding previous transmission request,
  // so return an error.
  if ((tx_ring[tx_ring_index].status & E1000_TXD_STAT_DD) == 0) {
    release(&e1000_lock);
    return -1;
  }

  // Otherwise, use mbuffree() to free the last mbuf
  // that was transmitted from that descriptor (if there was one).
  if (tx_mbufs[tx_ring_index])
    mbuffree(tx_mbufs[tx_ring_index]);

  // Then fill in the descriptor.
  // m->head points to the packet's content in memory,
  // and m->len is the packet length.
  // Set the necessary cmd flags (look at Section 3.3 in the E1000 manual)
  // and stash away a pointer to the mbuf for later freeing.
  tx_ring[tx_ring_index].addr = (uint64)m->head;
  tx_ring[tx_ring_index].length = m->len;
  tx_ring[tx_ring_index].cmd = E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS;
  tx_mbufs[tx_ring_index] = m;


  // Finally, update the ring position by adding one to E1000_TDT modulo TX_RING_SIZE.
  regs[E1000_TDT] = (tx_ring_index + 1) % TX_RING_SIZE;
  release(&e1000_lock);
  
  return 0;
}
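A note on the cmd flags: E1000_TXD_CMD_EOP marks the descriptor as the end of a packet (our mbufs always hold a whole frame), and E1000_TXD_CMD_RS ("report status") asks the card to set E1000_TXD_STAT_DD in the descriptor once the send completes. That DD bit is exactly what the overflow check at the top of the function reads, so without RS the check would never see a descriptor become free again.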

receive:

/* kernel/e1000.c */

static void
e1000_recv(void)
{
  //
  // Your code here.
  //
  // Check for packets that have arrived from the e1000
  // Create and deliver an mbuf for each packet (using net_rx()).
  //

  // First ask the E1000 for the ring index 
  // at which the next waiting received packet (if any) is located,
  // by fetching the E1000_RDT control register and adding one modulo RX_RING_SIZE.
  while (1) {
    uint64 rx_ring_index = regs[E1000_RDT];
    rx_ring_index = (rx_ring_index + 1) % RX_RING_SIZE;

    // Then check if a new packet is available 
    // by checking for the E1000_RXD_STAT_DD bit 
    // in the status portion of the descriptor. If not, stop.
    if ((rx_ring[rx_ring_index].status & E1000_RXD_STAT_DD) == 0)
      break;

    // Otherwise, update the mbuf's m->len to the length reported in the descriptor.
    // Deliver the mbuf to the network stack using net_rx().
    rx_mbufs[rx_ring_index]->len = rx_ring[rx_ring_index].length;
    net_rx(rx_mbufs[rx_ring_index]);

    // Then allocate a new mbuf using mbufalloc() to replace the one just given to net_rx().
    // Program its data pointer (m->head) into the descriptor.
    // Clear the descriptor's status bits to zero.
    if ((rx_mbufs[rx_ring_index] = mbufalloc(0)) == 0)
      break;
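    // (Note: bailing out here leaves this descriptor without a fresh
    // buffer and E1000_RDT not advanced; a stricter driver would
    // panic rather than silently stop receiving.)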
    rx_ring[rx_ring_index].addr = (uint64)rx_mbufs[rx_ring_index]->head;
    rx_ring[rx_ring_index].status = 0;

    // Finally, update the E1000_RDT register to be the 
    // index of the last ring descriptor processed.
    regs[E1000_RDT] = rx_ring_index;

    // At some point the total number of packets
    // that have ever arrived will exceed the ring size (16); 
    // make sure your code can handle that.
  }
}

Course summary

The hardest of these eleven labs were the page table lab and the lock lab; after all, I still haven't passed their final tests... When I have time and feel I've forgotten xv6, I'll redo those two labs. Consider it a pit I've dug for myself. Beyond those, the other two important labs are process scheduling and the file system; what matters is understanding how they work. Finally, if there is still time, the fork in the mmap lab could be upgraded to a copy-on-write fork. In short, all the 6.S081 2020 labs have now been done from beginning to end, so this series can come to an end for the time being.

Working through this kind of foreign open course on your own is relatively hard, because there are no teachers or classmates: when you hit a problem you can only figure it out yourself or ask strangers on the Internet (though the MIT professor's lectures are very clear; he is, sure enough, a teacher at a famous school). No grades, no credits, and no deadlines mean that learning depends entirely on your own initiative, with no third-party pressure to push you along. So it's best to pick courses with sufficient tests (each lab should run its corresponding unit tests and then usertests as a regression test) and with automatic grading (make grade); otherwise you have no idea whether you're doing it right.

At the beginning I was completely unfamiliar with xv6, and didn't even understand some of the most basic operating system theory, so I scraped through the first and second labs only by leaning heavily on other people's finished code; only from the third lab, the page table one, did I start working independently. That's why the first two labs have no corresponding blog posts, while I brazenly wrote posts for the page table and lock labs even though their tests failed. Even when you ask others for help, don't look at the code they've written; otherwise it's meaningless.

I don't know what a "pragmatic" article should look like; the technical bar for blogging is zero. The purpose of writing this blog is mostly not to show others. I just feel that after learning something I need somewhere to record it, and the act of writing forces me to keep comparing and correcting my understanding. That, I think, is the pragmatic part: the output process itself is what matters. When to review the record afterwards is secondary, since I don't really intend to reread most of what I write.

Yours truly is just a creature who is woken up every day by his own lack of skill. Nothing more.
