Sending the Data from the Socket through UDP and TCP

In this chapter, we look at what happens in the transport layer as data is transmitted. When a user writes data into an open socket, the transport layer allocates socket buffers, which travel through the transport layer to IP, where they are routed and passed to the device drivers for sending. Specifying SOCK_DGRAM in the socket call invokes the UDP protocol, and specifying SOCK_STREAM invokes the TCP protocol. For SOCK_DGRAM type sockets, the process is relatively simple, but it is far more complicated for SOCK_STREAM type sockets. We will examine both UDP and TCP and look at the functions that interface each protocol to the socket layer. Next, we will focus on the sendmsg function for each of the protocols and follow the data as it flows through the transport layer.

Socket Layer Glue

Before we start looking at the internals of each of the transport layer protocols, we should look at how service functions in the transport layer protocols are associated with the socket layer functions. Through this mechanism, the application program is able to direct the actions of the transport layer for each of the socket types, SOCK_STREAM and SOCK_DGRAM. As explained in "Linux Sockets," each of the two transport protocols is registered with the socket layer.

The Proto Structure

The key to this registration process is the data structure, proto, which is defined in linux/include/net/sock.h. Most of the fields in the proto structure are function pointers, each corresponding to a specific function in a transport protocol. A transport protocol does not have to implement every function; for example, UDP does not have a shutdown function. UDP and TCP do implement most of the functions in the proto structure. The first eight functions, close through shutdown, are described in Sections for UDP and for TCP.

struct proto {
void (*close)(struct sock *sk, long timeout);
int (*connect)(struct sock *sk, struct sockaddr *uaddr, int addr_len);
int (*disconnect)(struct sock *sk, int flags);
struct sock *(*accept)(struct sock *sk, int flags, int *err);
int (*ioctl)(struct sock *sk, int cmd, unsigned long arg);
int (*init)(struct sock *sk);
int (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);

The getsockopt and setsockopt functions and options for both UDP and TCP are discussed in detail later in this section.

int (*setsockopt)(struct sock *sk, int level, int optname, char *optval, int optlen);
int (*getsockopt)(struct sock *sk, int level, int optname, char *optval, int *option);

The sendmsg function is discussed in Section for UDP and TCP.

int (*sendmsg)(struct sock *sk,struct msghdr *msg, int len);

Recvmsg is covered in "Receiving the Data in the Transport Layer, UDP and TCP."

int (*recvmsg)(struct sock *sk, struct msghdr *msg, int len, int noblock, int flags, int *addr_len);

The bind function is not implemented by either TCP or UDP within the transport protocols themselves. Instead, it is implemented at the socket layer, covered in "Linux Sockets and Socket Layer Programming."

int (*bind)(struct sock *sk, struct sockaddr *uaddr, int addr_len);

Backlog_rcv is implemented by TCP. Refer to "Receiving Data in the Transport Layer, UDP and TCP" to see what happens when backlog_rcv is executed.

int (*backlog_rcv)(struct sock *sk, struct sk_buff *skb);

The hash and unhash functions are for manipulating hash tables. These tables are for associating the endpoint addresses (port numbers) with open sockets. The tables map transport protocol port numbers to instances of struct sock. Hash places a reference to the sock structure, sk, in the hash table.

void (*hash)(struct sock *sk);

Unhash removes the reference to sk from the hash table.

void (*unhash)(struct sock *sk);

Get_port obtains a local port for the sock structure, sk. The requested port number is passed in snum; if snum is zero, the protocol selects a free port. Generally, the port is checked against one of the protocol’s port hash tables.

int (*get_port)(struct sock *sk,unsigned short snum);

This field contains the name of the protocol, either “UDP” or “TCP”.

char name[32];
struct {
int inuse;
u8 __pad[SMP_CACHE_BYTES - sizeof(int)];
} stats[NR_CPUS];
} ;

Neither UDP nor TCP implements all of the functions in the proto structure. As we saw earlier, the AF_INET family provides pointers to default functions that are called from the socket layer when the specific transport protocol doesn’t implement a particular function. Each of the transport protocols registers a set of functions by initializing a data structure of type struct proto, defined in the file sock.h.
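The function-pointer table with a fallback to defaults can be illustrated in user space. The following is a simplified sketch, not kernel code: mini_sock, mini_proto, mini_dgram_prot, and the sock_layer_* functions are hypothetical names, and the NULL check stands in for the way the AF_INET layer supplies its own default functions.

```c
#include <stddef.h>

struct mini_sock { int id; };            /* stand-in for struct sock */

/* A cut-down operations table in the style of struct proto. */
struct mini_proto {
    const char *name;
    int (*sendmsg)(struct mini_sock *sk, const char *buf, int len);
    int (*shutdown)(struct mini_sock *sk, int how);   /* may be NULL */
};

static int dgram_sendmsg(struct mini_sock *sk, const char *buf, int len)
{
    (void)sk;
    (void)buf;
    return len;                          /* pretend every byte was sent */
}

/* Default used when a protocol leaves the slot empty. */
static int default_shutdown(struct mini_sock *sk, int how)
{
    (void)sk;
    (void)how;
    return 0;
}

/* Filled in at compile time; shutdown is deliberately not implemented. */
static struct mini_proto mini_dgram_prot = {
    .name    = "MINI-DGRAM",
    .sendmsg = dgram_sendmsg,
    /* .shutdown stays NULL */
};

/* Socket-layer side: dispatch through the table. */
int sock_layer_sendmsg(struct mini_proto *prot, struct mini_sock *sk,
                       const char *buf, int len)
{
    return prot->sendmsg(sk, buf, len);
}

/* Socket-layer side: use the protocol's handler or fall back. */
int sock_layer_shutdown(struct mini_proto *prot, struct mini_sock *sk, int how)
{
    if (prot->shutdown)
        return prot->shutdown(sk, how);
    return default_shutdown(sk, how);
}
```

The socket-layer functions never need to know which protocol they are talking to; they only see the table, which is the point of the registration mechanism.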

The Msghdr Structure

All the socket layer read and write functions are translated into calls to either recvmsg or sendmsg, a BSD type message communication method. Within the socket layer, the internal functions use the msghdr structure, defined in file linux/include/linux/socket.h, to pass data to and from the underlying protocols.

struct msghdr {

The msg_name field is also known as the socket "name" or the destination address for this message. Generally, this field is cast into a pointer to a sockaddr_in. The msg_namelen field is the length of the address in msg_name.

void * msg_name;
int msg_namelen;

Msg_iov points to an array of data blocks passed either to the kernel from the application or from the kernel to the application. Msg_iovlen holds the number of data blocks pointed to by msg_iov. The msg_control field is for ancillary (control) data, such as BSD style file descriptor passing. Msg_controllen is the length in bytes of the control message buffer.

struct iovec * msg_iov;
__kernel_size_t msg_iovlen;
void * msg_control;
__kernel_size_t msg_controllen;
unsigned msg_flags;
} ;
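From the application side, the same fields are filled in before calling sendmsg. The following is a minimal user-space sketch, assuming a connected AF_UNIX datagram socketpair so that msg_name can stay empty; the function name gather_send_demo is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: send a header and a body as ONE datagram using two
 * iovec blocks, then read the datagram back into out.
 * Returns the number of bytes received, or -1 on error. */
ssize_t gather_send_demo(const char *hdr, const char *body,
                         char *out, size_t out_len)
{
    int sv[2];
    struct iovec iov[2];
    struct msghdr msg;
    ssize_t n = -1;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    iov[0].iov_base = (void *)hdr;       /* first data block */
    iov[0].iov_len  = strlen(hdr);
    iov[1].iov_base = (void *)body;      /* second data block */
    iov[1].iov_len  = strlen(body);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name    = NULL;              /* connected pair: no destination */
    msg.msg_namelen = 0;
    msg.msg_iov     = iov;
    msg.msg_iovlen  = 2;                 /* number of blocks in iov */

    if (sendmsg(sv[0], &msg, 0) >= 0)
        n = recv(sv[1], out, out_len, 0);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

The receiver sees the two blocks concatenated into a single message, which is how the protocols below the socket layer treat the iovec array.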

UDP Socket Glue

As we saw earlier, the transport protocols register with the socket layer by adding a pointer to a proto structure. UDP creates an instance of struct proto at compile time in the file linux/net/ipv4/udp.c and initializes it with values from Table.

Protocol Block Functions for UDP, Struct proto

The UDP protocol is invoked when the application layer specifies SOCK_DGRAM in the type field of the socket call. SOCK_DGRAM type sockets are fairly simple. There is no connection management or buffering. A call to one of the send functions in the application layer causes the data to be sent out immediately as a single datagram. Table shows the UDP protocol functions mapped to each of the fields in the proto structure described earlier.
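The one-send, one-datagram behavior can be seen from user space. The following is a rough sketch, assuming the loopback interface is available; the function name udp_first_datagram_len is hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical demo: send two datagrams to a loopback receiver and
 * return the size of the first one read back, or -1 on error. */
ssize_t udp_first_datagram_len(const char *first, const char *second)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    char buf[256];
    ssize_t n = -1;
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);

    if (rx < 0 || tx < 0)
        goto out;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                   /* let the kernel pick a free port */
    if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    /* Each sendto call produces its own datagram. */
    if (sendto(tx, first, strlen(first), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        sendto(tx, second, strlen(second), 0,
               (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* One recv returns exactly one datagram, never a merged stream. */
    n = recv(rx, buf, sizeof(buf), 0);
out:
    if (rx >= 0)
        close(rx);
    if (tx >= 0)
        close(tx);
    return n;
}
```

Even though both datagrams are queued before recv is called, the first read returns only the bytes of the first send; a SOCK_STREAM socket gives no such boundary guarantee.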

TCP Socket Glue

Like UDP, TCP registers a set of functions with the socket layer. As in UDP, this is done at compile time, in the file linux/net/ipv4/tcp_ipv4.c, by initializing tcp_prot. Tcp_prot is an instance of the proto structure and is initialized with the function pointers shown in Table.

Protocol Block Functions for TCP, Struct proto

Socket Options for TCP

In general, TCP is very configurable. The discussion of the internals of the TCP protocol later in this chapter refers to various options and how they affect the performance or operation of the protocol. Section shows the TCP options structure that holds the values of many of the socket options. However, in this section, the TCP socket options and ioctl configuration options are gathered together in one place. Although most of these are covered in some fashion in the tcp(7) man page, this section lists applicable internal constants and internal variables as well as any references to other sections in the text. The following options are set via the setsockopt system call or read back with the getsockopt system call.

TCP_CORK: If this option is set, TCP doesn’t send out frames until there is enough data to fill the maximum segment size. It allows the application to stop transmission if the route MTU is less than the Maximum Segment Size (MSS). This option is unique to Linux, and application code using it will not be portable to other operating systems (OSs). This option is held in the nonagle field in the TCP options structure, which is set to the number two. TCP_CORK is mutually exclusive with the TCP_NODELAY option.
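An application typically corks the socket, writes several small pieces, and then removes the cork to let the accumulated data go. The following is a minimal sketch of the two calls involved, assuming a Linux system; the helper names tcp_set_cork and tcp_get_cork are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: turn corking on (1) or off (0) for a TCP socket.
 * Returns 0 on success, -1 on error with errno set by setsockopt. */
int tcp_set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Hypothetical helper: read the corking state back.
 * Returns 1 or 0, or -1 on error. */
int tcp_get_cork(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) < 0)
        return -1;
    return on;
}
```

A sender would call tcp_set_cork(fd, 1), issue its writes, and then call tcp_set_cork(fd, 0) to flush whatever is still held back.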

TCP_DEFER_ACCEPT: The application caller may sleep until data arrives at the socket, at which time it is awakened. The socket is also awakened when it times out. The caller specifies the number of seconds to wait for data to arrive. This option is unique to Linux, and application code using it will not be portable to other OSs. The option value is converted to the number of ticks and is kept in the defer_accept field of the TCP option structure.
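Setting the option from an application might look like the following sketch, assuming a Linux system; the helper name tcp_set_defer_accept is hypothetical, and the option is normally applied to the listening socket before accept is called.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask TCP to wake the accepting caller only when
 * data has arrived, or after roughly `seconds` have passed.
 * Returns 0 on success, -1 on error with errno set by setsockopt. */
int tcp_set_defer_accept(int listen_fd, int seconds)
{
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof(seconds));
}
```

Because the kernel converts the value internally, reading the option back with getsockopt may not return exactly the number of seconds that was set.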

TCP_INFO: The caller using this option can retrieve lots of configuration information about the socket. This is a Linux-unique option, and code using it will not necessarily be portable to other OSs. The information is returned in the tcp_info structure, defined in file


struct tcp_info {

The first field, tcpi_state, contains the current TCP state for the connection. The other fields in this structure contain statistics about the TCP connection.

__u8 tcpi_state;
__u8 tcpi_ca_state;
__u8 tcpi_retransmits;
__u8 tcpi_probes;
__u8 tcpi_backoff;
__u8 tcpi_options;
__u8 tcpi_snd_wscale : 4, tcpi_rcv_wscale : 4;
__u32 tcpi_rto;
__u32 tcpi_ato;
__u32 tcpi_snd_mss;
__u32 tcpi_rcv_mss;
__u32 tcpi_unacked;
__u32 tcpi_sacked;
__u32 tcpi_lost;
__u32 tcpi_retrans;
__u32 tcpi_fackets;

The following four fields are event time stamps; however, we don’t actually remember when an ack was sent in all circumstances.

__u32 tcpi_last_data_sent;
__u32 tcpi_last_ack_sent;
__u32 tcpi_last_data_recv;
__u32 tcpi_last_ack_recv;

The last fields are TCP metrics, such as path MTU, slow-start thresholds, round-trip time, and congestion window.

__u32 tcpi_pmtu;
__u32 tcpi_rcv_ssthresh;
__u32 tcpi_rtt;
__u32 tcpi_rttvar;
__u32 tcpi_snd_ssthresh;
__u32 tcpi_snd_cwnd;
__u32 tcpi_advmss;
__u32 tcpi_reordering;
} ;

TCP_KEEPCNT: By using this option, the caller can set the number of keepalive probes that TCP will send for this socket before dropping the connection. This option is unique to Linux and should not be used in portable code. The field keepalive_probes in the tcp_opt structure is set to the value of this option. For this option to be effective, the socket level option SO_KEEPALIVE must also be set.

TCP_KEEPIDLE: With this option, the caller may specify the number of seconds that the connection will stay idle before TCP starts to send keepalive probe packets. This option is only effective if the socket option SO_KEEPALIVE is also set for this socket. This is also a nonportable Linux option. The value of this option is stored in the keepalive_time field in the TCP options structure. The value is normally set to a default of two hours.

TCP_KEEPINTVL: This option, also a nonportable Linux option, is used to specify the number of seconds between transmissions of keepalive probes. The value of this option is stored in the keepalive_intvl field in the TCP options structure and is initialized to a value of 75 seconds.
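The three keepalive options are usually set together with SO_KEEPALIVE. The following is a minimal sketch, assuming a Linux system; the helper name tcp_enable_keepalive and the values in the usage note are illustrative only.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive on a TCP socket and tune it.
 *   idle_secs  - idle time before the first probe (TCP_KEEPIDLE)
 *   intvl_secs - time between probes (TCP_KEEPINTVL)
 *   probes     - unanswered probes before the connection is dropped
 *                (TCP_KEEPCNT)
 * Returns 0 on success, -1 on the first failing setsockopt. */
int tcp_enable_keepalive(int fd, int idle_secs, int intvl_secs, int probes)
{
    int on = 1;

    /* Without SO_KEEPALIVE the TCP-level values have no effect. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof(idle_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof(intvl_secs)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof(probes)) < 0)
        return -1;
    return 0;
}
```

For example, tcp_enable_keepalive(fd, 60, 10, 5) asks for a first probe after 60 idle seconds, then up to five probes spaced 10 seconds apart.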

TCP_LINGER2: This option may be set to specify how long an orphaned socket in the FIN_WAIT_2 state should be kept alive. The option is unique to Linux and therefore is not portable. If the value is set to zero, the option is turned off and Linux uses normal processing for the FIN_WAIT_2 and TIME_WAIT states. One aspect of this option is not documented anywhere; if the value is less than zero, the socket proceeds immediately to the CLOSED state from the FIN_WAIT_2 state without passing through the TIME_WAIT state. The value associated with this option is kept in the linger2 field of the tcp_opt structure. The default value is determined by the sysctl, tcp_fin_timeout.

TCP_MAXSEG: This option specifies the maximum segment size set for a TCP socket before the connection is established. The advertised MSS value sent to the peer is determined by this option but won’t exceed the interface’s MTU. The two TCP peers for this connection may renegotiate the segment size. See Section for more details on how MSS is used by tcp_sendmsg.

TCP_NODELAY: When set, this option disables the Nagle algorithm. The value is stored in the nonagle field of the tcp_opt structure. This option may not be used if the option TCP_CORK is set. When TCP_NODELAY is set, TCP will send out data as soon as possible without waiting for enough data to fill a segment.
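Disabling the Nagle algorithm from an application is a one-line setsockopt call. The following is a minimal sketch; the helper names tcp_set_nodelay and tcp_get_nodelay are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: disable (on = 1) or re-enable (on = 0) the
 * Nagle algorithm. Returns 0 on success, -1 on error. */
int tcp_set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Hypothetical helper: read the setting back.
 * Returns 1 or 0, or -1 on error. */
int tcp_get_nodelay(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, &len) < 0)
        return -1;
    return on;
}
```

Interactive protocols that exchange many small messages are the usual candidates for this option, since they cannot afford to wait for a full segment.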

TCP_QUICKACK: This option may be used to turn off delayed acknowledgment by setting the value to one, or enable delayed acknowledgment by setting to a zero. Delayed acknowledgment is the normal mode of operation for Linux TCP. With delayed acknowledgment, ACKs are delayed until they can be combined with a segment waiting to be sent in the reverse direction. If the value of this option is one, the pingpong field in the ack part of tcp_opt is set to zero, which disables delayed acknowledgment. The TCP_QUICKACK option only temporarily affects the behavior of the TCP protocol. If delayed acknowledgment mode is disabled, it could eventually be "automatically" re-enabled depending on the acknowledgment timeout processing and other factors.

TCP_SYNCNT: The caller may use this option to specify the number of SYN retransmits that should be sent before aborting an attempt to establish a connection. This option is unique to Linux and should not be used for portable code. The value is stored in the syn_retries field of the tcp_opt structure.

TCP_WINDOW_CLAMP: By setting this option, the caller may specify the maximum advertised window size for this socket. The minimum allowed for the advertised window is the value SOCK_MIN_RCVBUF divided by two, which is 128 bytes. The value of this option is held in the window_clamp field of tcp_opt for this socket.
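Clamping the advertised window follows the same setsockopt pattern as the other options. The following is a minimal sketch, assuming a Linux system; the helper names tcp_set_window_clamp and tcp_get_window_clamp are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: bound the advertised receive window, in bytes.
 * Returns 0 on success, -1 on error. */
int tcp_set_window_clamp(int fd, int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                      &bytes, sizeof(bytes));
}

/* Hypothetical helper: read the clamp actually in effect, which may be
 * larger than requested if the request was below the kernel's minimum.
 * Returns the clamp in bytes, or -1 on error. */
int tcp_get_window_clamp(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &bytes, &len) < 0)
        return -1;
    return bytes;
}
```

Reading the value back after setting it is a simple way to see whether the kernel raised a small request to its minimum.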
