23.8. Data Structures Featured in This Part of the Book
The section "Main IPv4 Data Structures" in Chapter 19 gave a brief overview of the main data structures. This section has a detailed description of each data structure type. Figure 23-3 shows the file that defines each data structure.
23.8.1. iphdr Structure
The meaning of its fields has already been covered in the section "IP Header" in Chapter 18.
23.8.2. ip_options Structure
This structure represents the options for a packet that needs to be transmitted or forwarded. The options are stored in this structure because it is easier to read than the corresponding portion of the IP header itself.

Let's go field by field. They should be fairly simple to understand if you have read the section "IP Options" in Chapter 18. After this description, you will be able to understand more easily how the parsing is done and how its results are used by the IP layer subsystems, such as the code that processes incoming IP packets. Some of the bit fields are grouped together into an unsigned char; the declarations of these end with :1.
unsigned char optlen
Length of the set of options. As explained in Chapter 18, this is limited to a maximum of 40 bytes by the definition of the IP header.
unsigned char is_changed:1
Set if the IP header has been modified (such as an IP address or a timestamp). This is useful to know because if the packet has to be forwarded, this field indicates that the IP checksum has to be recomputed.
__u32 faddr
unsigned char is_strictroute:1
unsigned char srr
unsigned char srr_is_hit:1
faddr is meaningful only for transmitted packets (that is, those generated locally) and only for those using source routing. The value of faddr is set to the first of the IP addresses provided for source routing. See the section "Option: Strict and Loose Source Routing" in Chapter 19.
is_strictroute is a flag set to true when Strict Source Route is among the options.
srr contains the offset of the Source Route option in the header. If the option is not used, the value is zero.
srr_is_hit is true if the packet was source routed and the IP address of the receiving interface is one of the addresses in the source route list (see ip_options_rcv_srr in net/ipv4/ip_options.c).
unsigned char rr
When rr is nonzero, Record Route is one of the IP options and the value of this field represents the offset inside the IP header where the option starts. This field is used together with rr_needaddr.
unsigned char rr_needaddr:1
When rr_needaddr is true, Record Route is one of the IP options and there is still room in the header for another route; therefore, the current node should copy the IP address of the outgoing interface into the IP header at the offset specified by rr.
unsigned char ts
When ts is nonzero, Timestamp is one of the IP options and this field represents the offset inside the IP header where the option starts. This field is used together with ts_needaddr and ts_needtime.
unsigned char is_setbyuser:1
This field makes sense only for transmitted packets and is set when the options were passed from user space with the system call setsockopt. Currently, however, it is never used.
unsigned char is_data:1
unsigned char _data[0]
These fields are used in two situations: when the local node transmits a locally generated packet, and when the local node replies to an ICMP echo request. In these cases, is_data is true and _data refers to an area containing the options to append to the IP header. The [0] definition (a zero-length array) is a common C convention for marking the start of a variable-size area that is allocated immediately after the structure; it takes up no space in the structure itself.
When forwarding a packet, the options are in the associated skb buffer (see the ip_options_get function in the net/ipv4/ip_options.c file).
unsigned char ts_needtime:1
When this flag is true, Timestamp is one of the IP options and there is still room in the header for another timestamp; therefore, the current node should add the time of transmission into the IP header at the offset specified by ts.
unsigned char ts_needaddr:1
Used with ts and ts_needtime to indicate that the IP address of the egress device should also be copied into the IP header.
unsigned char router_alert
When this field is nonzero, Router Alert is one of the IP options.
unsigned char __pad1, __pad2
Because memory accesses are faster when the location is aligned to a 32-bit boundary, the Linux kernel data structures are often padded out with unused fields called __padn in order to make their sizes a multiple of 32 bits. This is the only purpose of __pad1 and __pad2; they are not used otherwise.
The fields srr, rr, and ts are also useful when parsing the options, because a nonzero value shows that the option has already been seen. This makes it possible to detect options that are present more than once, which is illegal (see the section "Option Parsing" in Chapter 19).
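The following user-space sketch illustrates two points made above: a structure that ends with a variable-size data area, and the use of the srr, rr, and ts offsets to reject duplicated options. It is not the kernel code. The names opts_sketch, opts_alloc, and opts_parse are hypothetical, the structure keeps only a few of the real fields, and a 20-byte basic IP header is assumed when computing offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for ip_options; NOT the kernel layout. */
struct opts_sketch {
    unsigned char optlen;      /* length of the options, at most 40 bytes */
    unsigned char srr, rr, ts; /* offset of each option in the header, 0 if absent */
    unsigned char is_data:1;   /* set when data[] holds a copy of the options */
    unsigned char data[];      /* variable-size area right after the structure */
};

enum {
    IP_HDR_LEN = 20,  /* assumption: basic header without options */
    OPT_END    = 0,
    OPT_NOOP   = 1,
    OPT_RR     = 7,   /* Record Route */
    OPT_TS     = 68,  /* Timestamp */
    OPT_LSRR   = 131, /* Loose Source Route */
    OPT_SSRR   = 137  /* Strict Source Route */
};

/* Allocate the structure and its trailing option bytes in one block. */
struct opts_sketch *opts_alloc(const unsigned char *raw, size_t len)
{
    struct opts_sketch *o;

    if (len > 40 || (len && !raw))
        return NULL;
    o = calloc(1, sizeof(*o) + len);
    if (!o)
        return NULL;
    o->optlen = (unsigned char)len;
    o->is_data = 1;
    if (len)
        memcpy(o->data, raw, len);
    return o;
}

/* Record where each option starts; return -1 if one is malformed or repeated. */
int opts_parse(struct opts_sketch *o)
{
    size_t i = 0;

    assert(o != NULL);
    while (i < o->optlen) {
        unsigned char type = o->data[i];
        unsigned char *slot = NULL;
        size_t len;

        if (type == OPT_END)
            break;
        if (type == OPT_NOOP) {
            i++;
            continue;
        }
        if (i + 1 >= o->optlen)
            return -1;
        len = o->data[i + 1];
        if (len < 2 || i + len > o->optlen)
            return -1;
        if (type == OPT_SSRR || type == OPT_LSRR)
            slot = &o->srr;
        else if (type == OPT_RR)
            slot = &o->rr;
        else if (type == OPT_TS)
            slot = &o->ts;
        if (slot) {
            if (*slot)
                return -1; /* nonzero offset: this option was already seen */
            *slot = (unsigned char)(IP_HDR_LEN + i);
        }
        i += len;
    }
    return 0;
}
```

Because the offsets are measured from the start of the IP header, an option can never legitimately start at offset zero, which is why zero can safely mean "not present."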
23.8.3. ipcm_cookie Structure
This structure combines various pieces of information needed to transmit a packet.
struct ipcm_cookie
{
    u32 addr;
    int oif;
    struct ip_options *opt;
};
The destination IP address is addr, the egress device is oif if defined, and the IP options are in an ip_options structure. Note that addr is the only field that is always set. oif is 0 if there are no constraints on which device to use.
23.8.4. ipq Structure
Here is the description of the fields of the ipq structure. For the sake of simplicity, not all fields are shown in Figure 22-1 in Chapter 22.
struct ipq *next
When the fragments are put into the ipq_hash hash table, conflicting elements (elements with the same hash value) are linked together with this field. Note that this field does not indicate the order of fragments within the packet; it is used simply as a standard way to organize the hash table. The order of fragments within the packet is controlled by the fragments field (see Figure 22-1 in Chapter 22).
struct ipq **pprev
Pointer back to the head of the list of IP packets that have the same hash value.
struct list_head lru_list
All of the ipq structures are kept sorted in a global list, ipq_lru_list, based on a least-recently-used criterion. This list is useful when performing garbage collection. This field is used to link the ipq structure to such a list.
u32 user
The reason why an IP packet is to be defragmented, which indirectly says what kernel subsystem asked for the defragmentation. The list of allowed values for IP_DEFRAG_XXX is in include/net/ip.h. The most common one is IP_DEFRAG_LOCAL_DELIVER, which is used when defragmenting ingress packets that are to be delivered locally.
u32 saddr
u32 daddr
u16 id
u8 protocol
These parameters represent the source IP address, destination IP address, IP packet ID, and L4 protocol identifier, respectively. As described in Chapter 18, these four parameters identify the original IP packet a fragment belongs to. For that reason, they are also the parameters used by the hash function to optimally spread elements throughout the hash table.
u8 last_in
Stores three flags, whose possible values are:
COMPLETE
All of the fragments have been received and can therefore be joined together to obtain the original IP packet. This flag can also be used to mark those ipq structures that have been chosen for deletion (see ipq_kill in net/ipv4/ip_fragment.c).
FIRST_IN
The first of the fragments (the one with offset=0) has been received. The first fragment is the only one carrying all of the options that were in the original IP packet.
LAST_IN
The last of the fragments (the one with MF=0) has been received. The last fragment is important because it is the one that tells us the size of the original IP packet.
struct sk_buff *fragments
List of fragments received so far.
int len
Offset where the fragment with the biggest offset ends. When the last fragment is received (the one with MF=0), len will tell the size of the original IP packet.
int meat
Represents how many bytes of the original packet we have received so far. When its value is the same as len, the packet has been completely received.
spinlock_t lock
Protects the structure from race conditions. It could happen, for instance, that different IP fragments are received at the same time by different NICs handled by different CPUs.
atomic_t refcnt
Counter used to keep track of external references to this packet. As an example of its purpose, refcnt is incremented when the timer field is armed, to make sure that no one is going to free the ipq structure while the timer is still pending; otherwise, the timer might expire and try to access a data structure that does not exist anymore. You can imagine the consequences.
struct timer_list timer
Chapter 18 explained why IP fragments cannot stay forever in memory and should be removed after some time if defragmentation is not possible. This field is the timer that takes care of that.
int iif
ID of the device from which the last fragment was received. When a list of fragments expires, this field is used to decide which device to use to transmit the FRAGMENTATION REASSEMBLY TIMEOUT ICMP message (see ip_expire in the net/ipv4/ip_fragment.c file).
struct timeval stamp
Time when the last fragment was received (see ip_frag_queue in net/ipv4/ip_fragment.c).
The ipq_hash table is protected by ipfrag_lock, which can be taken in either shared (read-only) or exclusive (read-write) mode. Do not confuse this lock with the one embedded in each ipq element.
23.8.5. inet_peer Structure
The kernel keeps an instance of this structure for each remote host it has been talking to in the recent past. In the section "Long-Living IP Peer Information," you saw how it is used. All instances of inet_peer structures are kept in an AVL tree, a structure optimized for frequent lookups. The functions used to manipulate inet_peer instances are in net/ipv4/inetpeer.c.
struct inet_peer *avl_left
struct inet_peer *avl_right
Left and right pointers to the two subtrees.
__u16 avl_height
Height of the AVL tree.
struct inet_peer *unused_next
struct inet_peer **unused_prevp
Used to link the node into a list that contains elements that expired. unused_prevp is used to check whether the node is in that list.
A node can be put into that list and then taken back out of it several times without ever being removed completely. See the section "Garbage Collection."
unsigned long dtime
Time when this element was added to the unused list inet_peer_unused_head via inet_putpeer.
atomic_t refcnt
Reference count for the element. Among the users of this structure are the routing subsystem and the TCP layer.
__u32 v4daddr
IP address of the remote peer.
__u16 ip_id_count
IP packet ID to use next for this peer (see inet_getid in include/net/inetpeer.h).
__u32 tcp_ts
unsigned long tcp_ts_stamp
Used by TCP to manage timestamps.
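The two most visible uses of these fields, finding a peer by address and handing out the next IP packet ID, can be sketched in user space as follows. This is not the code in net/ipv4/inetpeer.c: the names peer_sketch, peer_lookup, and peer_next_id are hypothetical, rebalancing and locking are left out, and the ID is simply incremented by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in for inet_peer: only the fields needed by the sketch. */
struct peer_sketch {
    struct peer_sketch *avl_left, *avl_right; /* the two subtrees */
    uint32_t v4daddr;                         /* key: address of the remote peer */
    uint16_t ip_id_count;                     /* next IP packet ID for this peer */
};

/* Ordinary binary-search descent; an AVL tree keeps this path short. */
struct peer_sketch *peer_lookup(struct peer_sketch *root, uint32_t daddr)
{
    while (root && root->v4daddr != daddr)
        root = daddr < root->v4daddr ? root->avl_left : root->avl_right;
    return root;
}

/* Return the ID to use now and advance the per-peer counter (wraps at 16 bits). */
uint16_t peer_next_id(struct peer_sketch *p)
{
    assert(p != NULL);
    return p->ip_id_count++;
}
```

Keeping the counter in the per-peer structure means that the IDs sent to one remote host advance independently of the traffic sent to any other host.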
23.8.6. ipstats_mib Structure
The SNMP protocol employs a type of object called an MIB to collect statistics about systems. A data structure called ipstats_mib keeps statistics on the IP layer. The section "IP Statistics" covered this structure in more detail.
23.8.7. in_device Structure
The in_device structure stores all of the IPv4-related configuration for a network device, such as changes made by a user with the ifconfig or ip command. This structure is linked to the net_device structure via net_device->ip_ptr and can be retrieved with in_dev_get and __in_dev_get. The difference between those two functions is that the first one takes care of all of the necessary locking, and the second one assumes the caller has taken care of it already.
Since in_dev_get internally increases a reference count on the in_device structure when it succeeds (i.e., when a device is configured to support IPv4), its caller is supposed to decrement the reference count with in_dev_put when it is done with the structure.
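The get/put discipline just described can be sketched in user space as follows. The names indev_sketch, netdev_sketch, indev_get, and indev_put are hypothetical stand-ins, the counter is a plain int rather than an atomic_t, and no locking is done, so this shows only the calling convention and not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for in_device and net_device, reduced to what the sketch needs. */
struct indev_sketch {
    int refcnt; /* the kernel uses atomic_t; a plain int keeps the sketch short */
    int dead;
};

struct netdev_sketch {
    struct indev_sketch *ip_ptr; /* NULL when the device has no IPv4 configuration */
};

/* Return the IPv4 configuration with a reference taken, or NULL if there is none. */
struct indev_sketch *indev_get(struct netdev_sketch *dev)
{
    struct indev_sketch *in_dev;

    assert(dev != NULL);
    in_dev = dev->ip_ptr;
    if (in_dev)
        in_dev->refcnt++;
    return in_dev;
}

/* Release a reference obtained with indev_get; returns 1 if it was the last one. */
int indev_put(struct indev_sketch *in_dev)
{
    assert(in_dev != NULL && in_dev->refcnt > 0);
    return --in_dev->refcnt == 0;
}
```

A caller pairs the two: it checks the pointer returned by indev_get, uses the structure, and calls indev_put exactly once on the success path. When the lookup returns NULL, no reference was taken and nothing has to be released.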
The structure is allocated and linked to the device with inetdev_init, which is called when the first IPv4 address is configured on the device. Here are the meanings of its fields:
struct net_device *dev
Pointer back to the associated net_device structure.
atomic_t refcnt
Reference count. The structure cannot be freed until this field is 0.
int dead
This field is set to mark the device as dead. This is useful to detect those cases where the entry cannot be destroyed because it has a nonzero reference count, but a destroy action has been initiated. The two most common events that trigger the removal of an in_device structure are the unregistration of the device and the removal of the last configured IP address from the device.
struct in_ifaddr *ifa_list
List of IPv4 addresses configured on the device. The in_ifaddr instances are kept sorted by scope (bigger scope first), and elements with the same scope are kept sorted by address type (primary first). The in_ifaddr data structure is further described in the section "in_ifaddr Structure."
struct neigh_parms *arp_parms
The meaning of this field is described in detail in Part VI.
struct ipv4_devconf cnf
See the section "ipv4_devconf Structure."
struct rcu_head rcu_head
Used by the RCU mechanism, which lets readers access the structure without taking a lock. The rcu_head field is what allows the release of the structure to be deferred until no reader can still be using it.
The rest of the fields are used by the multicast code. For instance, mc_list stores the device's multicast configuration and it is the multicast counterpart of ifa_list. mr_v1_seen and mr_v2_seen are timestamps used by the IGMP protocol to keep track of the reception of IGMP version 1 and version 2 packets.
23.8.8. in_ifaddr Structure
When configuring an IPv4 address on an interface, the kernel creates an in_ifaddr structure that includes the 4-byte address along with several other fields. Here are their meanings:
struct in_ifaddr *ifa_next
Pointer to the next element in the list. The list contains all of the addresses configured on the device.
struct in_device *ifa_dev
Pointer back to the associated in_device structure.
u32 ifa_local
u32 ifa_address
The values of these two fields depend on whether the address is assigned to a tunnel interface. If so, ifa_local and ifa_address are the local and remote addresses of the tunnel, respectively. If not, both contain the address of the local interface.
u32 ifa_mask
unsigned char ifa_prefixlen
ifa_mask is the netmask associated with the address. ifa_prefixlen is the number of 1s that compose the netmask. Since they are different ways of representing the same information, one of the two is normally computed from the other. This is done, for instance, by the ip and ifconfig user-space configuration tools described in the section "IP Configuration." ip passes the kernel ifa_prefixlen and lets the latter compute ifa_mask, whereas ifconfig does the opposite. The kernel provides some functions to convert a netmask into a prefix length, and vice versa.
u32 ifa_broadcast
Broadcast address.
u32 ifa_anycast
Anycast address.
unsigned char ifa_scope
Scope of the address. The default is RT_SCOPE_UNIVERSE (which corresponds to the value 0) and the field is usually set to that value by ifconfig/ip, although a different value can be chosen. The main exception is an address in the range 127.x.x.x, which is given the RT_SCOPE_HOST scope. See Chapter 30 for more details.
unsigned char ifa_flags
The possible IFA_F_XXX bit flags are listed in include/linux/rtnetlink.h. Here is the one used by IPv4:
IFA_F_SECONDARY
When a new address is added to a device that already has another address with the same subnet, it is tagged as secondary.
The other flags are used by IPv6.
char ifa_label[IFNAMSIZ]
A string used mostly for backward compatibility with 2.0.x kernels that allowed aliased interfaces with names such as eth0:1.
struct rcu_head rcu_head
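Used by the RCU mechanism, which lets readers access the structure without taking a lock. The rcu_head field is what allows the release of the structure to be deferred until no reader can still be using it.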
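Since ifa_mask and ifa_prefixlen carry the same information, converting between them is simple bit manipulation. The sketch below shows one way to do it in user space; the function names prefixlen_to_mask and mask_to_prefixlen are hypothetical, and the mask is handled in host byte order, whereas the kernel stores addresses and masks in network byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Build a netmask (host byte order) with prefixlen leading 1 bits. */
uint32_t prefixlen_to_mask(unsigned int prefixlen)
{
    assert(prefixlen <= 32);
    /* A shift by 32 would be undefined, so a zero-length prefix is special-cased. */
    return prefixlen ? (uint32_t)(0xFFFFFFFFu << (32 - prefixlen)) : 0;
}

/* Count the leading 1 bits of a netmask given in host byte order. */
unsigned int mask_to_prefixlen(uint32_t mask)
{
    unsigned int len = 0;

    while (mask & 0x80000000u) {
        len++;
        mask <<= 1;
    }
    return len;
}
```

For example, a prefix length of 24 corresponds to the mask 0xFFFFFF00 (255.255.255.0), and converting that mask back yields 24 again.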
23.8.9. ipv4_devconf Structure
The ipv4_devconf data structure, whose fields are exported via /proc in /proc/sys/net/ipv4/conf/, is used to tune the behavior of a network device. There is an instance for each device, plus one that stores the default values (ipv4_devconf_dflt). The meanings of its fields are covered in Chapters 29 and 36, with the exception of promote_secondaries, which is described in the section "Main Functions That Manipulate IP Addresses and Configuration."
23.8.10. ipv4_config Structure
While ipv4_devconf structures are used to store per-device configuration, ipv4_config stores configuration that applies to the host.
Here is a brief description of its fields:
int log_martians
This parameter is also present in the ipv4_devconf structure. It is used to decide whether to print warning messages to the console when specific errors occur. Its value is not checked directly, but via the macro IN_DEV_LOG_MARTIANS, which gives higher priority to the per-device instance.
int autoconfig
Not used.
int no_pmtu_disc
Used to initialize the variable inet_sock->pmtudisc that stores the PMTU configuration for a socket. See Chapter 18 for more details on path MTU discovery.
23.8.11. cork Structure
The cork structure, defined in include/linux/ip.h inside the definition of inet_sock, is used to handle the socket cork option (UDP_CORK for UDP, TCP_CORK for TCP). We saw in Chapter 21 how its fields are used to maintain some context information across consecutive invocations of ip_append_data and ip_append_page to handle data fragmentation.
Here is a brief description of its fields:
unsigned int flags
Currently only one flag used by IPv4 can be set: IPCORK_OPT. When this flag is set, it means there are options in opt.
unsigned int fragsize
Size of the data fragments generated. This includes both payload and L3 header and is normally the PMTU.
struct ip_options *opt
IP options to use.
struct rtable *rt
Routing table cache entry that will be used to transmit the IP packet.
int length
Size of the IP packet (sum of all the data fragments, not including IP headers).
u32 addr
Destination IP address.
struct flowi fl
Collection of information about the two ends of the connection. More details are in Chapter 36.
23.8.12. skb_frag_t Structure
We saw in Chapter 21 what a paged buffer looks like (see, for example, Figure 21-5 in that chapter). skb_frag_t includes the fields necessary to identify a data block on a memory page:
struct page *page
Pointer to the memory page. On i386, the page size is 4 KB. To find the size of a page on any given architecture xxx, look for PAGE_SIZE in include/asm-xxx/page.h.
__u16 page_offset
Offset, relative to the beginning of the page, where the fragment starts.
__u16 size
Size of the fragment.
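The three fields are enough to locate a data block, as the following user-space sketch shows. It makes two loud assumptions: the page field is modeled as the address of the page's first byte (in the kernel, struct page is a descriptor, not the data itself), and the page size is fixed at 4 KB as on i386. The names frag_sketch, frag_data, and PAGE_SIZE_SKETCH are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE_SKETCH = 4096 }; /* assumption: 4 KB pages, as on i386 */

/* Stand-in for skb_frag_t. */
struct frag_sketch {
    unsigned char *page;  /* sketch only: base address of the page's data */
    uint16_t page_offset; /* where the fragment starts within the page */
    uint16_t size;        /* size of the fragment */
};

/*
 * Return a pointer to the first byte of the fragment, or NULL if the
 * described block would run past the end of the page.
 */
unsigned char *frag_data(const struct frag_sketch *f)
{
    assert(f != NULL && f->page != NULL);
    if ((uint32_t)f->page_offset + f->size > PAGE_SIZE_SKETCH)
        return NULL;
    return f->page + f->page_offset;
}
```

With 4 KB pages, 16 bits are more than enough for both the offset and the size, since neither can exceed the page size.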