Traffic Redirection

In a service mesh, iptables and eBPF are two common ways of intercepting traffic.

iptables is a traffic interception tool based on the Linux kernel. It can control traffic by filtering rules. Its advantages include:

  • Universality: The iptables tool has been widely used in Linux operating systems, so most Linux users are familiar with its usage.
  • Stability: iptables has long been part of the Linux kernel, so it has a high degree of stability.
  • Flexibility: iptables can be flexibly configured according to needs to control network traffic.

However, iptables also has some disadvantages:

  • Difficult to debug: Due to the complexity of the iptables tool itself, it is relatively difficult to debug.
  • Performance issues: Unpredictable latency and reduced performance as the number of services grows.
  • Issues with handling complex traffic: When it comes to handling complex traffic, iptables may not be suitable because its rule processing is not flexible enough.

eBPF is an advanced traffic interception tool that can intercept and analyze traffic in the Linux kernel through custom programs. The advantages of eBPF include:

  • Flexibility: eBPF can use custom programs to intercept and analyze traffic, so it has higher flexibility.
  • Scalability: eBPF can dynamically load and unload programs, so it has higher scalability.
  • Efficiency: eBPF can perform processing in the kernel space, so it has higher performance.

However, eBPF also has some disadvantages:

  • Higher learning curve: eBPF is relatively new compared to iptables, so it takes more effort to learn.
  • Complexity: Developing custom eBPF programs may be more complex.

Overall, iptables is more suitable for simple traffic filtering and management, while eBPF is more suitable for complex traffic interception and analysis scenarios that require higher flexibility and performance.

1 - Iptables Redirection

Redirect traffic to sidecar proxy with iptables.

FSM leverages iptables to intercept and redirect traffic to and from pods participating in the service mesh to the Pipy proxy sidecar container running on each pod. Traffic redirected to the Pipy proxy sidecar is filtered and routed based on service mesh traffic policies.

For more details on the comparison between iptables and eBPF, refer to Traffic Redirection.

How it works

FSM sidecar injector service fsm-injector injects a Pipy proxy sidecar on every pod created within the service mesh. Along with the Pipy proxy sidecar, fsm-injector also injects an init container, a specialized container that runs before any application containers in a pod. The injected init container is responsible for bootstrapping the application pods with traffic redirection rules such that all outbound TCP traffic from a pod and all inbound TCP traffic to a pod are redirected to the Pipy proxy sidecar running on that pod. This redirection is set up by the init container by running a set of iptables commands.
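To confirm the injection, you can list the containers of a meshed pod; the commands below are illustrative and assume a meshed pod named <pod> in namespace <namespace>:

# List the init containers injected into the pod
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'

# List the application and sidecar containers in the pod
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[*].name}'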

Ports reserved for traffic redirection

FSM reserves a set of port numbers to perform traffic redirection and provide admin access to the Pipy proxy sidecar. It is essential to note that these port numbers must not be used by application containers running in the mesh. Using any of these reserved port numbers will lead to the Pipy proxy sidecar not functioning correctly.

Following are the port numbers that are reserved for use by FSM:

  1. 15000: used by the Pipy admin interface exposed over localhost to return current configuration files.
  2. 15001: used by the Pipy outbound listener to accept and proxy outbound traffic sent by applications within the pod
  3. 15003: used by the Pipy inbound listener to accept and proxy inbound traffic entering the pod destined to applications within the pod
  4. 15010: used by the Pipy inbound Prometheus listener to accept and proxy inbound traffic pertaining to scraping Pipy’s Prometheus metrics
  5. 15901: used by Pipy to serve rewritten HTTP liveness probes
  6. 15902: used by Pipy to serve rewritten HTTP readiness probes
  7. 15903: used by Pipy to serve rewritten HTTP startup probes

The following are the port numbers that are reserved for use by FSM and allow traffic to bypass Pipy:

  1. 15904: used by fsm-healthcheck to serve tcpSocket health probes rewritten to httpGet health probes

Application User ID (UID) reserved for traffic redirection

FSM reserves the user ID (UID) value 1500 for the Pipy proxy sidecar container. This user ID is critical while performing traffic interception and redirection to ensure the redirection does not result in a loop. The user ID value 1500 is used to program redirection rules so that traffic already redirected through Pipy is not redirected back to itself.

Application containers must not use the reserved user ID value of 1500.
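As an illustration of how this loop prevention is typically expressed (a simplified sketch, not FSM's exact rules), an iptables owner match lets traffic generated by UID 1500 leave the outbound interception chain (PROXY_OUTPUT, described in the next section) untouched:

# Sketch only: skip redirection for traffic originating from the proxy's UID 1500
iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN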

Types of traffic intercepted

Currently, FSM programs the Pipy proxy sidecar on each pod to only intercept inbound and outbound TCP traffic. This includes raw TCP traffic and any application traffic that uses TCP as the underlying transport protocol, such as HTTP, gRPC, etc. This implies that UDP and ICMP traffic, which can be intercepted by iptables, is not intercepted and redirected to the Pipy proxy sidecar.

Iptables chains and rules

FSM’s fsm-injector service programs the init container to set up a set of iptables chains and rules to perform traffic interception and redirection. The following section provides details on the responsibility of these chains and rules.

FSM leverages four chains to perform traffic interception and redirection:

  1. PROXY_INBOUND: chain to intercept inbound traffic entering the pod
  2. PROXY_IN_REDIRECT: chain to redirect intercepted inbound traffic to the sidecar proxy’s inbound listener
  3. PROXY_OUTPUT: chain to intercept outbound traffic from applications within the pod
  4. PROXY_REDIRECT: chain to redirect intercepted outbound traffic to the sidecar proxy’s outbound listener

Each of the chains above is programmed with rules to intercept and redirect application traffic via the Pipy proxy sidecar.
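The following is a simplified sketch of how such chains are typically wired together with the reserved ports and UID described earlier; it is illustrative only and not the exact rule set programmed by the init container:

# Create the four interception/redirection chains in the nat table
iptables -t nat -N PROXY_INBOUND
iptables -t nat -N PROXY_IN_REDIRECT
iptables -t nat -N PROXY_OUTPUT
iptables -t nat -N PROXY_REDIRECT

# Redirect intercepted traffic to the sidecar listeners (inbound 15003, outbound 15001)
iptables -t nat -A PROXY_IN_REDIRECT -p tcp -j REDIRECT --to-port 15003
iptables -t nat -A PROXY_REDIRECT -p tcp -j REDIRECT --to-port 15001

# Hook inbound interception into PREROUTING and outbound interception into OUTPUT
iptables -t nat -A PREROUTING -p tcp -j PROXY_INBOUND
iptables -t nat -A PROXY_INBOUND -p tcp -j PROXY_IN_REDIRECT
iptables -t nat -A OUTPUT -p tcp -j PROXY_OUTPUT
iptables -t nat -A PROXY_OUTPUT -m owner --uid-owner 1500 -j RETURN
iptables -t nat -A PROXY_OUTPUT -p tcp -j PROXY_REDIRECT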

Outbound IP range exclusions

Outbound TCP based traffic from applications is by default intercepted using the iptables rules programmed by FSM, and redirected to the Pipy proxy sidecar. In some cases, it might be desirable to exempt certain IP ranges from being redirected and routed by the Pipy proxy sidecar based on service mesh policies. A common use case for excluding IP ranges is to avoid routing non-application traffic via the Pipy proxy, such as traffic destined to the Kubernetes API server or to a cloud provider's instance metadata service. In such scenarios, excluding certain IP ranges from being subject to service mesh traffic routing policies becomes necessary.

Outbound IP ranges can be excluded at a global mesh scope or per pod scope.

1. Global outbound IP range exclusions

FSM provides the means to specify a global list of IP ranges to exclude from outbound traffic interception applicable to all pods in the mesh, as follows:

  1. During FSM install using the --set option:

    # To exclude the IP ranges 1.1.1.1/32 and 2.2.2.2/24 from outbound interception
    fsm install --set=fsm.outboundIPRangeExclusionList="{1.1.1.1/32,2.2.2.2/24}"
    
  2. By setting the outboundIPRangeExclusionList field in the fsm-mesh-config resource:

    ## Assumes FSM is installed in the fsm-system namespace
    kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundIPRangeExclusionList":["1.1.1.1/32", "2.2.2.2/24"]}}}'  --type=merge
    

    When IP ranges are set for exclusion post-install, make sure to restart the pods in monitored namespaces for this change to take effect.

Globally excluded IP ranges are stored in the fsm-mesh-config MeshConfig custom resource and are read at the time of sidecar injection by fsm-injector. These dynamically configurable IP ranges are programmed by the init container along with the static rules used to intercept and redirect traffic via the Pipy proxy sidecar. Excluded IP ranges will not be intercepted for traffic redirection to the Pipy proxy sidecar. Refer to the outbound IP range exclusion demo to learn more.
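To check the global exclusions currently in effect, you can read the field back from the MeshConfig resource (assuming FSM is installed in the fsm-system namespace):

# Print the globally excluded outbound IP ranges
kubectl get meshconfig fsm-mesh-config -n fsm-system -o jsonpath='{.spec.traffic.outboundIPRangeExclusionList}'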

2. Pod scoped outbound IP range exclusions

Outbound IP range exclusions can be configured at pod scope by annotating the pod to specify a comma separated list of IP CIDR ranges as flomesh.io/outbound-ip-range-exclusion-list=<comma separated list of IP CIDRs>.

# To exclude the IP ranges 10.244.0.0/16 and 10.96.0.0/16 from outbound interception on the pod
kubectl annotate pod <pod> flomesh.io/outbound-ip-range-exclusion-list="10.244.0.0/16,10.96.0.0/16"

When IP ranges are annotated post pod creation, make sure to restart the corresponding pods for this change to take effect.
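One common approach is to set the annotation on the workload's pod template instead, so that it is present when new pods are admitted and survives pod restarts. The example below is illustrative; the Deployment name and namespace are placeholders, and patching the template also triggers a rolling restart:

# Add the exclusion annotation to the Deployment's pod template
kubectl patch deployment <deployment> -n <namespace> --type=merge \
  -p '{"spec":{"template":{"metadata":{"annotations":{"flomesh.io/outbound-ip-range-exclusion-list":"10.244.0.0/16,10.96.0.0/16"}}}}}'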

Outbound IP range inclusions

Outbound TCP based traffic from applications is by default intercepted using the iptables rules programmed by FSM, and redirected to the Pipy proxy sidecar. In some cases, it might be desirable to redirect and route only certain IP ranges via the Pipy proxy sidecar based on service mesh policies, and leave the remaining traffic unproxied. In such scenarios, inclusion IP ranges can be specified.

Outbound inclusion IP ranges can be specified at a global mesh scope or per pod scope.

1. Global outbound IP range inclusions

FSM provides the means to specify a global list of IP ranges to include for outbound traffic interception applicable to all pods in the mesh, as follows:

  1. During FSM install using the --set option:

    # To include the IP ranges 1.1.1.1/32 and 2.2.2.2/24 for outbound interception
    fsm install --set=fsm.outboundIPRangeInclusionList="[1.1.1.1/32,2.2.2.2/24]"
    
  2. By setting the outboundIPRangeInclusionList field in the fsm-mesh-config resource:

    ## Assumes FSM is installed in the fsm-system namespace
    kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundIPRangeInclusionList":["1.1.1.1/32", "2.2.2.2/24"]}}}'  --type=merge
    

    When IP ranges are set for inclusion post-install, make sure to restart the pods in monitored namespaces for this change to take effect.

Globally included IP ranges are stored in the fsm-mesh-config MeshConfig custom resource and are read at the time of sidecar injection by fsm-injector. These dynamically configurable IP ranges are programmed by the init container along with the static rules used to intercept and redirect traffic via the Pipy proxy sidecar. IP addresses outside the specified inclusion IP ranges will not be intercepted for traffic redirection to the Pipy proxy sidecar.

2. Pod scoped outbound IP range inclusions

Outbound IP range inclusions can be configured at pod scope by annotating the pod to specify a comma separated list of IP CIDR ranges as flomesh.io/outbound-ip-range-inclusion-list=<comma separated list of IP CIDRs>.

# To include the IP ranges 10.244.0.0/16 and 10.96.0.0/16 for outbound interception on the pod
kubectl annotate pod <pod> flomesh.io/outbound-ip-range-inclusion-list="10.244.0.0/16,10.96.0.0/16"

When IP ranges are annotated post pod creation, make sure to restart the corresponding pods for this change to take effect.

Outbound port exclusions

Outbound TCP based traffic from applications is by default intercepted using the iptables rules programmed by FSM, and redirected to the Pipy proxy sidecar. In some cases, it might be desirable to exempt certain ports from being redirected and routed by the Pipy proxy sidecar based on service mesh policies. A common use case for excluding ports is to avoid routing non-application traffic via the Pipy proxy, such as control plane traffic. In such scenarios, excluding certain ports from being subject to service mesh traffic routing policies becomes necessary.

Outbound ports can be excluded at a global mesh scope or per pod scope.

1. Global outbound port exclusions

FSM provides the means to specify a global list of ports to exclude from outbound traffic interception applicable to all pods in the mesh, as follows:

  1. During FSM install using the --set option:

    # To exclude the ports 6379 and 7070 from outbound sidecar interception
    fsm install --set=fsm.outboundPortExclusionList="{6379,7070}"
    
  2. By setting the outboundPortExclusionList field in the fsm-mesh-config resource:

    ## Assumes FSM is installed in the fsm-system namespace
    kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"outboundPortExclusionList":[6379, 7070]}}}'  --type=merge
    

    When ports are set for exclusion post-install, make sure to restart the pods in monitored namespaces for this change to take effect.

Globally excluded ports are stored in the fsm-mesh-config MeshConfig custom resource and are read at the time of sidecar injection by fsm-injector. These dynamically configurable ports are programmed by the init container along with the static rules used to intercept and redirect traffic via the Pipy proxy sidecar. Excluded ports will not be intercepted for traffic redirection to the Pipy proxy sidecar.

2. Pod scoped outbound port exclusions

Outbound port exclusions can be configured at pod scope by annotating the pod with a comma separated list of ports as flomesh.io/outbound-port-exclusion-list=<comma separated list of ports>:

# To exclude the ports 6379 and 7070 from outbound interception on the pod
kubectl annotate pod <pod> flomesh.io/outbound-port-exclusion-list=6379,7070

When ports are annotated post pod creation, make sure to restart the corresponding pods for this change to take effect.

Inbound port exclusions

Similar to outbound port exclusions described above, inbound traffic on pods can be excluded from being proxied to the sidecar based on the ports the traffic is directed to.

1. Global inbound port exclusions

FSM provides the means to specify a global list of ports to exclude from inbound traffic interception applicable to all pods in the mesh, as follows:

  1. During FSM install using the --set option:

    # To exclude the ports 6379 and 7070 from inbound sidecar interception
    fsm install --set=fsm.inboundPortExclusionList="[6379,7070]"
    
  2. By setting the inboundPortExclusionList field in the fsm-mesh-config resource:

    ## Assumes FSM is installed in the fsm-system namespace
    kubectl patch meshconfig fsm-mesh-config -n fsm-system -p '{"spec":{"traffic":{"inboundPortExclusionList":[6379, 7070]}}}'  --type=merge
    

    When ports are set for exclusion post-install, make sure to restart the pods in monitored namespaces for this change to take effect.

2. Pod scoped inbound port exclusions

Inbound port exclusions can be configured at pod scope by annotating the pod with a comma separated list of ports as flomesh.io/inbound-port-exclusion-list=<comma separated list of ports>:

# To exclude the ports 6379 and 7070 from inbound sidecar interception on the pod
kubectl annotate pod <pod> flomesh.io/inbound-port-exclusion-list=6379,7070

When ports are annotated post pod creation, make sure to restart the corresponding pods for this change to take effect.

2 - eBPF Redirection

Using eBPF for traffic interception and communication.

FSM comes with eBPF functionality and provides users an option to use eBPF instead of the default iptables.

The minimum kernel version is 5.4.
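You can verify the kernel version on each node:

# Print the running kernel version (must be 5.4 or later)
uname -r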

This guide shows how to start using this new functionality and enjoy the benefits of eBPF. If you want to jump directly into the quick start, refer to the eBPF setup quickstart guide.

For more details on the comparison between iptables and eBPF, refer to Traffic Redirection.

Architecture

To provide eBPF features, Flomesh Service Mesh ships the fsm-cni CNI implementation and fsm-interceptor, which runs on each node; fsm-cni is compatible with mainstream CNI plugins.

When kubelet creates a pod on a node, it calls the CNI interface through the container runtime CRI to create the pod’s network namespace. After the pod’s network namespace is created, fsm-cni calls the interface of fsm-interceptor to load the BPF program and attach it to the hook point. In addition, fsm-interceptor also maintains pod information in eBPF Maps.
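To get a sense of what is running on a node, you can list the loaded BPF programs and maps with bpftool (assuming bpftool is installed on the node; the names of the programs and maps loaded by fsm-interceptor will vary):

# List BPF programs currently loaded in the kernel
sudo bpftool prog list

# List BPF maps, including those holding pod and connection information
sudo bpftool map list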

Implementation Principles

Next, we introduce the implementation principles of the two features enabled by eBPF. Note that many processing details are omitted here.

Traffic interception

Outbound traffic

The figure below shows the interception of outbound traffic. A BPF program is attached to the socket connect operation. The program determines whether the current pod is managed by the service mesh, that is, whether it has a sidecar injected, and if so modifies the destination address to 127.0.0.1 and the destination port to the sidecar's outbound port 15003. Modifying the destination alone is not enough: the original destination address and port are also saved in a map, using the socket's cookie as the key.

After the connection with the sidecar is established, the original destination is saved in another map through a program attached to the sock_ops hook, using local address + port and remote address + port as the key. When the sidecar later accesses the target application, it obtains the original destination through the getsockopt operation on the socket: an eBPF program is also attached to getsockopt, and it retrieves the original destination address from the map and returns it.
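The connect, sock_ops, and getsockopt hooks are cgroup-attached program types, so one way to see what is attached on a node is to query the cgroup tree with bpftool (illustrative; assumes cgroup v2 is mounted at /sys/fs/cgroup and bpftool is available):

# Show BPF programs attached to cgroup hooks such as connect4, sock_ops and getsockopt
sudo bpftool cgroup tree /sys/fs/cgroup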

Inbound traffic

For the interception of inbound traffic, the traffic originally intended for the application port is forwarded to the sidecar’s inbound port 15003. There are two cases:

  • In the first case, the requester and the service are located on the same node. After the requester’s sidecar connect operation is intercepted, the destination port is changed to 15003.
  • In the second case, the requester and the service are located on different nodes. When the handshake packet reaches the service's network namespace, it is intercepted by the BPF program attached to the tc (traffic control) ingress hook, and the port is modified to 15003, achieving functionality similar to DNAT (the attached program can be inspected as shown below).
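To see what is attached to a pod interface's tc ingress hook, the filters can be listed with tc (illustrative; run in the relevant network namespace and substitute the actual interface name):

# Show tc ingress filters, including attached BPF programs, on an interface
tc filter show dev <interface> ingress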

Network communication acceleration

In Kubernetes networks, network packets unavoidably undergo multiple kernel network protocol stack processing. eBPF accelerates network communication by bypassing unnecessary kernel network protocol stack processing and directly exchanging data between two sockets that are peers.

The figure in the traffic interception section shows the sending and receiving trajectories of messages. When the program attached to sock_ops discovers that the connection is successfully established, it saves the socket in a map, using local address + port and remote address + port as the key. As the two sockets are peers, their local and remote information is opposite, so when a socket sends a message, it can directly address the peer socket from the map.

This solution also applies to communication between two pods on the same node.

Prerequisites

  • Ubuntu 20.04
  • Kernel 5.15.0-1034
  • 3 × VM (2 CPU, 4 GB memory): master, node1, node2

Install CNI Plugin

Execute the following command on all nodes to download the CNI plugin.

sudo mkdir -p /opt/cni/bin
curl -sSL https://github.com/containernetworking/plugins/releases/download/v1.1.1/cni-plugins-linux-amd64-v1.1.1.tgz | sudo tar -zxf - -C /opt/cni/bin

Master Node

Get the IP address of the master node. (Your machine IP might be different)

export MASTER_IP=10.0.2.6

The Kubernetes cluster uses the k3s distribution, but when installing the cluster, you need to disable the Flannel integrated into k3s and use an independently installed Flannel for validation. This is because k3s doesn't follow the standard Flannel CNI bin directory structure /opt/cni/bin and stores its CNI binaries at /var/lib/rancher/k3s/data/xxx/bin, where xxx is some randomly generated text.

curl -sfL https://get.k3s.io | sh -s - --disable traefik --disable servicelb --flannel-backend=none --advertise-address $MASTER_IP --write-kubeconfig-mode 644 --write-kubeconfig ~/.kube/config

Install Flannel. Note that the default Pod CIDR of Flannel is 10.244.0.0/16, and we will modify it to k3s’s default 10.42.0.0/16.

curl -s https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml | sed 's|10.244.0.0/16|10.42.0.0/16|g' | kubectl apply -f -

Get the access token of the API server for initializing worker nodes.

sudo cat /var/lib/rancher/k3s/server/node-token

Worker Node

Use the IP address of the master node and the token obtained earlier to initialize the node.

export INSTALL_K3S_VERSION=v1.23.8+k3s2
export NODE_TOKEN=K107c1890ae060d191d347504740566f9c506b95ea908ba4795a7a82ea2c816e5dc::server:2757787ec4f9975ab46b5beadda446b7
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 K3S_TOKEN=${NODE_TOKEN} sh -

Download FSM CLI

system=$(uname -s | tr '[:upper:]' '[:lower:]')
arch=$(dpkg --print-architecture)
release=v1.2.3
curl -L https://github.com/flomesh-io/fsm/releases/download/${release}/fsm-${release}-${system}-${arch}.tar.gz | tar -vxzf -
./${system}-${arch}/fsm version
sudo cp ./${system}-${arch}/fsm /usr/local/bin/

Install FSM

export fsm_namespace=fsm-system 
export fsm_mesh_name=fsm 

fsm install \
    --mesh-name "$fsm_mesh_name" \
    --fsm-namespace "$fsm_namespace" \
    --set=fsm.trafficInterceptionMode=ebpf \
    --set=fsm.fsmInterceptor.debug=true \
    --timeout=900s

Deploy Sample Application

#Sample services
kubectl create namespace ebpf
fsm namespace add ebpf

kubectl apply -n ebpf -f https://raw.githubusercontent.com/flomesh-io/fsm-docs/main/manifests/samples/interceptor/curl.yaml
kubectl apply -n ebpf -f https://raw.githubusercontent.com/flomesh-io/fsm-docs/main/manifests/samples/interceptor/pipy-ok.yaml

#Schedule Pods to Different Nodes
kubectl patch deployments curl -n ebpf -p '{"spec":{"template":{"spec":{"nodeName":"node1"}}}}'
kubectl patch deployments pipy-ok-v1 -n ebpf -p '{"spec":{"template":{"spec":{"nodeName":"node1"}}}}'
kubectl patch deployments pipy-ok-v2 -n ebpf -p '{"spec":{"template":{"spec":{"nodeName":"node2"}}}}'

sleep 5

#Wait for dependent Pods to start successfully
kubectl wait --for=condition=ready pod -n ebpf -l app=curl --timeout=180s
kubectl wait --for=condition=ready pod -n ebpf -l app=pipy-ok -l version=v1 --timeout=180s
kubectl wait --for=condition=ready pod -n ebpf -l app=pipy-ok -l version=v2 --timeout=180s

Testing

During testing, you can view the debug logs of BPF program execution by reading the kernel tracing logs on the worker nodes using the following command. To avoid interference caused by sidecar communication with the control plane, first obtain the IP address of the control plane.

kubectl get svc -n fsm-system fsm-controller -o jsonpath='{.spec.clusterIP}'
10.43.241.189

Execute the following command on both worker nodes.

sudo cat /sys/kernel/debug/tracing/trace_pipe | grep bpf_trace_printk | grep -v '10.43.241.189'

Execute the following commands to send a request from the curl pod to the pipy-ok service.

curl_client="$(kubectl get pod -n ebpf -l app=curl -o jsonpath='{.items[0].metadata.name}')"
kubectl exec ${curl_client} -n ebpf -c curl -- curl -s pipy-ok:8080

You should receive results similar to the following, and the kernel tracing logs should also output the debug logs of the BPF program accordingly (the content is quite long, so it will not be shown here).

Hi, I am pipy ok v1 !
Hi, I am pipy ok v2 !