#+TODO: TODO(t) NEXT(n) WAITING(w) SOMEDAY(s) DELEGATED(g) PROJ(p) PLANNED(l) | DONE(d) FORWARDED(f) CANCELLED(c) |
2 |
#+startup: beamer |
|
3 |
#+LaTeX_CLASS: beamer |
|
4 |
#+LaTeX_CLASS_OPTIONS: [a4paper] |
|
5 |
#+LaTeX_CLASS_OPTIONS: [captions=tableheading] |
|
6 |
#+LATEX_HEADER: \usetheme{Warsaw} \usepackage{courier} |
|
7 |
#+LATEX_HEADER: \usepackage{textpos} |
|
8 |
#+LATEX_HEADER: \RequirePackage{fancyvrb} |
|
9 |
#+LATEX_HEADER: \DefineVerbatimEnvironment{verbatim}{Verbatim}{fontsize=\tiny} |
|
10 |
#+LATEX_HEADER: \setbeamercolor{title}{fg=green} |
|
11 |
#+LATEX_HEADER: \setbeamercolor{structure}{fg=black} |
|
12 |
#+LATEX_HEADER: \setbeamercolor{section in head/foot}{fg=green} |
|
13 |
#+LATEX_HEADER: \setbeamercolor{subsection in head/foot}{fg=green} |
|
14 |
#+LATEX_HEADER: \setbeamercolor{item}{fg=green} |
|
15 |
#+LATEX_HEADER: \setbeamerfont{frametitle}{family=\ttfamily} |
|
16 |
# logo |
|
17 |
#+LATEX_HEADER: \addtobeamertemplate{frametitle}{}{ \begin{textblock*}{100mm}(0.85\textwidth,-0.8cm) \includegraphics[height=0.7cm,width=2cm]{niit-logo.png} \end{textblock*}} |
|
18 |
#+OPTIONS: toc:nil title:nil ^:nil |
|
19 |
#+LANGUAGE: en |
|
20 |
#+TITLE: What are containers? |
|
21 |
|
|
22 |
* |
|
23 |
file:~/git/olbohlen-org/presentations/praesentation-containers-intro.png |
|
24 |
|
|
25 |
|
|
26 |
|
|
27 |
* knock knock...wake up... |
|
28 |
|
|
29 |
You want to know what containers are? |
|
30 |
|
|
31 |
|
|
32 |
|
|
33 |
* The Spoon |
|
34 |
** left :BMCOL: |
|
35 |
:PROPERTIES: |
|
36 |
:BEAMER_col: 0.5 |
|
37 |
:END: |
|
38 |
- Do not try to run containers, that's impossible. Instead, only try to realize the truth... |
|
39 |
|
|
40 |
- What truth? |
|
41 |
|
|
42 |
- There is no container... |
|
43 |
|
|
44 |
- There is no container? |
|
45 |
|
|
46 |
- Then you'll see that it is not the container which runs, it is the process itself. |
|
47 |
|
|
48 |
** right :BMCOL: |
|
49 |
:PROPERTIES: |
|
50 |
:BEAMER_col: 0.7 |
|
51 |
:END: |
|
52 |
|
|
53 |
file:~/git/olbohlen-org/presentations/neo-spoon.jpg |
|
54 |
|
|
55 |
|
|
56 |
|
|
57 |
* What Is A Process? |
|
58 |
|
|
59 |
- it has its own private memory |
|
60 |
- accesses outside a process's memory boundaries are answered with SIGSEGV (signal 11)
|
61 |
- a process has a heap, a stack, code (TEXT) and data (ANON) |
|
62 |
- the process can be observed by \textcolor{green}{ps}(1), which shows some attributes: |
|
63 |
|
|
64 |
#+begin_example |
|
65 |
$ ps -fp $$ |
|
66 |
UID PID PPID C STIME TTY TIME CMD |
|
67 |
olbohlen 11651 10046 0 23:07:43 pts/6 0:00 ksh |
|
68 |
#+end_example |
|
69 |
|
|
70 |
- we see the user id, process id, parent-pid, start time, the tty, the cpu time and command name |
|
71 |
- in UNIX these attributes are bundled in a C structure called proc_t |
|
72 |
- Linux uses task_struct which is a more hierarchical structure |
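On Linux, the attributes that \textcolor{green}{ps}(1) prints can also be read straight from procfs. The following is a minimal sketch, assuming a Linux /proc is mounted; the helper name proc_field is made up for this example:

#+begin_src shell
# proc_field is a hypothetical helper, not a standard tool.
# It prints one field of /proc/<pid>/status, e.g. Pid, PPid or Uid.
proc_field() {
    # $1 = pid, $2 = field name as spelled in the status file
    awk -v key="$2:" '$1 == key { print $2; exit }' "/proc/$1/status"
}

# Show a few attributes of the current shell, similar to "ps -fp $$".
# For Uid the first value on the line is used, which is the real uid.
echo "pid=$(proc_field $$ Pid) ppid=$(proc_field $$ PPid) uid=$(proc_field $$ Uid)"
#+end_src

The kernel fills this file from the same per-process structure that \textcolor{green}{ps}(1) reads, so both views should agree.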
|
73 |
|
|
74 |
|
|
75 |
|
|
76 |
* Container Implementations |
|
77 |
|
|
78 |
There are various implementations: |
|
79 |
- Linux: OpenVZ (2005), docker (2013), podman (~2018), etc... |
|
80 |
- FreeBSD: jails (Mar 2000) |
|
81 |
- illumos/Solaris: containers (Feb 2004) |
|
82 |
- AIX: wpars |
|
83 |
and various others... |
|
84 |
|
|
85 |
|
|
86 |
|
|
87 |
* The Lady In The Red Dress |
|
88 |
|
|
89 |
Welcome to a training program, let's start a simple container with podman... |
|
90 |
|
|
91 |
#+begin_example |
|
92 |
[olbohlen@rhel85 ~]$ podman run -d ubi8 sleep 10000 |
|
93 |
6b336fb0012f6f3d8fadca333e1e2bd900b7ede9560594bb0c5acc27a3aef4ee |
|
94 |
[olbohlen@rhel85 ~]$ podman ps |
|
95 |
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES |
|
96 |
6b336fb0012f registry.access.redhat.com/ubi8:latest sleep 10000 2 seconds ago Up 2 seconds ago nifty_hypatia |
|
97 |
[olbohlen@rhel85 ~]$ ps -ef | grep "sleep 10000" |
|
98 |
olbohlen 5026 5017 0 23:44 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 10000 |
|
99 |
[root@6b336fb0012f /]# ps -ef | grep sleep |
|
100 |
root 1 0 0 22:44 ? 00:00:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 10000 |
|
101 |
[olbohlen@rhel85 ~]$ ps -fZp 5017 |
|
102 |
LABEL UID PID PPID C STIME TTY TIME CMD |
|
103 |
unconfined_u:system_r:container_runtime_t:s0 olbohlen 5017 1 0 23:44 ? 00:00:00 /usr/bin/conmon --api-version 1 -c 6b336fb0012f6f3d8fadca333e1e2bd900b7ede95 |
|
104 |
#+end_example |
|
105 |
|
|
106 |
|
|
107 |
|
|
108 |
* First a Few Details |
|
109 |
|
|
110 |
- podman uses \textcolor{green}{runc}(8), the OCI container runtime
- containers are instantiated using different technologies
- namespaces: providing resource "visibilities"
- cgroups: limiting compute resources such as CPU and memory
- chroot: creating a fake root directory
- seccomp: limiting access to system calls
- SELinux: providing extra layers to prevent escapes
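Some of these technologies leave visible traces on every process. As a rough sketch (assuming a Linux /proc; both helper names are invented for this example), we can ask which cgroup a process lives in and whether a seccomp filter is active:

#+begin_src shell
# Both helper names are made up for this sketch.

# Print the cgroup path(s) of a process (third field of /proc/<pid>/cgroup).
cgroup_paths() {
    cut -d: -f3- "/proc/$1/cgroup" 2>/dev/null
}

# Print the seccomp mode of a process: 0 = disabled, 1 = strict, 2 = filter.
# Prints "unknown" if the field cannot be read.
seccomp_mode() {
    mode=$(awk '$1 == "Seccomp:" { print $2; exit }' "/proc/$1/status" 2>/dev/null) || mode=""
    echo "${mode:-unknown}"
}

echo "cgroup:  $(cgroup_paths $$)"
echo "seccomp: $(seccomp_mode $$)"
#+end_src

Run inside a podman container, the seccomp mode would typically be 2, because the runtime installs a filter by default; on a plain login shell it is usually 0.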
|
117 |
|
|
118 |
|
|
119 |
|
|
120 |
* The World You See Is Not Real |
|
121 |
|
|
122 |
Namespaces "scope" the visibility of various things |
|
123 |
Linux supports different types of \textcolor{green}{namespaces}(7) like: |
|
124 |
|
|
125 |
- cgroup: Cgroup root directory |
|
126 |
- ipc: System V IPC, POSIX message queues |
|
127 |
- mnt: Mount points |
|
128 |
- net: Network devices, stacks, ports, etc. |
|
129 |
- pid: Process IDs |
|
130 |
- user: User and group IDs |
|
131 |
- uts: Hostname and NIS domain name |
|
132 |
|
|
133 |
Each type isolates a different aspect of what a process can see.
Namespaces can be created with \textcolor{green}{unshare}(1)
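Every process carries one reference per namespace type, and Linux exposes them as symbolic links below /proc/<pid>/ns. A small sketch, assuming a Linux /proc (list_ns is a hypothetical helper name):

#+begin_src shell
# list_ns is a made-up helper: print "<link name> <namespace id>" for every
# namespace reference of the process given as $1.
list_ns() {
    for link in /proc/"$1"/ns/*; do
        # Skip the unexpanded glob if the directory is missing or unreadable.
        [ -e "$link" ] || [ -L "$link" ] || continue
        printf '%s %s\n' "${link##*/}" "$(readlink "$link")"
    done
}

list_ns $$
#+end_src

Two processes are in the same namespace exactly when the id in brackets is identical, so comparing this output inside and outside a container shows which namespaces were unshared.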
|
135 |
|
|
136 |
|
|
137 |
* Let's Learn Some Kung Fu |
|
138 |
|
|
139 |
Let's build a simple container on our own with \textcolor{green}{unshare}(1) and \textcolor{green}{chroot}(1): |
|
140 |
|
|
141 |
#+begin_example |
|
142 |
$ mkdir -p ~/sysroot/{bin,lib64,proc} |
|
143 |
$ for f in $(ldd /bin/{bash,df,ls,lsns,mount,ps,uname} | \ |
|
144 |
> tr '[ :]' '\n' | grep /); do cp $f sysroot/$f; done |
|
145 |
$ sudo mount --bind /home/olbohlen/sysroot/proc /home/olbohlen/sysroot/proc |
|
146 |
$ unshare -irmnpuUCf --mount-proc=$PWD/sysroot/proc chroot $PWD/sysroot /bin/bash |
|
147 |
bash-4.4# /bin/ps -ef |
|
148 |
UID PID PPID C STIME TTY TIME CMD |
|
149 |
0 1 0 0 16:58 ? 00:00:00 /bin/bash |
|
150 |
0 2 1 0 16:58 ? 00:00:00 /bin/ps -ef |
|
151 |
bash-4.4# /bin/mount |
|
152 |
/dev/mapper/rhel_rhel85-root on /proc type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) |
|
153 |
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) |
|
154 |
#+end_example |
|
155 |
|
|
156 |
|
|
157 |
|
|
158 |
* So, Was That Real? |
|
159 |
|
|
160 |
Well, we have to trust Linux here a bit... |
|
161 |
But on other UNIX systems we can actually dig deeper: |
|
162 |
|
|
163 |
Our rabbit hole entry is the kernel debugger, which we can attach |
|
164 |
to a running UNIX kernel and observe (and modify) the system live. |
|
165 |
|
|
166 |
Allow me to do that on illumos, as the process structures are a |
|
167 |
bit more "organized". |
|
168 |
|
|
169 |
|
|
170 |
|
|
171 |
* Again A Bit Of Boring Info |
|
172 |
|
|
173 |
When we attach the kernel debugger (mdb) to a running kernel,
we have raw memory access. UNIX organizes data in C structures,
which may contain other data types such as int or char (or, again,
structs).\\
|
177 |
\\ |
|
178 |
A simple C structure could look like this:
#+begin_src C :exports code
struct position {
    int x;
    int y;
};
#+end_src

If we read that struct from memory, it might look like this:
#+begin_example
position.x = 42
position.y = 23
#+end_example
|
191 |
|
|
192 |
|
|
193 |
|
|
194 |
* With Annoying Details... |
|
195 |
|
|
196 |
(Un)fortunately the debugger does not know the format of a data |
|
197 |
structure at a given address, so we need to validate that we |
|
198 |
got correct data.\\ |
|
199 |
\\ |
|
200 |
The debugger has some commands to look at known places for certain |
|
201 |
structures, such as the process table or in our example the list |
|
202 |
of containers.\\ |
|
203 |
\\ |
|
204 |
|
|
205 |
|
|
206 |
|
|
207 |
* Down The Rabbit Hole |
|
208 |
|
|
209 |
So let's run the debugger: |
|
210 |
|
|
211 |
#+begin_example |
|
212 |
(701) x230:/root# mdb -k |
|
213 |
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs sata sd ip hook neti sockfs arp usba i915 xhci mm smbios fctl stmf stmf_sbd lofs random idm cpc crypto fcip fcp ufs logindmux nsmb ptm smbsrv nfs sppp kvm ipc ] |
|
214 |
> ::zone |
|
215 |
ADDR ID STATUS NAME PATH |
|
216 |
fffffffffbd08c20 0 running global / |
|
217 |
fffffe16886626c0 1 running asterix /export/zones/asterix/root/ |
|
218 |
fffffe16929ebd80 2 running obelix /export/zones/obelix/root/ |
|
219 |
fffffe16bbc66500 4 running rhel85 /export/zones/rhel85/root/ |
|
220 |
#+end_example |
|
221 |
|
|
222 |
Wait, we have a container called "rhel85", wasn't that the rhel machine from the demos before? |
|
223 |
Yes, actually that container runs a bhyve hypervisor process which runs RHEL 8.5... |
|
224 |
#+begin_example |
|
225 |
(629) x230:/export/home/olbohlen$ ps -f -z rhel85 |
|
226 |
UID PID PPID C STIME TTY TIME CMD |
|
227 |
root 15267 5136 0 17:57:03 ? 1:40 /usr/sbin/bhyve -U 37960a3a-c5ac-6c8b-d14b-8204ca044474 -H -B 1,manufacturer=Op |
|
228 |
root 5113 1 0 Jan 01 ? 0:00 zsched |
|
229 |
root 5136 5113 0 Jan 01 ? 0:00 /usr/bin/python3.5 /usr/lib/brand/bhyve/init |
|
230 |
(630) x230:/export/home/olbohlen$ |
|
231 |
#+end_example |
|
232 |
|
|
233 |
* Down The Rabbit Hole |
|
234 |
|
|
235 |
According to \textcolor{green}{ps}(1), the bhyve process runs with PID 15267. Let's look at it in mdb:
|
236 |
#+begin_example |
|
237 |
> ::ps ! egrep "(PID|bhyve)" |
|
238 |
S PID PPID PGID SID UID FLAGS ADDR NAME |
|
239 |
R 15267 5136 5113 5113 0 0x4a004000 fffffe16b8d6d010 bhyve |
|
240 |
#+end_example |
|
241 |
|
|
242 |
ADDR is the start address in RAM for the proc_t data structure |
|
243 |
|
|
244 |
#+begin_example |
|
245 |
> fffffe16b8d6d010::print -a proc_t ! less |
|
246 |
[...] |
|
247 |
> fffffe16b8d6d010::print -a proc_t p_user.u_psargs |
|
248 |
fffffe16b8d6d879 p_user.u_psargs = [ "/usr/sbin/bhyve -U 37960a3a-c5ac-6c8b-d14b-8204ca044474 -H -B 1,manufacturer=Op" ] |
|
249 |
#+end_example |
|
250 |
|
|
251 |
The proc_t structure stores all attributes of a process: those that \textcolor{green}{ps}(1) shows, and more.
It also holds the container id (p_zone, think of it as the namespace id):
|
253 |
|
|
254 |
#+begin_example |
|
255 |
> fffffe16b8d6d010::print -a proc_t p_zone |
|
256 |
fffffe16b8d6d658 p_zone = 0xfffffe16bbc66500 |
|
257 |
> 0xfffffe16bbc66500::zone |
|
258 |
ADDR ID STATUS NAME PATH |
|
259 |
fffffe16bbc66500 4 running rhel85 /export/zones/rhel85/root/ |
|
260 |
> |
|
261 |
#+end_example |
|
262 |
|
|
263 |
|
|
264 |
* Container Images |
|
265 |
** left :BMCOL: |
|
266 |
:PROPERTIES: |
|
267 |
:BEAMER_col: 0.5 |
|
268 |
:END: |
|
269 |
- We need images. |
|
270 |
Lots of images. |
|
271 |
** right :BMCOL: |
|
272 |
:PROPERTIES: |
|
273 |
:BEAMER_col: 0.7 |
|
274 |
:END: |
|
275 |
|
|
276 |
file:~/git/olbohlen-org/presentations/matrix-storage.jpg |
|
277 |
|
|
278 |
|
|
279 |
|
|
280 |
* Storing The Data |
|
281 |
|
|
282 |
podman/docker use so-called images to instantiate containers.
These images are made of layers, stacked like viewfoils on an overhead projector.
|
284 |
|
|
285 |
#+begin_example |
|
286 |
[olbohlen@rhel85 scratch]$ skopeo inspect docker://registry.access.redhat.com/rhscl/postgresql-10-rhel7 \ |
|
287 |
> | jq ".Layers" |
|
288 |
[ |
|
289 |
"sha256:ac08ca107ad9ed699cbd28339749dd6463a84c73aa1d468a4241385fc4ec3876", |
|
290 |
"sha256:b46ca46c303b49d886a7585735ebd1dc8651e83d0fab5823300cf3a9fd2febc1", |
|
291 |
"sha256:cdd22b43a6f986fc909d504043ef6ad6528a6c1927f27c80eea2d19ffe5079fe", |
|
292 |
"sha256:4c9f611df095eef49c081f758ad314b62a297172e22a8a746514d252a7a89c45" |
|
293 |
] |
|
294 |
#+end_example |
|
295 |
|
|
296 |
This image consists of four layers, each of which is a tar archive that you can extract.
|
297 |
|
|
298 |
|
|
299 |
|
|
300 |
* Exploring The Image |
|
301 |
|
|
302 |
Let's extract an image to a local directory: |
|
303 |
|
|
304 |
#+begin_example |
|
305 |
[olbohlen@rhel85 scratch]$ skopeo copy --remove-signatures \ |
|
306 |
> docker://registry.access.redhat.com/rhscl/postgresql-10-rhel7 dir:///$PWD |
|
307 |
Copying blob ac08ca107ad9 done |
|
308 |
Copying blob b46ca46c303b done |
|
309 |
Copying blob cdd22b43a6f9 done |
|
310 |
Copying blob 4c9f611df095 done |
|
311 |
Copying config 00a55534f8 done |
|
312 |
Writing manifest to image destination |
|
313 |
Storing signatures |
|
314 |
[olbohlen@rhel85 scratch]$ ls |
|
315 |
00a55534f8db45877d6657cc9b1ba77841c49cb21cc4d7a4c9cd4e98020a4bc8 |
|
316 |
4c9f611df095eef49c081f758ad314b62a297172e22a8a746514d252a7a89c45 |
|
317 |
ac08ca107ad9ed699cbd28339749dd6463a84c73aa1d468a4241385fc4ec3876 |
|
318 |
b46ca46c303b49d886a7585735ebd1dc8651e83d0fab5823300cf3a9fd2febc1 |
|
319 |
cdd22b43a6f986fc909d504043ef6ad6528a6c1927f27c80eea2d19ffe5079fe |
|
320 |
manifest.json |
|
321 |
version |
|
322 |
#+end_example |
|
323 |
|
|
324 |
Also use \textcolor{green}{jq}(1) to inspect the manifest and the config. |
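The file names in that directory are not random: each blob is named after the sha256 digest of its content, which is why they match the digests in the manifest. A minimal sketch of checking that, assuming GNU coreutils' sha256sum is available (verify_blob is a hypothetical helper, and the demo uses a fake blob instead of the real image):

#+begin_src shell
# verify_blob is a hypothetical helper. It assumes the blob file is named
# after the hex sha256 digest of its content, as in the listing above.
verify_blob() {
    expected=${1##*/}                                  # file name without directories
    actual=$(sha256sum "$1" | cut -d' ' -f1) || return 1
    [ "$expected" = "$actual" ]
}

# Self-contained demo: build a fake blob, name it after its digest, verify it.
store=$(mktemp -d)
printf 'pretend this is a layer archive\n' >"$store/tmp"
digest=$(sha256sum "$store/tmp" | cut -d' ' -f1)
mv "$store/tmp" "$store/$digest"

if verify_blob "$store/$digest"; then
    echo "blob matches its name"
else
    echo "blob is corrupt"
fi
#+end_src

In the scratch directory from the slide, the same function could be pointed at any of the long hexadecimal file names.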
|
325 |
|
|
326 |
|
|
327 |
|
|
328 |
* Finding The Image Config |
|
329 |
|
|
330 |
There's an obvious manifest.json, so let's look into it. |
|
331 |
#+begin_example |
|
332 |
[olbohlen@rhel85 scratch]$ jq ".config.digest" <manifest.json |
|
333 |
"sha256:00a55534f8db45877d6657cc9b1ba77841c49cb21cc4d7a4c9cd4e98020a4bc8" |
|
334 |
#+end_example |
|
335 |
|
|
336 |
That's our image config, itself a json file: |
|
337 |
#+begin_example |
|
338 |
[olbohlen@rhel85 scratch]$ jq . 00a55534f8db45877d6657cc9b1ba77841c49cb21cc4d7a4c9cd4e98020a4bc8 |
|
339 |
{ |
|
340 |
"architecture": "amd64", |
|
341 |
[...] |
|
342 |
#+end_example |
|
343 |
|
|
344 |
Looks familiar? Yes, that's more or less podman inspect. |
|
345 |
|
|
346 |
In the manifest.json we also see the layers: |
|
347 |
|
|
348 |
#+begin_example |
|
349 |
[olbohlen@rhel85 scratch]$ jq ".layers[].digest" manifest.json |
|
350 |
"sha256:ac08ca107ad9ed699cbd28339749dd6463a84c73aa1d468a4241385fc4ec3876" |
|
351 |
"sha256:b46ca46c303b49d886a7585735ebd1dc8651e83d0fab5823300cf3a9fd2febc1" |
|
352 |
"sha256:cdd22b43a6f986fc909d504043ef6ad6528a6c1927f27c80eea2d19ffe5079fe" |
|
353 |
"sha256:4c9f611df095eef49c081f758ad314b62a297172e22a8a746514d252a7a89c45" |
|
354 |
[olbohlen@rhel85 scratch]$ du -h ac08ca107ad9ed699cbd2833[...] |
|
355 |
73M ac08ca107ad9ed699cbd28339749dd6463a84c73aa1d468a4241385fc4ec3876 |
|
356 |
4.0K b46ca46c303b49d886a7585735ebd1dc8651e83d0fab5823300cf3a9fd2febc1 |
|
357 |
7.0M cdd22b43a6f986fc909d504043ef6ad6528a6c1927f27c80eea2d19ffe5079fe |
|
358 |
33M 4c9f611df095eef49c081f758ad314b62a297172e22a8a746514d252a7a89c45 |
|
359 |
#+end_example |
|
360 |
These are \textcolor{green}{tar}(1) archives we can extract and inspect. |
|
361 |
When you start a container, the extracted layers will be mounted with |
|
362 |
OverlayFS. |
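Without any overlay file system, the effect of layering can be approximated by extracting the archives on top of each other, base layer first. This is only a sketch under simplifying assumptions: unpack_layers is an invented helper, the demo builds its own tiny layers, and the whiteout entries that real images use to mark deleted files are not interpreted.

#+begin_src shell
# unpack_layers is a hypothetical helper: it flattens layer archives into one
# directory. Layers are given bottom-up, so files from later layers win.
# Assumption: plain extraction is enough; whiteout entries (".wh." files that
# mark deletions in real images) are NOT handled here.
unpack_layers() {
    target=$1
    shift
    mkdir -p "$target"
    for layer in "$@"; do
        tar -xf "$layer" -C "$target" || return 1
    done
}

# Self-contained demo with two tiny home-made layers.
work=$(mktemp -d)
mkdir "$work/l1" "$work/l2"
echo "from layer one" >"$work/l1/motd"
echo "only in layer one" >"$work/l1/base"
echo "from layer two" >"$work/l2/motd"
tar -cf "$work/layer1.tar" -C "$work/l1" .
tar -cf "$work/layer2.tar" -C "$work/l2" .

unpack_layers "$work/rootfs" "$work/layer1.tar" "$work/layer2.tar"
cat "$work/rootfs/motd"    # the copy from the topmost layer survives
ls "$work/rootfs"
#+end_src

The drawback of this approach is that every container would need its own full copy, which is exactly what mounting the layers with an overlay avoids.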
|
363 |
|
|
364 |
|
|
365 |
|
|
366 |
* Can We Simulate Layering? |
|
367 |
|
|
368 |
podman uses \textcolor{green}{fuse-overlayfs}(1) to mount container image layers. |
|
369 |
Since Linux 4.18 this can also be done by non-root users:
|
370 |
|
|
371 |
#+begin_example |
|
372 |
$ mkdir layer1 |
|
373 |
$ mkdir layer2 |
|
374 |
$ mkdir ephemeral-layer |
|
375 |
$ mkdir mountdir |
|
376 |
$ echo "this is file one" >layer1/f1 |
|
377 |
$ echo "this is file two" >layer2/f2 |
|
378 |
$ fuse-overlayfs -o lowerdir=$PWD/layer1:$PWD/layer2 -o upperdir=$PWD/ephemeral-layer \ |
|
379 |
> -o workdir=$PWD/fuse-work $PWD/mountdir |
|
380 |
$ ls mountdir |
|
381 |
f1 f2 |
|
382 |
$ echo "this is file three" >mountdir/f3 |
|
383 |
$ fusermount -u $PWD/mountdir |
|
384 |
$ ls */f? |
|
385 |
ephemeral-layer/f3 layer1/f1 layer2/f2 |
|
386 |
#+end_example |
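What the overlay does on a lookup can be imitated without mounting anything: search the upper layer first, then the lower layers from left to right. The sketch below is a simplification with a made-up helper name (overlay_lookup); a real overlay additionally handles whiteouts, opaque directories and copy-up on write.

#+begin_src shell
# overlay_lookup is a made-up helper that mimics how an overlay resolves a
# name: the directories are searched in the order given, first match wins.
overlay_lookup() {
    name=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}

demo=$(mktemp -d)
mkdir "$demo/layer1" "$demo/layer2" "$demo/ephemeral-layer"
echo "this is file one" >"$demo/layer1/f1"
echo "this is file two" >"$demo/layer2/f2"
echo "changed copy of file one" >"$demo/ephemeral-layer/f1"

# Search order: upperdir first, then the lowerdir entries from left to right.
overlay_lookup f1 "$demo/ephemeral-layer" "$demo/layer1" "$demo/layer2"
overlay_lookup f2 "$demo/ephemeral-layer" "$demo/layer1" "$demo/layer2"
#+end_src

The first lookup resolves to the copy in the ephemeral layer, which is how changes made in a running container hide the unmodified file in the image.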
|
387 |
|
|
388 |
|
|
389 |
|
|
390 |
* Communication System |
|
391 |
** left :BMCOL: |
|
392 |
:PROPERTIES: |
|
393 |
:BEAMER_col: 0.5 |
|
394 |
:END: |
|
395 |
- We need an exit! |
|
396 |
|
|
397 |
** right :BMCOL: |
|
398 |
:PROPERTIES: |
|
399 |
:BEAMER_col: 0.7 |
|
400 |
:END: |
|
401 |
|
|
402 |
file:~/git/olbohlen-org/presentations/matrix-communication.jpg |
|
403 |
|
|
404 |
* Network Access |
|
405 |
|
|
406 |
podman uses CNI (Container Network Interface) to provide a network
interface for a container (so, a namespaced NIC), which is usually
created on a bridge. This is only possible for containers started as root:
|
409 |
|
|
410 |
#+begin_example |
|
411 |
[olbohlen@rhel85 ~]$ sudo podman run -it registry.access.redhat.com/ubi8 \ |
|
412 |
> bash -c "(dnf install -y iproute && ip a s)" |
|
413 |
Updating Subscription Management repositories. |
|
414 |
[...] |
|
415 |
Complete! |
|
416 |
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 |
|
417 |
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 |
|
418 |
inet 127.0.0.1/8 scope host lo |
|
419 |
valid_lft forever preferred_lft forever |
|
420 |
inet6 ::1/128 scope host |
|
421 |
valid_lft forever preferred_lft forever |
|
422 |
2: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default |
|
423 |
link/ether ca:dc:cc:3a:c9:e5 brd ff:ff:ff:ff:ff:ff link-netnsid 0 |
|
424 |
inet 10.88.0.3/16 brd 10.88.255.255 scope global eth0 |
|
425 |
valid_lft forever preferred_lft forever |
|
426 |
inet6 fe80::c8dc:ccff:fe3a:c9e5/64 scope link |
|
427 |
valid_lft forever preferred_lft forever |
|
428 |
#+end_example |
|
429 |
|
|
430 |
|
|
431 |
|
|
432 |
* Communication Without Privileges |
|
433 |
|
|
434 |
Since a normal user usually can't create network interfaces, rootless containers
can't attach an interface to a bridge. Instead, rootless containers use the userland
tap driver (known from openvpn or virtualbox, for example):
|
437 |
|
|
438 |
#+begin_example |
|
439 |
[olbohlen@rhel85 ~]$ podman run -it registry.access.redhat.com/ubi8 \ |
|
440 |
> bash -c "(dnf install -y iproute && ip a s)" |
|
441 |
Updating Subscription Management repositories. |
|
442 |
[...] |
|
443 |
Complete! |
|
444 |
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 |
|
445 |
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 |
|
446 |
inet 127.0.0.1/8 scope host lo |
|
447 |
valid_lft forever preferred_lft forever |
|
448 |
inet6 ::1/128 scope host |
|
449 |
valid_lft forever preferred_lft forever |
|
450 |
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000 |
|
451 |
link/ether 86:21:df:f9:40:43 brd ff:ff:ff:ff:ff:ff |
|
452 |
inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0 |
|
453 |
valid_lft forever preferred_lft forever |
|
454 |
inet6 fe80::8421:dfff:fef9:4043/64 scope link |
|
455 |
valid_lft forever preferred_lft forever |
|
456 |
#+end_example |
|
457 |
|
|
458 |
|
|
459 |
|
|
460 |
* A TAP On The Net |
|
461 |
|
|
462 |
The tap driver is part of the universal tun/tap driver, which has been developed since 1999 for
Linux, FreeBSD and Solaris. It allows user processes to create a network interface.
Depending on what the code requests, this is either a tun or a tap interface.
|
465 |
|
|
466 |
What is the difference? |
|
467 |
|
|
468 |
- a tun interface behaves like a Point-To-Point interface and handles IP packets |
|
469 |
- a tap interface behaves like an Ethernet interface and handles Ethernet frames
|
470 |
|
|
471 |
All packets sent to these interfaces will be received by the application which created |
|
472 |
them. Popular examples are the \textcolor{green}{pppd}(8) or openvpn. |
|
473 |
|
|
474 |
podman uses \textcolor{green}{slirp4netns}(1) to create a user-mode network interface |
|
475 |
|
|
476 |
|
|
477 |
|
|
478 |
* Let's Hack The Matrix |
|
479 |
|
|
480 |
So, first we set up our simple container again:
|
481 |
|
|
482 |
#+begin_example |
|
483 |
$ mkdir -p ~/sysroot/{bin,lib64,proc,sbin} |
|
484 |
$ for f in $(ldd /bin/{bash,df,ls,lsns,mount,ps,uname,ping} /sbin/{ip,ifconfig} | \ |
|
485 |
> tr '[ :]' '\n' | grep /); do cp $f sysroot/$f; done |
|
486 |
$ sudo mount --bind /home/olbohlen/sysroot/proc /home/olbohlen/sysroot/proc |
|
487 |
$ unshare -irmnpuUCf --mount-proc=$PWD/sysroot/proc chroot $PWD/sysroot /bin/bash |
|
488 |
bash-4.4# /bin/ps -ef |
|
489 |
UID PID PPID C STIME TTY TIME CMD |
|
490 |
0 1 0 0 16:58 ? 00:00:00 /bin/bash |
|
491 |
0 2 1 0 16:58 ? 00:00:00 /bin/ps -ef |
|
492 |
bash-4.4# /bin/mount |
|
493 |
/dev/mapper/rhel_rhel85-root on /proc type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota) |
|
494 |
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime) |
|
495 |
#+end_example |
|
496 |
|
|
497 |
|
|
498 |
|
|
499 |
* Let's Hack The Matrix |
|
500 |
|
|
501 |
On the host OS: |
|
502 |
|
|
503 |
#+begin_example |
|
504 |
[olbohlen@rhel85 ~]$ pgrep -P $(pgrep -x unshare) bash |
|
505 |
2425 |
|
506 |
[olbohlen@rhel85 ~]$ slirp4netns --configure --mtu=65520 2425 tap0 |
|
507 |
sent tapfd=5 for tap0 |
|
508 |
received tapfd=5 |
|
509 |
Starting slirp |
|
510 |
* MTU: 65520 |
|
511 |
* Network: 10.0.2.0 |
|
512 |
* Netmask: 255.255.255.0 |
|
513 |
* Gateway: 10.0.2.2 |
|
514 |
* DNS: 10.0.2.3 |
|
515 |
* Recommended IP: 10.0.2.100 |
|
516 |
WARNING: 127.0.0.1:* on the host is accessible as 10.0.2.2 (set --disable-host-loopback to prohibit connecting to 127.0.0.1:*) |
|
517 |
#+end_example |
|
518 |
|
|
519 |
|
|
520 |
|
|
521 |
* Let's Hack The Matrix |
|
522 |
|
|
523 |
Back in the container: |
|
524 |
#+begin_example |
|
525 |
bash-4.4# /sbin/ip a s |
|
526 |
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 |
|
527 |
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 |
|
528 |
inet 127.0.0.1/8 scope host lo |
|
529 |
valid_lft forever preferred_lft forever |
|
530 |
inet6 ::1/128 scope host |
|
531 |
valid_lft forever preferred_lft forever |
|
532 |
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000 |
|
533 |
link/ether be:0c:f2:d0:28:79 brd ff:ff:ff:ff:ff:ff |
|
534 |
inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0 |
|
535 |
valid_lft forever preferred_lft forever |
|
536 |
inet6 fe80::bc0c:f2ff:fed0:2879/64 scope link |
|
537 |
valid_lft forever preferred_lft forever |
|
538 |
bash-4.4# /bin/ping 10.0.2.2 |
|
539 |
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data. |
|
540 |
64 bytes from 10.0.2.2: icmp_seq=1 ttl=255 time=0.563 ms |
|
541 |
64 bytes from 10.0.2.2: icmp_seq=2 ttl=255 time=0.127 ms |
|
542 |
^C |
|
543 |
--- 10.0.2.2 ping statistics --- |
|
544 |
2 packets transmitted, 2 received, 0% packet loss, time 1007ms |
|
545 |
rtt min/avg/max/mdev = 0.127/0.345/0.563/0.218 ms |
|
546 |
#+end_example |
|
547 |
|
|
548 |
|
|
549 |
|
|
550 |
* Other Hacks With Rootless |
|
551 |
|
|
552 |
A bigger issue is that an unprivileged user cannot start processes under other uids.
Rootless podman containers therefore use a uid mapping.\\
The file /etc/subuid assigns each user a range of subordinate uids:
|
555 |
|
|
556 |
#+begin_example |
|
557 |
[olbohlen@rhel85 ~]$ id -a |
|
558 |
uid=4100(olbohlen) gid=4100(olbohlen) groups=4100(olbohlen),10(wheel) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 |
|
559 |
[olbohlen@rhel85 ~]$ cat /etc/subuid |
|
560 |
olbohlen:100000:65536 |
|
561 |
#+end_example |
|
562 |
That means all uids from 100000 to 165535 are reserved for olbohlen. |
|
563 |
The mapping looks like this: |
|
564 |
|
|
565 |
#+ATTR_LATEX: :environment longtable :align l|l
| uid in container | uid outside container     |
|------------------+---------------------------|
|                0 | 4100 (user's primary uid) |
|                1 | 100000 (first subuid)     |
|                2 | 100001                    |
|              ... | ...                       |
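The table follows a simple rule that can be written down as arithmetic. A minimal sketch, assuming the default rootless mapping shown above (container uid 0 is the user, everything else comes from the subuid range); map_uid is a hypothetical helper, not part of podman:

#+begin_src shell
# map_uid is a made-up helper that reproduces the table above.
# $1 = uid inside the container
# $2 = the user's own uid on the host
# $3 = the user's line from /etc/subuid ("name:start:count")
map_uid() {
    start=$(printf '%s\n' "$3" | cut -d: -f2)
    count=$(printf '%s\n' "$3" | cut -d: -f3)
    if [ "$1" -eq 0 ]; then
        echo "$2"                        # container root is the user itself
    elif [ "$1" -le "$count" ]; then
        echo $(( start + $1 - 1 ))       # uid 1 maps to the first subuid
    else
        echo "uid $1 is not mapped" >&2
        return 1
    fi
}

# Values taken from the slide: uid 4100 and the subuid entry for olbohlen.
for cuid in 0 1 2; do
    echo "container uid $cuid -> host uid $(map_uid "$cuid" 4100 "olbohlen:100000:65536")"
done
#+end_src

Container uids beyond the size of the range have no host uid at all, which is why files owned by very high uids cannot be created in such a container.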
|
572 |
|
|
573 |
|
|
574 |
|
|
575 |
* And What Is A Pod? |
|
576 |
** left :BMCOL: |
|
577 |
:PROPERTIES: |
|
578 |
:BEAMER_col: 0.5 |
|
579 |
:END: |
|
580 |
- a pod is a set of containers |
|
581 |
|
|
582 |
- usually contains side-car containers |
|
583 |
|
|
584 |
- these containers share certain namespaces |
|
585 |
|
|
586 |
** right :BMCOL: |
|
587 |
:PROPERTIES: |
|
588 |
:BEAMER_col: 0.7 |
|
589 |
:END: |
|
590 |
|
|
591 |
file:~/git/olbohlen-org/presentations/matrix-pod.jpg |
|
592 |
|
|
593 |
|
|
594 |
|
|
595 |
* Why Use Side-Cars?
|
596 |
|
|
597 |
Kubernetes does not manage individual containers; it manages pods as
its smallest unit.
|
599 |
#+begin_src ditaa :file pods-1.png :cmdline -E -s 0.8
                Pod
+------------------------------------+
| Container  Container               |
| +--------+ +------------------+    |
| | apache | | Monitoring Agent |    |
| | (main) | |    (side car)    |    |
| +--------+ +------------------+    |
|                                    |
+------------------------------------+
      ^                ^
      |                |
+------------+ +----------------+
|apache image| |Monitoring image|
|{s}         | |{s}             |
+------------+ +----------------+
#+end_src
|
616 |
|
|
617 |
The idea is to separate applications from helper
applications so that each can be released separately.
|
619 |
|
|
620 |
|
|
621 |
|
|
622 |
* Pods And Linux Namespaces |
|
623 |
|
|
624 |
So, what namespaces does a pod share between containers? |
|
625 |
|
|
626 |
- net: all containers share the IP address and ports
- ipc: containers can use IPC with each other (shared memory, semaphores, etc.)
- uts: all containers share the same hostname
|
629 |
|
|
630 |
You can also share the PID namespace between the containers of a pod by setting:
|
631 |
|
|
632 |
v1.pod.spec.shareProcessNamespace: true |
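On the node, the sharing can be checked by comparing the namespace links of two processes. The following sketch assumes a Linux /proc and enough privileges to read the links of both processes; shared_ns is a hypothetical helper name:

#+begin_src shell
# shared_ns is a made-up helper: for two pids, report per namespace type
# whether both processes reference the same namespace.
shared_ns() {
    for type in net ipc uts pid mnt; do
        a=$(readlink "/proc/$1/ns/$type" 2>/dev/null) || a=""
        b=$(readlink "/proc/$2/ns/$type" 2>/dev/null) || b=""
        if [ -z "$a" ] || [ -z "$b" ]; then
            echo "$type unknown"         # link missing or not readable
        elif [ "$a" = "$b" ]; then
            echo "$type shared"
        else
            echo "$type separate"
        fi
    done
}

# Demo: compare the current shell with its parent process.
shared_ns $$ "$PPID"
#+end_src

For two containers of the same pod one would expect net, ipc and uts to be reported as shared, and pid only when shareProcessNamespace is set.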
|
633 |
|
|
634 |
|
|
635 |
|
|
636 |
* Pods And Linux Namespaces |
|
637 |
|
|
638 |
|
|
639 |
#+begin_example |
|
640 |
(738) x230:/export/home/olbohlen/scratch$ oc logs hi-7459f5c556-qkxj4 |
|
641 |
error: a container name must be specified for pod hi-7459f5c556-qkxj4, |
|
642 |
choose one of: [hi sidecarone] |
|
643 |
(741) x230:/export/home/olbohlen/scratch$ oc rsh -c sidecarone hi-7459f5c556-qkxj4 ps -ef |
|
644 |
PID USER TIME COMMAND |
|
645 |
1 10006000 0:00 sleep 360000 |
|
646 |
9 10006000 0:00 ps -ef |
|
647 |
(747) x230:/export/home/olbohlen/scratch$ oc rsh -c hi hi-7459f5c556-qkxj4 ps -ef |
|
648 |
UID PID PPID C STIME TTY TIME CMD |
|
649 |
1000600+ 1 0 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
650 |
1000600+ 26 1 0 19:33 ? 00:00:00 /usr/bin/coreutils --coreuti |
|
651 |
1000600+ 27 1 0 19:33 ? 00:00:00 /usr/bin/coreutils --coreuti |
|
652 |
1000600+ 28 1 0 19:33 ? 00:00:00 /usr/bin/coreutils --coreuti |
|
653 |
1000600+ 29 1 0 19:33 ? 00:00:00 /usr/bin/coreutils --coreuti |
|
654 |
1000600+ 30 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
655 |
1000600+ 36 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
656 |
1000600+ 43 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
657 |
1000600+ 64 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
658 |
1000600+ 66 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
659 |
1000600+ 72 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
660 |
1000600+ 82 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
661 |
1000600+ 88 1 0 19:33 ? 00:00:00 httpd -D FOREGROUND |
|
662 |
1000600+ 106 0 0 19:42 pts/0 00:00:00 ps -ef |
|
663 |
#+end_example |
|
664 |
|
|
665 |
* Thank You |
|
666 |
|
|
667 |
Thank you for your attention.\\ |
|
668 |
\\ |
|
669 |
Do you have any questions?\\ |
|
670 |
\\ |
|
671 |
Feel free to ask now or contact me later: |
|
672 |
|
|
673 |
[[mailto:olaf.bohlen@niit.com][olaf.bohlen@niit.com]] |