Commit 864ea0e

beaubelgrave authored and rostedt committed
user_events: Add documentation file
Add a documentation file about user_events with example code, etc.
explaining how it may be used.

Link: https://lkml.kernel.org/r/20220118204326.2169-13-beaub@linux.microsoft.com

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
1 parent c57eb47 commit 864ea0e

2 files changed

Lines changed: 217 additions & 0 deletions

Documentation/trace/index.rst

Lines changed: 1 addition & 0 deletions
@@ -30,3 +30,4 @@ Linux Tracing Technologies
30 30
stm
31 31
sys-t
32 32
coresight/index
33+
user_events
Lines changed: 216 additions & 0 deletions
@@ -0,0 +1,216 @@
1+
=========================================
2+
user_events: User-based Event Tracing
3+
=========================================
4+
5+
:Author: Beau Belgrave
6+
7+
Overview
8+
--------
9+
User based trace events allow user processes to create events and trace data
10+
that can be viewed via existing tools, such as ftrace, perf and eBPF.
11+
To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
12+
13+
Programs can view status of the events via
14+
/sys/kernel/debug/tracing/user_events_status and can both register and write
15+
data out via /sys/kernel/debug/tracing/user_events_data.
16+
17+
Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and
18+
delete user based events via the u: prefix. The format of the command to
19+
dynamic_events is the same as the ioctl with the u: prefix applied.
20+
21+
Typically programs will register a set of events that they wish to expose to
22+
tools that can read trace_events (such as ftrace and perf). The registration
23+
process gives back two ints to the program for each event. The first int is the
24+
status index. This index describes which byte in the
25+
/sys/kernel/debug/tracing/user_events_status file represents this event. The
26+
second int is the write index. This index identifies the event when a write() or
27+
writev() is called on the /sys/kernel/debug/tracing/user_events_data file.
28+
29+
The structures referenced in this document are contained within the
30+
/include/uapi/linux/user_events.h file in the source tree.
31+
32+
**NOTE:** *Both user_events_status and user_events_data are under the tracefs
33+
filesystem and may be mounted at different paths than above.*
34+
35+
Registering
36+
-----------
37+
Registering within a user process is done via ioctl() out to the
38+
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
39+
DIAG_IOCSREG.
40+
41+
This command takes a struct user_reg as an argument::
42+
43+
struct user_reg {
44+
u32 size;
45+
u64 name_args;
46+
u32 status_index;
47+
u32 write_index;
48+
};
49+
50+
The struct user_reg requires two inputs. The first is the size of the structure
51+
to ensure forward and backward compatibility. The second is the command string
52+
to issue for registering. Upon success two outputs are set, the status index
53+
and the write index.
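The two inputs described above can be prepared as in the following minimal sketch. The struct user_reg_sketch type and the fill_user_reg() helper are hypothetical names used only for illustration. The field order follows the listing in this document, but padding and packing are assumptions, so real programs should use the struct user_reg from the uapi header rather than a local copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical local mirror of the struct listed above. The exact layout
 * (padding, packing) is an assumption; use the uapi header in real code.
 */
struct user_reg_sketch {
    uint32_t size;         /* input: size of the structure */
    uint64_t name_args;    /* input: pointer to the command string */
    uint32_t status_index; /* output: byte to check in user_events_status */
    uint32_t write_index;  /* output: index to place at the start of writes */
};

/* Fill in the two inputs and zero the two outputs. */
static void fill_user_reg(struct user_reg_sketch *reg, const char *command)
{
    assert(reg != NULL && command != NULL);

    memset(reg, 0, sizeof(*reg));
    reg->size = sizeof(*reg);
    /* The pointer to the command string is carried in a 64-bit field. */
    reg->name_args = (uint64_t)(uintptr_t)command;
}
```

With the real structure from the header, the filled-in value would then be passed as the argument of ioctl() with DIAG_IOCSREG on the user_events_data fd, and status_index and write_index would be read back on success.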
54+
55+
User based events show up under tracefs like any other event under the
56+
subsystem named "user_events". This means tools that wish to attach to the
57+
events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable
58+
or perf record -e user_events:[name] when attaching/recording.
59+
60+
**NOTE:** *The write_index returned is only valid for the FD that was used to register the event.*
61+
62+
Command Format
63+
^^^^^^^^^^^^^^
64+
The command string format is as follows::
65+
66+
name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]
67+
68+
Supported Flags
69+
^^^^^^^^^^^^^^^
70+
**BPF_ITER** - eBPF programs attached to this event will get the raw iovec
71+
struct instead of any data copies for max performance.
72+
73+
Field Format
74+
^^^^^^^^^^^^
75+
::
76+
77+
type name [size]
78+
79+
Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
80+
User programs are encouraged to use clearly sized types like u32.
81+
82+
**NOTE:** *Long is not supported since size can vary between user and kernel.*
83+
84+
The size is only valid for types that start with a struct prefix.
85+
This allows user programs to describe custom structs out to tools, if required.
86+
87+
For example, a struct in C that looks like this::
88+
89+
struct mytype {
90+
char data[20];
91+
};
92+
93+
Would be represented by the following field::
94+
95+
struct mytype myname 20
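As an illustration of the command and field formats above, the following sketch assembles a command string from a name, an optional flag list and a list of fields. The build_command() helper is hypothetical and not part of any kernel interface. It joins fields with a bare ";" as in the format line, and does not validate the types themselves.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "name[:flags] [field1[;field2...]]" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int build_command(char *buf, size_t len, const char *name,
                         const char *flags,
                         const char *const *fields, size_t nfields)
{
    size_t used;
    int n;

    assert(buf != NULL && name != NULL);

    n = (flags != NULL && *flags != '\0')
            ? snprintf(buf, len, "%s:%s", name, flags)
            : snprintf(buf, len, "%s", name);
    if (n < 0 || (size_t)n >= len)
        return -1;
    used = (size_t)n;

    for (size_t i = 0; i < nfields; i++) {
        /* A space precedes the first field, ';' separates later ones. */
        n = snprintf(buf + used, len - used, "%s%s",
                     i == 0 ? " " : ";", fields[i]);
        if (n < 0 || (size_t)n >= len - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

For instance, the name "test" with the fields "u32 count" and "struct mytype myname 20" yields the string "test u32 count;struct mytype myname 20".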
96+
97+
Deleting
98+
--------
99+
Deleting an event from within a user process is done via ioctl() out to the
100+
/sys/kernel/debug/tracing/user_events_data file. The command to issue is
101+
DIAG_IOCSDEL.
102+
103+
This command only requires a single string specifying the event to delete by
104+
its name. Delete will only succeed if there are no references left to the
105+
event (in both user and kernel space). Because of this, user programs should
106+
request deletes through a separate file from the one used for registration.
107+
108+
Status
109+
------
110+
When tools attach/record user based events the status of the event is updated
111+
in realtime. This allows user programs to only incur the cost of the write() or
112+
writev() calls when something is actively attached to the event.
113+
114+
User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to
115+
check the status for each event that is registered. The byte to check in the
116+
file is given back after the register ioctl() via user_reg.status_index.
117+
Currently the size of user_events_status is a single page, however, custom
118+
kernel configurations can change this size to allow more user based events. In
119+
all cases the size of the file is a multiple of a page size.
120+
121+
For example, if the register ioctl() gives back a status_index of 3 you would
122+
check byte 3 of the returned mmap data to see if anything is attached to that
123+
event.
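The status check described above might be sketched as follows. Both helper names are hypothetical. The sketch assumes a single page is mapped, which matches the default size mentioned above but not custom kernel configurations, and it assumes status_index has already been validated against the mapped size.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the first page of an already opened user_events_status file
 * read-only. Returns NULL on failure.
 */
static const unsigned char *map_status_page(int status_fd)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *p;

    if (page_size <= 0)
        return NULL;

    p = mmap(NULL, (size_t)page_size, PROT_READ, MAP_SHARED, status_fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Non-zero when anything is attached to the event at status_index. */
static int event_is_enabled(const unsigned char *status_page,
                            unsigned int status_index)
{
    assert(status_page != NULL);
    return status_page[status_index] != 0;
}
```

A program would typically map the page once after opening the status file and then call event_is_enabled() before each write, so that the write is skipped when nothing is attached.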
124+
125+
Administrators can easily check the status of all registered events by reading
126+
the user_events_status file directly via a terminal. The output is as follows::
127+
128+
Byte:Name [# Comments]
129+
...
130+
131+
Active: ActiveCount
132+
Busy: BusyCount
133+
Max: MaxCount
134+
135+
For example, on a system that has a single event the output looks like this::
136+
137+
1:test
138+
139+
Active: 1
140+
Busy: 0
141+
Max: 4096
142+
143+
If a user enables the user event via ftrace, the output would change to this::
144+
145+
1:test # Used by ftrace
146+
147+
Active: 1
148+
Busy: 1
149+
Max: 4096
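A tool that reads this text output could pick out the per-event lines as in the sketch below. The struct status_entry type and parse_status_line() helper are hypothetical, the 63 character name limit is an arbitrary choice for the example, and the presence of a "#" is treated as "a comment is present" without interpreting its text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of one "Byte:Name [# Comments]" line. */
struct status_entry {
    unsigned int index; /* byte index within the status page */
    char name[64];      /* event name */
    int has_comment;    /* non-zero if a "# ..." comment follows */
};

/*
 * Parse one line of the status output. Returns 1 for an event line,
 * 0 for anything else (blank lines, "Active:", "Busy:", "Max:").
 */
static int parse_status_line(const char *line, struct status_entry *out)
{
    assert(line != NULL && out != NULL);

    if (sscanf(line, "%u:%63s", &out->index, out->name) != 2)
        return 0;

    out->has_comment = strchr(line, '#') != NULL;
    return 1;
}
```

Summary lines such as "Active: 1" do not start with a number, so they are rejected and can be handled separately.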
150+
151+
**NOTE:** *A status index of 0 will never be returned. This allows user
152+
programs to have an index that can be used on error cases.*
153+
154+
Status Bits
155+
^^^^^^^^^^^
156+
The byte being checked will be non-zero if anything is attached. Programs can
157+
check specific bits in the byte to see what mechanism has been attached.
158+
159+
The following values are defined to aid in checking what has been attached:
160+
161+
**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0).
162+
163+
**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1).
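The bit checks above can be sketched as two small predicates. The SKETCH_ constants are local stand-ins derived from the bit positions stated above. Real programs should use EVENT_STATUS_FTRACE and EVENT_STATUS_PERF from the header instead, and the function names are hypothetical.

```c
/* Local stand-ins for the header values, from the bit positions above. */
#define SKETCH_EVENT_STATUS_FTRACE (1u << 0)
#define SKETCH_EVENT_STATUS_PERF   (1u << 1)

/* Non-zero if ftrace is attached according to the status byte. */
static int ftrace_attached(unsigned char status)
{
    return (status & SKETCH_EVENT_STATUS_FTRACE) != 0;
}

/* Non-zero if perf/eBPF is attached according to the status byte. */
static int perf_attached(unsigned char status)
{
    return (status & SKETCH_EVENT_STATUS_PERF) != 0;
}
```

A program that only needs to know whether to emit data can keep testing the whole byte for non-zero; the per-bit checks are useful when the behavior should differ by consumer.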
164+
165+
Writing Data
166+
------------
167+
After registering an event the same fd that was used to register can be used
168+
to write an entry for that event. The write_index returned must be at the start
169+
of the data, then the remaining data is treated as the payload of the event.
170+
171+
For example, if write_index returned was 1 and I wanted to write out an int
172+
payload of the event, then the data would have to be 8 bytes (2 ints) in size,
173+
with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
174+
value I want as the payload.
175+
176+
In memory this would look like this::
177+
178+
int index;
179+
int payload;
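The layout above can be built in a flat buffer as in this minimal sketch. The pack_int_event() helper is a hypothetical name. It assumes the index and payload are both plain ints, as in the example, and copies them in native byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Place write_index at the start of buf, followed by the int payload.
 * Returns the number of bytes to write, or 0 if buf is too small.
 */
static size_t pack_int_event(unsigned char *buf, size_t len,
                             int write_index, int payload)
{
    assert(buf != NULL);

    if (len < 2 * sizeof(int))
        return 0;

    memcpy(buf, &write_index, sizeof(int));
    memcpy(buf + sizeof(int), &payload, sizeof(int));
    return 2 * sizeof(int);
}
```

The returned length is what would be passed to write() on the fd that was used to register the event.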
180+
181+
User programs might have well-known structs that they wish to emit
182+
as payloads. In those cases writev() can be used, with the first vector being
183+
the index and the following vector(s) being the actual event payload.
184+
185+
For example, if I have a struct like this::
186+
187+
struct payload {
188+
int src;
189+
int dst;
190+
int flags;
191+
};
192+
193+
It's advised for user programs to do the following::
194+
195+
struct iovec io[2];
196+
struct payload e;
197+
198+
io[0].iov_base = &write_index;
199+
io[0].iov_len = sizeof(write_index);
200+
io[1].iov_base = &e;
201+
io[1].iov_len = sizeof(e);
202+
203+
writev(fd, (const struct iovec*)io, 2);
204+
205+
**NOTE:** *The write_index is not emitted out into the trace being recorded.*
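The writev() pattern above can be wrapped in a small helper, sketched here. The emit_payload() name is hypothetical, and the struct is repeated only so that the example is self-contained. The helper takes any writable fd, so it can be exercised with a pipe before being pointed at the user_events_data fd.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

struct payload {
    int src;
    int dst;
    int flags;
};

/*
 * Emit one event: the write index goes in the first vector and the
 * payload in the second. Returns the writev() result (bytes written,
 * or -1 on error).
 */
static ssize_t emit_payload(int data_fd, int write_index,
                            const struct payload *e)
{
    struct iovec io[2];

    assert(e != NULL);

    io[0].iov_base = &write_index;
    io[0].iov_len = sizeof(write_index);
    io[1].iov_base = (void *)e; /* iov_base is not const-qualified */
    io[1].iov_len = sizeof(*e);

    return writev(data_fd, io, 2);
}
```

Keeping the index in its own vector avoids copying the payload into a temporary buffer just to prepend the index.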
206+
207+
eBPF
208+
----
209+
eBPF programs that attach to a user-based event tracepoint are given a pointer
210+
to a struct user_bpf_context. The bpf context contains the data type (which can
211+
be a user or kernel buffer, or can be a pointer to the iovec) and the data
212+
length that was emitted (minus the write_index).
213+
214+
Example Code
215+
------------
216+
See sample code in samples/user_events.
