Linux Performance Tuning

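I. Viewing the kernel parameter documentation

1. Every Linux kernel parameter is documented. Install the official kernel-doc package, then learn how to look parameters up in it.

Install kernel-doc

[root@student02 ~]# yum -y install kernel-doc

The documentation for all kernel parameters is installed here

[root@student02 ~]# ls /usr/share/doc/kernel-doc-3.10.0/Documentation/

For example, look up vm.drop_caches: line 182 of sysctl/vm.txt shows that echo 3 drops the caches held in memory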

[root@student02 ~]# grep -irn --color=auto vm.drop_caches /usr/share/doc/kernel-doc-3.10.0/Documentation/
/usr/share/doc/kernel-doc-3.10.0/Documentation/sysctl/vm.txt:178:   echo 1 > /proc/sys/vm/drop_caches
/usr/share/doc/kernel-doc-3.10.0/Documentation/sysctl/vm.txt:180:   echo 2 > /proc/sys/vm/drop_caches
/usr/share/doc/kernel-doc-3.10.0/Documentation/sysctl/vm.txt:182:   echo 3 > /proc/sys/vm/drop_caches

II. Performance tuning tools

1. tuned

Install tuned and start the service

[root@student02 ~]# yum -y install tuned
[root@student02 ~]# systemctl start tuned
[root@student02 ~]# systemctl enable tuned

List the tuning profiles available for different workloads

[root@student02 ~]# tuned-adm list
Available profiles:
- balanced                    - General non-specialized tuned profile
- desktop                     - Optmize for the desktop use-case
- latency-performance         - Optimize for deterministic performance at the cost of increased power consumption
- network-latency             - Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance
- network-throughput          - Optimize for streaming network throughput.  Generally only necessary on older CPUs or 40G+ networks.
- powersave                   - Optimize for low power consumption
- throughput-performance      - Broadly applicable tuning that provides excellent performance across a variety of common server workloads.  This is the default profile     for RHEL7.
- virtual-guest               - Optimize for running inside a virtual guest.
- virtual-host                - Optimize for running KVM guests
Current active profile: virtual-guest

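Show the currently active profile

[root@student02 ~]# tuned-adm active
Current active profile: virtual-guest

balanced: general-purpose, non-specialized profile

desktop: optimized for desktop use

latency-performance: deterministic performance at the cost of higher power consumption

network-latency: deterministic performance at the cost of higher power consumption, with a focus on low-latency networking

network-throughput: optimized for streaming network throughput; generally only needed on older CPUs or 40G+ networks

powersave: optimized for low power consumption

throughput-performance: broadly applicable tuning that performs well across common server workloads; this is the default profile on RHEL 7

virtual-guest: optimized for running inside a virtual machine

virtual-host: optimized for hosts that run KVM guests

Switch to the high-throughput profile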

[root@student02 ~]# tuned-adm profile throughput-performance 
[root@student02 ~]# tuned-adm active 
Current active profile: throughput-performance

The tuned profiles live in the directory below; you can also add a directory of your own and adapt a profile to your needs

[root@student02 ~]# ls /usr/lib/tuned/
balanced  desktop  functions  latency-performance  network-latency  network-throughput  powersave  recommend.conf  throughput-performance  virtual-guest  virtual-host

2. limits

The limits configuration file

[root@student02 ~]# vim /etc/security/limits.conf

Hard limit: the user may open at most 2048 files

[root@student02 ~]# vim /etc/security/limits.conf
student          hard    nofile          2048
[student@student02 ~]$ ulimit -n
2048

Hard limit: the user may use at most 100 MB of virtual memory (the as value is in KB, so 102400 KB = 100 MB)

[root@student02 ~]# vim /etc/security/limits.conf
student          hard    as              102400
[I have no name!@student02 ~]$ ulimit -v
102400

Show all limits that apply to the current shell

[root@student02 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 3821
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 3821
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

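Soft limit: the user may use at most 100 MB of virtual memory; the user can change a soft limit within the range allowed by the hard limit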

[root@student02 ~]# vim /etc/security/limits.conf
student          soft    as              102400
[root@student02 ~]# ulimit -v
102400
[root@student02 ~]# ulimit -v 20480
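
The relationship between soft and hard limits can be checked from any shell. The sketch below is only an illustration: the function names are made up for this example, 256 is an arbitrary value assumed to be below the hard limit, and the subshell keeps the change from affecting the calling shell.

```shell
# Print the soft and hard open-file limits of the current shell.
show_nofile_limits() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -n)" "$(ulimit -H -n)"
}

# Lower the soft limit inside a subshell and print the new value.
# Lowering a soft limit needs no privileges; raising it only works
# up to the hard limit.
# 256 is an arbitrary example value, assumed to be below the hard limit.
lowered_soft_limit() {
    ( ulimit -S -n 256 && ulimit -S -n )
}

show_nofile_limits
lowered_soft_limit
```

A limit changed this way applies only to that shell and its children; entries in /etc/security/limits.conf apply at the next login.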

3. cgroup

Install the cgroup tools and start the services

[root@student02 ~]# yum -y install libcgroup-tools
[root@student02 ~]# systemctl start cgconfig
[root@student02 ~]# systemctl enable cgconfig
[root@student02 ~]# systemctl start cgred
[root@student02 ~]# systemctl enable cgred

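Once the services are running, a cgroup tmpfs is mounted at /sys/fs/cgroup. It contains one directory per resource controller, and each directory holds many limit settings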

[root@student02 ~]# df -h |grep cgroup
tmpfs       489M     0  489M   0% /sys/fs/cgroup
[root@student02 cgroup]# ls
blkio  cpuacct      cpuset   freezer  memory   net_cls,net_prio  perf_event  systemd    cpu    cpu,cpuacct  devices  hugetlb  net_cls  net_prio  pids

Look up the parameter names, then create a group named test in cgconfig.conf that limits disk read throughput to 1 MB/s and memory to 256 MB

[root@student02 ~]# ll /dev/sda
brw-rw---- 1 root disk 8, 0 Jun  1 20:43 /dev/sda
[root@student02 ~]# ls /sys/fs/cgroup/blkio/test/ |grep read_bps
blkio.throttle.read_bps_device
[root@student02 ~]# ls /sys/fs/cgroup/memory/ |grep memory.limit
memory.limit_in_bytes
[root@student02 ~]# vim /etc/cgconfig.conf
group test {
blkio {
blkio.throttle.read_bps_device = "8:0 1048576";
}
memory {
memory.limit_in_bytes = "256M";
}
}
[root@student02 ~]# systemctl restart cgconfig
[root@student02 ~]# cat /sys/fs/cgroup/blkio/test/blkio.throttle.read_bps_device 
8:0 1048576
[root@student02 ~]# cat /sys/fs/cgroup/memory/test/memory.limit_in_bytes 
268435456
[root@student02 ~]# systemctl restart cgconfig
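
The values read back from the cgroup files are plain byte counts, which is why "256M" shows up as 268435456. A minimal sketch of the conversion, with a helper name invented for this example:

```shell
# mib_to_bytes N: convert N mebibytes to bytes, the unit the cgroup files report.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

mib_to_bytes 1      # the value used for blkio.throttle.read_bps_device
mib_to_bytes 256    # the value reported by memory.limit_in_bytes
```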

In cgrules.conf, add a rule so that the blkio and memory controllers for user student use the limits of the test group

[root@student02 ~]# vim /etc/cgrules.conf
student         blkio,memory    test/
[root@student02 ~]# systemctl restart cgred

A rule can also be restricted to a specific command

[root@student02 ~]# vim /etc/cgrules.conf
student:cp      blkio,memory    test/
[root@student02 ~]# systemctl restart cgred

III. System status monitoring tools

1. iostat: monitor the I/O load of system devices

Without arguments, iostat reports the averages since boot

[root@student02 ~]# iostat
avg-cpu:  %user   %nice  %system %iowait  %steal   %idle
0.26    0.03    0.42    0.28    0.00   99.01

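%user: percentage of CPU time spent running at the user (application) level.

%nice: percentage of CPU time spent running user-level processes with an adjusted nice priority.

%system: percentage of CPU time spent running at the kernel level.

%iowait: percentage of CPU time spent idle while waiting for outstanding I/O.

%steal: percentage of time the virtual CPU waited while the hypervisor was servicing another virtual processor.

%idle: percentage of CPU time spent idle.

A high %iowait suggests that the disks are an I/O bottleneck

If %idle is high but the system responds slowly, the CPU may be waiting on memory allocation; adding memory may help

If %idle stays below 1, the CPU is underpowered for the workload and is the resource that most needs attention.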

[root@student02 ~]# iostat 1 # report every second until interrupted
[root@student02 ~]# iostat 1 2 # report every second, twice
[root@student02 ~]# iostat 1 2 sda # report only the sda disk
[root@student02 ~]# dd if=/dev/zero of=/dev/null & # run dd in the background
[root@student02 ~]# iostat 1 # the kernel (%system) takes most of the CPU
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
22.77    0.00   77.23    0.00    0.00    0.00
[root@student02 ~]# dd if=/dev/urandom of=/dev/null & # read random data
[root@student02 ~]# iostat 1 # the CPU is now fully consumed by the kernel
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.00    0.00  100.00    0.00    0.00    0.00
[root@student02 ~]# dd if=/dev/zero of=/tmp/test bs=1M count=1000 & # write a file to disk
[root@student02 ~]# iostat 1 # %system and %iowait account for most of the CPU time
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.00    0.00   54.55   36.36    0.00    9.09
Device:     tps     kB_read/s    kB_wrtn/s    kB_read    kB_wrtn # the kB_* columns are in kilobytes
sda         554.55     10909.09    196763.64       1200      21644

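Device: the disk device

tps: I/O requests (transfers) issued to the device per second

kB_read/s: kilobytes read per second

kB_wrtn/s: kilobytes written per second

kB_read: total kilobytes read

kB_wrtn: total kilobytes written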

[root@student02 ~]# echo "(10909+196763)/1024" |bc
202 # total I/O throughput is roughly 202 MB/s (the kB_* columns are already in kilobytes)
[root@student02 ~]# iostat -x # show extended statistics
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util

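Dividing kB_wrtn/s by tps gives the average size of each I/O request. The I/O size can guide the choice of RAID level: RAID 5 suits large sequential I/O, RAID 10 suits small random I/O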
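
The division can be done with awk, since the shell itself only handles integers. This is a rough sketch with an invented helper name; note that tps counts reads and writes together, so on a mixed workload the sum of kB_read/s and kB_wrtn/s is the more accurate numerator.

```shell
# avg_io_kb KB_PER_SEC TPS: average request size in kB, one decimal place.
avg_io_kb() {
    awk -v kb="$1" -v tps="$2" 'BEGIN {
        if (tps == 0) { print "0.0"; exit }   # idle device, avoid dividing by zero
        printf "%.1f\n", kb / tps
    }'
}

# Values taken from the sda line of the iostat output above.
avg_io_kb 196763.64 554.55
```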

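2. sar: report on system activity

Options:

-A: all reports combined

-u: CPU utilization statistics

-v: inode, file and other kernel table statistics

-d: activity for each block device

-r: memory and swap space statistics

-b: I/O and transfer rate statistics

-a: file read/write activity

-c: process statistics, i.e. processes created per second

-R: memory page statistics

-y: terminal (TTY) device activity

-w: system swapping activity

Show CPU load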

[root@student02 ~]# sar -u 1 10
04:56:39 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
04:56:40 PM     all      0.00      0.00      0.00      0.00      0.00    100.00

The columns have the same meaning as the avg-cpu columns described for iostat above, and the same rules of thumb for %iowait and %idle apply.

Display timestamps in 24-hour format

[root@student02 ~]# alias sar='LANG=C sar'

Memory and swap monitoring

[root@student02 ~]# sar -r 1 10
17:02:29    kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
17:02:30       688816    311124     31.11        12    177876    221100      7.14    110672    103680         4

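kbmemfree: roughly the same as the free value of the free command, so it excludes buffer and cache space.

kbmemused: total memory minus kbmemfree, so it includes buffer and cache space.

%memused: kbmemused as a percentage of total physical memory (swap excluded).

kbbuffers and kbcached: the buffer and cache values reported by free.

kbcommit: the memory needed for the current workload, i.e. an estimate of the RAM plus swap required to guarantee the system does not run out of memory.

%commit: kbcommit as a percentage of total memory (RAM plus swap).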

[root@student02 ~]# sar -d 1 10 # disk activity: 10 one-second samples plus the average
[root@student02 ~]# sar -u 1 10 # CPU load: 10 one-second samples plus the average
[root@student02 ~]# sar -r 1 10 # memory usage: 10 one-second samples plus the average
[root@student02 ~]# cat /etc/cron.d/sysstat |egrep -v "^#|^$" # sysstat ships a cron job for sar
*/10 * * * * root /usr/lib64/sa/sa1 1 1
53 23 * * * root /usr/lib64/sa/sa2 -A
[root@student02 ~]# ls /var/log/sa/ # the cron job writes one data file per day
sa01  sa22  sa23  sa24  sa25  sa26  sa27  sa30  sa31  sar22
[root@student02 ~]# sar -f /var/log/sa/sa01 # read the records saved in a data file
[root@student02 ~]# cat /var/log/sa/sar01 # read a daily summary report
[root@student02 ~]# sar -u -s 15:00:00 -e 16:00:00 -f /var/log/sa/sa01 # show the records within a time range

3. vmstat: reports CPU usage, memory usage, swap activity and I/O activity

[root@student02 ~]# vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 835992     12  84764    0    0  8890   160   83   71  0  3 96  1  0

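r: the run queue, i.e. the number of processes running or waiting for CPU time. If this value stays above the number of CPUs, the CPU is a bottleneck; a long run queue means the CPUs are busy and usually goes along with high CPU utilization.

b: the number of blocked processes

swpd: the amount of swap space in use. A value above 0 means pages have been swapped out at some point; physical memory is only short if si and so also stay above 0.

free: the amount of idle physical memory.

buff: memory used as buffers for file lists, permissions and similar metadata.

cache: memory used to cache file contents.

si: the amount of memory swapped in from disk per second.

so: the amount of memory swapped out to disk per second.

bi: blocks received from block devices per second (reads).

bo: blocks sent to block devices per second (writes); writing files makes bo rise above 0.

in: interrupts per second, including the clock interrupt.

cs: context switches per second. System calls and thread switches cause context switches, so lower is better; if the value is very high, consider reducing the number of threads or processes. On web servers this value deserves particular attention.

us: user CPU time

sy: system CPU time

id: idle CPU time

wa: CPU time spent waiting for I/O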
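
The rule of thumb for the r column can be automated. The sketch below uses an invented helper that counts how many vmstat samples show a run queue longer than a given CPU count; it reads vmstat output on stdin so it can be tried on saved output as well.

```shell
# count_saturated NCPU: read vmstat output on stdin and print how many
# samples have a run queue (column r) larger than NCPU.
count_saturated() {
    awk -v ncpu="$1" '
        $1 ~ /^[0-9]+$/ && $1 > ncpu { n++ }   # the two header lines are not numeric
        END { print n + 0 }
    '
}

# Live use (commented out because it samples for 5 seconds):
# vmstat 1 5 | count_saturated "$(nproc)"
```

Note that the first vmstat sample reports averages since boot, so a single hit on the first line says little about the current load.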

4. free

free is an entry-level Linux command that gives a summary view, as follows:

[root@student02 ~]# free -m
total        used        free      shared  buff/cache   available
Mem:            976         132         600           6         242         626
Swap:          2047           0        2047

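In the output above, total memory is about 1 GB (976 MB is reported, because hardware and firmware reserve a little at boot). 132 MB is used, 600 MB is free and 242 MB is held as buff/cache. Buffers and cache exist to speed up I/O and are given back when applications need memory, so the figure to go by is the available column (626 MB here) rather than free. The details are beyond the scope of this article. The cached memory can be released with the following command:

[root@student02 ~]# sysctl -w vm.drop_caches=3
vm.drop_caches = 3

Doing this on a system that is serving production workloads is strongly discouraged. Dropping the caches does not discard dirty data, but subsequent reads have to go back to disk, so performance suffers until the cache warms up again.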

5. top

So memory is in use, but who is using it? The powerful top command can tell us.

[root@student02 ~]# top
top - 11:54:31 up  2:26,  1 user,  load average: 0.00, 0.01, 0.05
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :   999940 total,   824676 free,    83216 used,    92048 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used.   774152 avail Mem 

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND             
879 root      20   0  553152  16424   5784 S  0.0  1.6   0:02.76 tuned               
647 polkitd   20   0  527516  11984   4580 S  0.0  1.2   0:00.18 polkitd             
671 root      20   0  437488   7848   6068 S  0.0  0.8   0:00.19 NetworkManager

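Press uppercase M in top to sort by memory usage. In the output above the largest consumer is the tuned process, at 1.6% of total memory. The header of top also gives the same kind of memory summary as free. RES is the column to watch: it is the physical memory the process actually occupies, and it usually tells us where the memory went. In production there is another common case: every process shows a small RES in top, yet tens of GB of memory have disappeared. How do we track that down? See below.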

6. /proc/meminfo

/proc/meminfo also gives a memory summary, but in more detail than free -m:

[root@361way ~]# cat /proc/meminfo
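MemTotal:        1019644 kB    total usable RAM (physical memory minus reserved bits and the kernel binary code)
MemFree:          119464 kB    sum of LowFree and HighFree: memory the system has left unused
Buffers:          110680 kB    memory used for file buffers
Cached:           256796 kB    memory used by the page cache (files read from disk); does not include SwapCached
SwapCached:            0 kB
Active:           680560 kB    memory used recently; not reclaimed unless absolutely necessary
Inactive:         145228 kB    memory used less recently; eligible to be reclaimed for other uses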
Active(anon):     458208 kB
Inactive(anon):      280 kB
Active(file):     222352 kB
Inactive(file):   144948 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
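Dirty:                92 kB   dirty pages: memory waiting to be written back to disk
Writeback:             0 kB   memory currently being written back to disk
AnonPages:        458328 kB   anonymous (not file-backed) pages mapped into user-space page tables
Mapped:            27072 kB   devices and files that have been mapped into memory
Shmem:               176 kB
Slab:              53564 kB   in-kernel data structure cache, which reduces the cost of allocating and freeing memory
SReclaimable:      32404 kB   the part of Slab that can be reclaimed
SUnreclaim:        21160 kB   the part of Slab that cannot be reclaimed (SUnreclaim + SReclaimable = Slab)
KernelStack:        1136 kB   memory used by kernel stacks
PageTables:         5856 kB   memory used by the page tables that index memory pages
NFS_Unstable:          0 kB   NFS pages sent to the server but not yet committed to stable storage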
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      509820 kB
Committed_AS:    1483868 kB
VmallocTotal:   34359738367 kB  total size of the vmalloc virtual memory area
VmallocUsed:        7472 kB     amount of the vmalloc area in use
VmallocChunk:   34359728764 kB
HardwareCorrupted:     0 kB
AnonHugePages:    198656 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7168 kB
DirectMap2M:     1044480 kB

7. Cache hit rate (cache-hit)

Install valgrind

[root@student02 ~]# yum -y install valgrind

Show the cache misses of a command or script

[root@student02 ~]# valgrind --tool=cachegrind ls
......
==2005== D1  miss rate:     1.5% (    2.0%     +     0.8%  )
==2005== LLd miss rate:     1.0% (    1.2%     +     0.7%  )
......

8. strace: trace the system calls and signals of a process

Trace a file copy and summarize its system calls

[root@student02 ~]# strace -fc cp /etc/passwd /tmp/
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
25.08    0.000234           9        25           mmap
19.40    0.000181          12        15           close
17.58    0.000164          10        16           mprotect
13.08    0.000122          10        12           open
11.25    0.000105          10        11           read
3.22    0.000030           3        12           fstat
3.00    0.000028          14         2         1 access
2.68    0.000025          13         2           munmap
2.36    0.000022          22         1           execve
0.54    0.000005           3         2           rt_sigaction
0.32    0.000003           3         1           rt_sigprocmask
0.32    0.000003           3         1           getrlimit
0.32    0.000003           3         1           arch_prctl
0.32    0.000003           3         1           set_tid_address
0.32    0.000003           3         1           set_robust_list
0.21    0.000002           1         3           brk
0.00    0.000000           0         1           write
0.00    0.000000           0         4         2 stat
0.00    0.000000           0         1         1 lseek
0.00    0.000000           0         1           geteuid
0.00    0.000000           0         2         2 statfs
0.00    0.000000           0         1           fadvise64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000933                   116         6 total

Show exactly which files a command opens

[root@student02 ~]# strace -e open cp /etc/passwd /tmp/
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/proc/filesystems", O_RDONLY)     = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
open("/etc/passwd", O_RDONLY)           = 3
open("/tmp/passwd", O_WRONLY|O_TRUNC)   = 4
+++ exited with 0 +++

9. ltrace: trace the library calls made by a process

Summarize the system calls and library calls made by a command

[root@student02 ~]# ltrace -Sfc cp /etc/passwd /tmp/
% time     seconds  usecs/call     calls      function
------ ----------- ----------- --------- --------------------
42.06    0.013412       13412         1 __libc_start_main
15.75    0.005023        5023         1 exit
3.95    0.001261        1261         1 exit_group
3.22    0.001027         342         3 __xstat
2.97    0.000946         946         1 __errno_location
2.82    0.000900          52        17 close
2.81    0.000897         224         4 free
2.51    0.000801         160         5 __freading
2.40    0.000766         255         3 fclose
1.93    0.000614         153         4 strlen
1.80    0.000575         143         4 fileno
1.55    0.000494          35        14 open
1.33    0.000424         141         3 brk
1.14    0.000362         181         2 fflush
1.10    0.000350          26        13 read
1.07    0.000340         340         1 posix_fadvise
1.03    0.000327         109         3 malloc
......
------ ----------- ----------- --------- --------------------
100.00    0.031885                   176 total

IV. File system tuning

1. tune2fs

Show file system information

[root@student02 ~]# tune2fs -l /dev/sda3 
tune2fs 1.42.9 (28-Dec-2013)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          b1b4ea64-5a17-421c-b022-3ac84c14a1e6
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536
Block count:              262144
Reserved block count:     13107
Free blocks:              249189
Free inodes:              65525
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      127
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Fri Jun  2 17:01:20 2017
Last mount time:          Fri Jun  2 17:01:25 2017
Last write time:          Fri Jun  2 17:01:25 2017
Mount count:              1
Maximum mount count:      -1
Last checked:             Fri Jun  2 17:01:20 2017
Check interval:           0 (<none>)
Lifetime writes:          33 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      448602f1-b2c4-4522-a428-3746969036d6
Journal backup:           inode blocks

df shows that part of the partition (976M - 2.6M - 907M = 66.4M) is unaccounted for

[root@student02 ~]# df -h |grep sda3
/dev/sda3            976M  2.6M  907M   1% /mnt
[root@student02 ~]# tune2fs -l /dev/sda3 |grep "Reserved block count"
Reserved block count:     26214

Reserved block count: by default 5% of the blocks in each file system are reserved

Change the reserved ratio to 10%

[root@student02 ~]# tune2fs -m 10 /dev/sda3
Setting reserved blocks percentage to 10% (26214 blocks)
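
The reserved space in megabytes follows from the reserved block count and the block size (4096 bytes on this file system). A small sketch with an invented helper name; the result will not match the df gap exactly, because df rounds its figures.

```shell
# reserved_mib RESERVED_BLOCKS BLOCK_SIZE: reserved space in whole MiB.
reserved_mib() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

reserved_mib 13107 4096   # 5% of 262144 blocks
reserved_mib 26214 4096   # 10% of 262144 blocks
```

On a large data-only volume the default 5% can amount to many gigabytes, which is the usual reason for lowering the ratio.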

Change the UUID of the file system

[root@student02 ~]# blkid |grep sda3
/dev/sda3: UUID="b1b4ea64-5a17-421c-b022-3ac84c14a1e7" TYPE="ext4" 
[root@student02 ~]# tune2fs -U b1b4ea64-5a17-421c-b022-3ac84c14a1e8 /dev/sda3
[root@student02 ~]# blkid |grep sda3
/dev/sda3: UUID="b1b4ea64-5a17-421c-b022-3ac84c14a1e8" TYPE="ext4"

2. dumpe2fs: show the file system structure

[root@student02 ~]# dumpe2fs /dev/sda3 
dumpe2fs 1.42.9 (28-Dec-2013)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          b1b4ea64-5a17-421c-b022-3ac84c14a1e6
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536
Block count:              262144
Reserved block count:     13107
Free blocks:              249189
Free inodes:              65525
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      127
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Fri Jun  2 17:01:20 2017
Last mount time:          Fri Jun  2 17:01:25 2017
Last write time:          Fri Jun  2 17:01:25 2017
Mount count:              1
Maximum mount count:      -1
Last checked:             Fri Jun  2 17:01:20 2017
Check interval:           0 (<none>)
Lifetime writes:          33 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      448602f1-b2c4-4522-a428-3746969036d6
Journal backup:           inode blocks
Journal features:         journal_64bit
Journal size:             32M
Journal length:           8192
Journal sequence:         0x00000002
Journal start:            1


Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
Checksum 0x5492, unused inodes 8181
Primary superblock at 0, Group descriptors at 1-1
Reserved GDT blocks at 2-128
Block bitmap at 129 (+129), Inode bitmap at 145 (+145)
Inode table at 161-672 (+161)
28521 free blocks, 8181 free inodes, 2 directories, 8181 unused inodes
Free blocks: 142-144, 153-160, 4258-32767
Free inodes: 12-8192
Group 1: (Blocks 32768-65535) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0xb3fe, unused inodes 8192
Backup superblock at 32768, Group descriptors at 32769-32769
Reserved GDT blocks at 32770-32896
Block bitmap at 130 (bg #0 + 130), Inode bitmap at 146 (bg #0 + 146)
Inode table at 673-1184 (bg #0 + 673)
32639 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks: 32897-65535
Free inodes: 8193-16384
......

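Journal size: the size of the journal area

Group x: the file system divides its blocks into groups of 32768 blocks each. Groups are filled evenly rather than one after another, so each keeps some free space. When an existing file grows, the new blocks can be allocated in the group the file already occupies; the file stays largely contiguous instead of spilling into distant blocks, which improves read performance.

superblock: the primary superblock is in group 0; backup copies are kept in groups 1, 3, 5, 7 and 9

Group descriptors: the group descriptors are stored right after the superblock, for example in block 1 of group 0

Reserved GDT blocks: 127 blocks are reserved after the group descriptors so that the descriptor table can grow when the file system is resized

Block bitmap: the bitmap that records which blocks are in use

Inode bitmap: the bitmap that records which inodes are in use

Inode table: the table of inodes

Free blocks: unused blocks

Free inodes: unused inodes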

3. Repairing a file system

Destroy the superblock and the reserved GDT blocks; mounting now fails

[root@student02 ~]# dd if=/dev/zero of=/dev/sda3 bs=1K count=516
[root@student02 ~]# mount /dev/sda3 /mnt/
mount: /dev/sda3 is write-protected, mounting read-only
mount: unknown filesystem type '(null)'

Repair the file system with fsck

[root@student02 ~]# fsck -v /dev/sda3

If fsck cannot repair it, run e2fsck with an explicit backup superblock

[root@student02 ~]# e2fsck -b 98304 /dev/sda3
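
The block number passed to e2fsck -b is the first block of a group that holds a backup superblock. Assuming 4K blocks and a first block of 0, as in the dumpe2fs output above, the candidates can be computed as follows; the helper name is invented for this example.

```shell
# backup_sb_block GROUP BLOCKS_PER_GROUP: first block of a group, which is
# where the backup superblock sits when the file system starts at block 0.
backup_sb_block() {
    echo $(( $1 * $2 ))
}

# This file system has 262144 blocks, i.e. groups 0 to 7,
# so only groups 1, 3, 5 and 7 carry a backup here.
for group in 1 3 5 7; do
    backup_sb_block "$group" 32768
done
```

On an intact file system the same locations can be read from the "Backup superblock at" lines of dumpe2fs.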

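4. Non-journaling and journaling file systems

Non-journaling file systems: fat32, ext2

There is no journal area. A file is written as follows:

file index written to the inode --> file contents written to blocks --> write complete

In a non-journaling file system, the kernel performs a write by first modifying the relevant metadata and then modifying the data blocks. If the system crashes or fails while the metadata is being written, the file system is left inconsistent. fsck can check all metadata and restore consistency at the next boot, but on a very large file system, or on a system running critical services that cannot tolerate downtime, relying on a non-journaling file system is very risky.

Journaling file systems: ntfs, ext3, ext4, xfs, jfs

There is a journal area. A file is written as follows:

file index written to the journal --> file contents written to blocks --> file index written to the inode --> write complete

A journaling file system differs in that, before changing the file system, it first writes the change to the journal area, then writes it to the file system, and finally removes the journal entry. The journal can live inside the file system or outside it. The journal holds the modified metadata and may also hold the data about to be modified. Journaling adds some write overhead.

Show the journal size of a file system

[root@student02 ~]# dumpe2fs /dev/sda3 |grep "Journal size"
Journal size:             32M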

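5. Internal and external journals

Internal journal: the journal lives inside the file system itself, so index information is written twice to the same file system

External journal: a separate device holds the journal, which reduces accesses to the data device and shortens service time

Create an external journal

[root@student02 ~]# umount /dev/sda3 # unmount the file system first
[root@student02 ~]# cat /proc/partitions |grep sda4 # a new 128M partition has been created
[root@student02 ~]# tune2fs -O ^has_journal /dev/sda3 # remove the internal journal from the file system
[root@student02 ~]# mke2fs -O journal_dev -b 4096 /dev/sda4 # format the partition as a journal device
[root@student02 ~]# tune2fs -j -J device=/dev/sda4 /dev/sda3 # use sda4 as the external journal of sda3
Creating journal on device /dev/sda4: done

Do not update atime: this reduces disk writes for workloads that read files frequently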

[root@student02 ~]# mount -o remount,noatime /home/
[root@student02 ~]# mount |grep /home
/dev/mapper/cl-home on /home type xfs (rw,noatime,attr2,inode64,noquota)

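6. cache

Data written to a disk first lands in the disk's request queue, where I/O requests are merged and sorted (the elevator algorithm) before being written to the disk

Change the length of the disk request queue: a longer queue lets more I/O be merged, at the cost of more memory.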

[root@student02 ~]# cat /sys/block/sda/queue/nr_requests 
128
[root@student02 ~]# echo 256 > /sys/block/sda/queue/nr_requests

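7. Read policy

What read-ahead does:

When handling an I/O request, read-ahead fetches more data from the disk than was asked for, in sequential order, and keeps it in the cache so that the next sequential read can be served directly from the cache, which gives better performance.
When reads are highly random, a poorly chosen read-ahead policy adds overhead to the storage system: it cannot guarantee that later I/O will hit the cache, and it lowers performance as well.

Four read-ahead algorithms: fixed, variable, intelligent and none.

Fixed read-ahead: prefetch a fixed amount of data.
Variable read-ahead: prefetch a fixed number of I/Os.
Intelligent read-ahead: prefetch when I/O is sequential and not when it is random; this is the default algorithm.
No read-ahead: no data is prefetched.

Default read-ahead size

[root@student02 ~]# cat /sys/block/sda/queue/read_ahead_kb
4096

Query the read-ahead size in sectors

[root@student02 ~]# blockdev --getra /dev/sda3
8192

Change the read-ahead size in sectors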

[root@student02 ~]# blockdev --setra 4096 /dev/sda3 
[root@student02 ~]# cat /sys/block/sda/queue/read_ahead_kb 
2048
[root@student02 ~]# blockdev --getra /dev/sda3 
4096

To keep the setting across reboots, add it to a boot script

[root@student02 ~]# echo "blockdev --setra 4096 /dev/sda3" >>/etc/rc.local
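
The two interfaces report the same setting in different units: blockdev counts 512-byte sectors, read_ahead_kb counts kilobytes. A minimal sketch of the conversion, with a helper name chosen for this example:

```shell
# ra_sectors_to_kb SECTORS: convert a blockdev read-ahead value
# (512-byte sectors) to the unit used by read_ahead_kb.
ra_sectors_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

ra_sectors_to_kb 8192   # the default shown above
ra_sectors_to_kb 4096   # the value set with --setra
```

This explains why setting 4096 sectors with blockdev makes read_ahead_kb show 2048.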

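8. scheduler: the disk I/O scheduling algorithm, which can be combined with tuned

Three schedulers:

deadline: every request gets a deadline, which suits small I/O. A read that has waited 500 ms or a write that has waited 5000 ms without completing is queued for priority service, which bounds the waiting time.
noop: a simple queue with no tunable parameters
cfq: completely fair queuing, serving requests in turn; the default scheduler

Show the current scheduler

[root@student02 ~]# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq

Change the scheduler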

[root@student02 ~]# echo deadline > /sys/block/sda/queue/scheduler 
[root@student02 ~]# cat /sys/block/sda/queue/scheduler
noop [deadline] cfq
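
The active scheduler is the name in square brackets. When checking many disks from a script, it helps to extract just that name; the sketch below uses an invented helper and plain sed.

```shell
# active_scheduler [FILE]: print the scheduler shown in square brackets.
# With no FILE the line is read from stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$@"
}

echo "noop [deadline] cfq" | active_scheduler

# On a real system (the path is the one used in the text above):
# active_scheduler /sys/block/sda/queue/scheduler
```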

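Tunable parameters of the deadline scheduler include front_merges, read_expire and write_expire

[root@student02 ~]# ls /sys/block/sda/queue/iosched/
fifo_batch  front_merges  read_expire  write_expire  writes_starved

Look the parameter up in the kernel documentation

[root@student02 kernel-doc-3.10.0]# grep -irn --color=auto ^read_expire .
./Documentation/block/deadline-iosched.txt:17:read_expire (in ms)

Under cfq, ionice can fine-tune the I/O priority of a process

ionice classes

class 1: realtime, with levels 0-7; a higher number means lower priority
class 2: best-effort, with levels 0-7; a higher number means lower priority
class 3: idle

Change the priority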

[root@student02 ~]# pidof vsftpd
1805
[root@student02 ~]# ionice -p 1805 -c 2 -n 0
[root@student02 ~]# ionice -p 1805
best-effort: prio 0

V. Kernel module tuning

1. List the modules loaded in the system and look for the ext4 file system module

[root@student02 ~]# lsmod |grep ^ext4
Module    Size  Used by
ext4                   985347  3

Module: module name
Size: module size
Used by: the use count, followed by the names of the modules that depend on it

2. Show detailed information about a module

[root@student02 ~]# modinfo st
filename:       /lib/modules/3.10.0-514.16.1.el7.x86_64/kernel/drivers/scsi/st.ko
alias:          scsi:t-0x01*
alias:          char-major-9-*
license:        GPL
description:    SCSI tape (st) driver
author:         Kai Makisara
rhelversion:    7.3
srcversion:     5C10DC43BF58E63A59D8660
depends:        
intree:         Y
vermagic:       3.10.0-514.16.1.el7.x86_64 SMP mod_unload modversions 
signer:         CentOS Linux kernel signing key
sig_key:        3F:E1:EB:8B:4F:91:D4:84:CD:55:44:84:54:A0:24:DE:56:34:E1:06
sig_hashalgo:   sha256
parm:           buffer_kbs:Default driver buffer size for fixed block mode (KB; 32)     (int)
parm:           max_sg_segs:Maximum number of scatter/gather segments to use (256) (int)
parm:           try_direct_io:Try direct I/O between user buffer and tape drive (1)     (int)
parm:           debug_flag:Enable DEBUG, same as setting debugging=1 (int)
parm:           try_rdio:Try direct read i/o when possible (int)
parm:           try_wdio:Try direct write i/o when possible (int)

3. Load a module manually

[root@student02 ~]# modprobe st
[root@student02 ~]# lsmod |grep st
st                     54238  0

4. Unload a module manually

[root@student02 ~]# modprobe -r st
[root@student02 ~]# lsmod |grep ^st
[root@student02 ~]#

5. Check the module's fixed_buffer_size, which reflects the buffer_kbs parameter

[root@student02 st]# modinfo st |grep parm |grep buffer_kbs
parm:           buffer_kbs:Default driver buffer size for fixed block mode (KB; 32) (int)
[root@student02 ~]# cd /sys/bus/scsi/drivers/st/
[root@student02 st]# echo "`cat fixed_buffer_size`/1024" |bc
32

6. Change the fixed buffer size of the st module by writing the option to the module's configuration file

[root@student02]# echo "options st buffer_kbs=128" > /etc/modprobe.d/st.conf
[root@student02]# modprobe -r st
[root@student02]# modprobe st
[root@student02 ~]# echo "`cat /sys/bus/scsi/drivers/st/fixed_buffer_size`/1024" |bc
128

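VI. Memory tuning

1. After heavy file access on Linux, the page cache takes up a lot of memory. The cache exists to speed up file reads, but it can be dropped with a command when needed

First check cache usage with free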

[root@student02 ~]# free -m
total        used        free      shared  buff/cache   available
Mem:            976         131          76           6         768         649
Swap:          2047           0        2047

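buff/cache takes up 768M of memory. Before dropping the cache, run sync to write dirty pages back to disk; dirty pages cannot be dropped, so this lets more of the cache be freed

[root@student02 ~]# sync

Now use the kernel parameter vm.drop_caches to drop the cache

[root@student02 ~]# sysctl -w vm.drop_caches=3
vm.drop_caches = 3

Check the cache again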

[root@student02 ~]# free -m
total        used        free      shared  buff/cache   available
Mem:            976         123         762           6          90         713
Swap:          2047           0        2047

buff/cache is now down to 90M and free memory has grown considerably

2. Virtual memory and physical memory

Show the virtual and physical memory used by a process

[root@student02 ~]# ps -aux |grep -E "^nginx|RSS" |grep -v grep
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nginx      3167  0.0  0.1  46104  1992 ?        S    17:01   0:00 nginx: worker process

VSZ: the virtual memory size requested by the process; not all of it is in use right away
RSS: the physical memory actually occupied

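3. PTE and TLB

PTE (Page Table Entry): the lowest level of the page table. It refers directly to a page and holds the mapping from a virtual page to a physical page

TLB: a cache for virtual address translation. Each line holds a block consisting of a single PTE. It caches virtual-to-physical translations, which significantly reduces the time spent looking up physical addresses.

hugepage: large pages; one TLB entry then covers more memory, which reduces TLB misses

Configure hugepages by adding a kernel parameter; 20 pages of 2 MB allocate 40 MB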

[root@student02 ~]# sysctl -a |grep -w vm.nr_hugepages |sed 's/0/20/' >>/etc/sysctl.conf 
[root@student02 ~]# sysctl -p
vm.nr_hugepages = 20
[root@student02 ~]# cat /proc/meminfo |grep -i "^hugepage"
HugePages_Total:      20
HugePages_Free:       20
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Hugepages are disabled by default. Their memory is allocated immediately and stays reserved even when it is not used
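
The amount of memory set aside is simply HugePages_Total multiplied by Hugepagesize. The sketch below computes it from a meminfo-style file; the helper name is an assumption of this example, and the file argument exists only so the function can be tried on saved output.

```shell
# hugepage_reserved_kb [FILE]: HugePages_Total * Hugepagesize, in kB.
# FILE defaults to /proc/meminfo.
hugepage_reserved_kb() {
    awk '$1 == "HugePages_Total:" { n = $2 }
         $1 == "Hugepagesize:"    { size = $2 }
         END { print n * size }' "${1:-/proc/meminfo}"
}

# On a live system:
# hugepage_reserved_kb
```

With the values shown above (20 pages of 2048 kB) the result is 40960 kB, i.e. the 40 MB mentioned in the text.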

4. Page faults

Show the minor and major page faults of a process

[root@student02 ~]# ps axo comm,pid,min_flt,maj_flt |grep -iE "command|nginx"
COMMAND            PID  MINFL  MAJFL
nginx             4551    100      0
nginx             4552    292      0

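When a process needs a page that is not yet mapped for it, a page fault occurs. If the page can be supplied directly from memory it is a minor fault; if it has to be fetched from disk, for example from swap, it is a major fault. A large number of major faults indicates a memory bottleneck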

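5. Memory page states

The four states of a memory page:
free: an idle page that can be used at any time
inactive clean: an inactive page whose contents match the disk, either already written back or read from disk and not modified; it can be reallocated, for example cached pages
inactive dirty: a dirty page whose modified contents have not yet been written to disk, so it cannot be reused; the amount of dirty memory changes continuously, especially while files are being copied
active: a page in use by a process; it cannot be reallocated

Show the amount of dirty memory

[root@student02 ~]# cat /proc/meminfo |grep -w Dirty
Dirty:                 4 kB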

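Dirty page ratio that triggers fast writeback

[root@student02 ~]# sysctl -a |grep dirty_ratio
vm.dirty_ratio = 40

While dirty pages stay below this share of memory (40% in the output above) they are written back to disk at a low rate; above it they are written back at full speed. For large I/O a lower value gets data onto disk sooner; for small I/O a higher value delays writes so they can be merged. Storage arrays usually use a high value because a battery backup unit (BBU) protects them against power loss

Dirty page expiry time

[root@student02 ~]# sysctl -a |grep dirty_exp
vm.dirty_expire_centisecs = 3000

The default is 30 s (3000 centiseconds): a dirty page older than that is written back even if the ratio has not been reached

Interval between dirty page checks: 5 s (500 centiseconds)

[root@student02 ~]# sysctl -a |grep dirty_w
vm.dirty_writeback_centisecs = 500

The sync command also writes dirty pages to disk immediately

[root@student02 ~]# sync

Writing to the file below forces a sync as well. It accepts many other commands for simulating various scenarios; see the kernel documentation

[root@student02 ~]# echo s >/proc/sysrq-trigger

vm.swappiness decides whether, under memory pressure, the kernel prefers to drop cached pages or to swap. It ranges from 0 to 100: values near 0 favor dropping cache and avoid swapping, values near 100 make the kernel swap readily and keep the cache

[root@student02 ~]# sysctl -a |grep swapp
vm.swappiness = 10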

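6. Out of memory (OOM)

oom-kill is the protection mechanism for out-of-memory conditions

[root@student02 ~]# echo -17 >/proc/1161/oom_adj

The value ranges from -17 to 15. The smaller the value, the less likely the process is to be killed when memory runs out, and -17 exempts it altogether. Give important processes a small value and set unimportant processes to 15 so that they are killed first to free memory. The setting is lost on reboot, so add it to a startup script to make it persistent.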
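Kernels of this generation also expose /proc/<pid>/oom_score_adj, which uses a range of -1000 to 1000, and oom_adj is kept as the legacy interface. The sketch below shows how the two scales relate, assuming the kernel's linear scaling with 15 mapped to the maximum; the function name oom_adj_to_score_adj is hypothetical and the code is only an illustration of the arithmetic.

```shell
# Convert a legacy oom_adj value (-17..15) to the oom_score_adj scale (-1000..1000).
# Assumption: linear scaling by 1000/17, with 15 treated as the maximum (1000).
oom_adj_to_score_adj() {
    if [ "$1" -eq 15 ]; then
        echo 1000
    else
        echo $(( $1 * 1000 / 17 ))
    fi
}

oom_adj_to_score_adj -17   # prints -1000 (never kill)
oom_adj_to_score_adj 0     # prints 0
oom_adj_to_score_adj 15    # prints 1000 (kill first)
```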

Disable oom-kill

[root@student02 ~]# sysctl -a |grep -i panic_on_oom
vm.panic_on_oom = 0

Setting this value to 1 disables oom-kill: the kernel panics on out-of-memory instead of killing a process

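7. Memory leaks

A memory leak is memory that a process has allocated but never frees. The leaked memory is only reclaimed by a restart.

Check a program for memory leaks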

[root@student02 ~]# valgrind --tool=memcheck vsftpd
......
==1243==    definitely lost: 0 bytes in 0 blocks
==1243==    indirectly lost: 0 bytes in 0 blocks
==1243==      possibly lost: 0 bytes in 0 blocks
......

8. swap

Free up swap space

[root@student02 ~]# swapoff -a
[root@student02 ~]# swapon -a

Add a swap partition

[root@student02 ~]# mkswap /dev/sda4
[root@student02 ~]# swapon /dev/sda4

List swap partitions

[root@student02 ~]# swapon -s

When there are several swap partitions, set a priority for each. The larger the number, the higher the priority; partitions with equal priority are used round-robin

[root@student02 ~]# cat /etc/fstab |grep swap
/dev/mapper/cl-swap swap swap defaults,pri=1 0 0

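9. Shared memory

A default installation mounts a tmpfs at /dev/shm, sized at half of physical memory. It is used for shared memory: memory can be used like a disk, which is very fast and suits some workloads

View the size of shared memory

[root@student02 ~]# df -h |grep shm
tmpfs 489M 0 489M 0% /dev/shm

Set the size of the shm mount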

[root@student02 ~]# cat /etc/fstab |grep shm
tmpfs   /dev/shm    tmpfs   defaults,size=128m
[root@student02 ~]# mount -o remount /dev/shm

squid uses shared memory when it acts as a reverse proxy

10. Memory overcommit

[root@student02 ~]# sysctl -a |grep overcommit
vm.overcommit_memory = 0
vm.overcommit_ratio = 50

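vm.overcommit_memory selects the mode. 0 is heuristic overcommit: requests are granted whenever overcommitting looks feasible. 1 always overcommits; when memory actually runs out, other processes are killed to free it, which can hang the system. 2 limits the memory that can be requested to swap plus a percentage of physical memory. The percentage, vm.overcommit_ratio, only takes effect in mode 2 and is ignored in modes 0 and 1.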
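As a rough illustration of the mode 2 limit, the sketch below computes swap plus the given percentage of RAM. The function name commit_limit_kb and the sample sizes are hypothetical. The kernel's own figure, reported as CommitLimit in /proc/meminfo, also excludes memory reserved for hugepages, so treat this as an approximation.

```shell
# Estimate the mode-2 commit limit in kB: swap + RAM * overcommit_ratio / 100.
# Arguments: MemTotal (kB), SwapTotal (kB), overcommit_ratio (percent).
commit_limit_kb() {
    local mem_kb=$1 swap_kb=$2 ratio=$3
    echo $(( $swap_kb + $mem_kb * $ratio / 100 ))
}

# Hypothetical host: 1 GiB of RAM, 2 GiB of swap, overcommit_ratio of 50.
commit_limit_kb 1048576 2097152 50   # prints 2621440
```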

VII. CPU and process tuning

1. IRQ balancing

Every device has its own IRQ number. View the devices' interrupt numbers

[root@student02 ~]# cat /proc/interrupts 
CPU0       CPU1       
0:         21          0   IO-APIC-edge      timer
1:         10          0   IO-APIC-edge      i8042
6:          2          0   IO-APIC-edge      floppy
8:          1          0   IO-APIC-edge      rtc0
9:          0          0   IO-APIC-fasteoi   acpi
12:         16          0   IO-APIC-edge      i8042
14:          0          0   IO-APIC-edge      ata_piix
15:        160          0   IO-APIC-edge      ata_piix
16:          2          0   IO-APIC-fasteoi   vmwgfx
17:       6606        719   IO-APIC-fasteoi   ioc0
18:         12         58   IO-APIC-fasteoi   ens32
......

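The number in the first column is the device's interrupt number; the NIC (ens32) has interrupt number 18

The count under CPU1 keeps growing, so CPU1 is handling the NIC's interrupts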

[root@student02 ~]# cat /proc/interrupts |grep -E "CPU|ens32"
CPU0       CPU1       
18:         14       1455   IO-APIC-fasteoi   ens32

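Set the NIC's IRQ affinity so that CPU0 handles its interrupts

[root@student02 ~]# echo 1 > /proc/irq/18/smp_affinity

On a server with 4 CPUs, let all 4 CPUs handle the NIC's interrupts

[root@student02 ~]# echo f > /proc/irq/18/smp_affinity

CPU affinity for other devices can be set the same way

[root@student02 ~]# echo x > /proc/irq/{irq_number}/smp_affinity

Note: x is a hexadecimal bitmask with one bit per CPU. f is 1111 in binary, which means all 4 CPUs are used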
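Working out the hexadecimal mask by hand is error-prone once more CPUs are involved. The sketch below builds the mask by setting one bit per CPU number; cpus_to_mask is a hypothetical helper name, not an existing tool.

```shell
# Build a hexadecimal affinity mask from a list of CPU numbers (bit N = CPU N).
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( $mask | (1 << $cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0          # prints 1
cpus_to_mask 0 1 2 3    # prints f
cpus_to_mask 1 3        # prints a
```

The result can then be written to smp_affinity in the same way as the literal values above.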

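2. Balancing tasks across CPUs

Tasks are balanced across all CPUs: every 100 ms when the system is busy and every 1 ms when it is idle. With the following command you can watch a cp process keep moving between CPUs while it copies a file

[root@student02 ~]# watch -n .1 'ps axo comm,pid,psr |grep cp'

Scheduler load balancing evens out the load on the CPUs, but it also lowers the cache hit rate

To improve the CPU cache hit rate, pin a process to a particular CPU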

[root@student02 ~]# taskset -p 1 5993
pid 5993's current affinity mask: 3
pid 5993's new affinity mask: 1
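The masks taskset prints are hexadecimal, with bit N standing for CPU N. The following sketch decodes such a mask into CPU numbers; mask_to_cpus is a hypothetical helper written for illustration.

```shell
# List the CPU numbers selected by a hexadecimal affinity mask.
mask_to_cpus() {
    local mask cpu cpus
    mask=$(( 0x$1 ))
    cpu=0
    cpus=""
    while [ "$mask" -gt 0 ]; do
        if [ $(( $mask & 1 )) -eq 1 ]; then
            cpus="${cpus:+$cpus }$cpu"   # append, space-separated
        fi
        mask=$(( $mask >> 1 ))
        cpu=$(( $cpu + 1 ))
    done
    echo "$cpus"
}

mask_to_cpus 3    # prints "0 1": the process could run on CPU 0 and CPU 1
mask_to_cpus 1    # prints "0": the process is now pinned to CPU 0
```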

3. CPU isolation

A CPU can be isolated when the kernel boots and reserved for a dedicated workload; use the taskset command to place processes on it

[root@student02 ~]# grep isolcpus /boot/grub2/grub.cfg
linux16 /vmlinuz-3.10.0-514.16.1.el7.x86_64 root=/dev/mapper/cl-root ro rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet LANG=en_US.UTF-8 isolcpus=1

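4. Taking a CPU offline

When a CPU in a server fails and the machine cannot be shut down to replace it, the following command takes the faulty CPU offline so that it can be replaced while the system keeps running

[root@student02 ~]# echo 0 >/sys/devices/system/cpu/cpu1/online

lscpu shows that one CPU is now offline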

[root@student02 ~]# lscpu |head -n 6
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0
Off-line CPU(s) list:  1

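5. Scheduling domains: the cgroup cpuset controller

A cpuset binds processes to CPUs and to memory nodes by manipulating the cgroup filesystem from user space; a process is restricted to the resources in its child cpuset

The root cpuset contains all resources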

[root@student02 ~]# cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-1
[root@student02 ~]# cat /sys/fs/cgroup/cpuset/cpuset.mems 
0

Create a child cpuset that contains CPU 1 and all memory. The virtual machine does not support NUMA, so all memory (node 0) is the only option

[root@student02 ~]# mkdir /sys/fs/cgroup/cpuset/mycpuset/
[root@student02 ~]# echo 1 > /sys/fs/cgroup/cpuset/mycpuset/cpuset.cpus
[root@student02 ~]# echo 0 >/sys/fs/cgroup/cpuset/mycpuset/cpuset.mems

A process can also be assigned to a cpuset. The root cpuset contains all processes

[root@student02 ~]# cat /sys/fs/cgroup/cpuset/tasks 
1
2
3
6
7
......

Move the nginx processes into mycpuset; they then no longer appear in the root cpuset

[root@student02 ~]# pidof nginx
1698 1697
[root@student02 ~]# echo -e "1698\n1697" >/sys/fs/cgroup/cpuset/mycpuset/tasks
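[root@student02 ~]# cat /sys/fs/cgroup/cpuset/tasks |grep -E "169[78]"

Conversely, check which cpuset a process is in

[root@student02 ~]# cat /proc/1697/cpuset
/mycpuset

None of the settings above survive a reboot. To make them persistent, write them to the cgroup configuration files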

[root@student02 ~]# cat /etc/cgconfig.conf |grep -A 5 mycpuset
group mycpuset {
cpuset {
cpuset.cpus = "1";
cpuset.mems = "0";
}
}
[root@student02 ~]# cat /etc/cgrules.conf |grep mycpuset
*:nginx     cpuset      mycpuset/
[root@student02 ~]# systemctl restart cgconfig
[root@student02 ~]# systemctl restart cgred

Restart the nginx service to test

[root@student02 ~]# nginx -s stop ;nginx
[root@student02 ~]# pidof nginx
2944 2943
[root@student02 ~]# cat /sys/fs/cgroup/cpuset/mycpuset/tasks 
2943
2944

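By default several child cpusets can share the same CPUs. To give one cpuset exclusive use of its CPUs, turn on this switch; to have it take effect at boot it also has to be written to the configuration

[root@student02 ~]# echo 1 >/sys/fs/cgroup/cpuset/cpuset.cpu_exclusive
[root@student02 ~]# echo 1 >/sys/fs/cgroup/cpuset/mycpuset/cpuset.cpu_exclusive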

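6. Process context switches (cs)

Understanding context switches

When the context switch rate is too high, the CPU acts like a porter, shuttling back and forth between the registers and the run queue, and spends more time switching threads than running the threads that do the real work. The direct cost is that CPU registers have to be saved and restored and the scheduler's code has to run. The indirect cost is the data shared between the caches of multiple cores.

Causes of context switches

The current task's time slice runs out and the CPU schedules the next task as normal;
The current task blocks on I/O, so the scheduler suspends it and moves on to the next task;
Several tasks compete for a lock, and a task that fails to get it is suspended by the scheduler, which moves on to the next task;
User code suspends the current task and gives up its CPU time;
Hardware interrupts.

Checking context switches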

[root@student02 ~]# vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
4  0      0 618300    968 246528    0    0    25     3   30   58  0  0 99  0  0
0  0      0 618284    968 246528    0    0     0     2   31   44  0  0 100  0  0
0  0      0 618284    968 246528    0    0     0     0   28   37  0  1 99  0  0
0  0      0 618284    968 246528    0    0     0     0   23   37  0  0 100  0  0
0  0      0 618284    968 246528    0    0     0     0   33   45  0  1 99  0  0

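bi: blocks read in from block devices per second
bo: blocks written out to block devices per second
in: interrupts per second
cs: context switches per second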
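To boil a vmstat run down to a single number, the cs column can be averaged with awk. This is a small sketch under the assumption that cs is the 12th column, as in the output above; avg_cs is a hypothetical helper name, and the sample rows are copied from the transcript.

```shell
# Average the cs column (field 12) of vmstat output, skipping the two header lines.
avg_cs() {
    awk 'NR > 2 {sum += $12; n++} END {if (n) print sum / n}'
}

# Sample rows copied from the vmstat output above; on a live host: vmstat 1 5 | avg_cs
avg_cs <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
4  0      0 618300    968 246528    0    0    25     3   30   58  0  0 99  0  0
0  0      0 618284    968 246528    0    0     0     2   31   44  0  0 100  0  0
0  0      0 618284    968 246528    0    0     0     0   28   37  0  1 99  0  0
0  0      0 618284    968 246528    0    0     0     0   23   37  0  0 100  0  0
0  0      0 618284    968 246528    0    0     0     0   33   45  0  1 99  0  0
EOF
```

For the five sample rows this prints 44.2. Note that the first data row of vmstat reports averages since boot, so it may be worth dropping it when judging current load.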

Show each process's context switches, refreshed every second

[root@student02 ~]# pidstat -w 1 5
Linux 3.10.0-514.16.1.el7.x86_64 (student02)    06/06/2017  _x86_64_    (1 CPU)
11:11:08 AM   UID       PID   cswch/s nvcswch/s  Command
11:11:09 AM     0         3      2.97      0.00  ksoftirqd/0
11:11:09 AM     0         6      0.99      0.00  kworker/u256:0
11:11:09 AM     0         9      3.96      0.00  rcu_sched
11:11:09 AM     0        10      0.99      0.00  watchdog/0
11:11:09 AM     0        41      3.96      0.00  kworker/0:2
11:11:09 AM     0       400     20.79      0.00  xfsaild/dm-0
11:11:09 AM     0       653      9.90      0.00  vmtoolsd
11:11:09 AM     0      1479      0.99      0.99  pidstat

cswch: voluntary context switches
nvcswch: involuntary context switches

View the total context switches on the host

[root@student02 ~]# sar -w 1 5
Linux 3.10.0-514.16.1.el7.x86_64 (student02)    06/06/2017  _x86_64_    (1 CPU)
11:21:09 AM    proc/s   cswch/s
11:21:10 AM      0.00     89.00
11:21:11 AM      0.00     87.76
11:21:12 AM      0.00     77.23
11:21:13 AM      0.00     81.82
11:21:14 AM      0.00     86.00
Average:         0.00     84.34

View the cumulative context switch counts of a specific process

[root@student02 ~]# grep ctxt /proc/1501/status 
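voluntary_ctxt_switches:    1 # voluntary context switches
nonvoluntary_ctxt_switches: 0 # involuntary context switches

cswch/s: number of voluntary context switches per second. A task that blocks while waiting gives up the CPU of its own accord.
nvcswch/s: number of involuntary context switches per second. The time slice the CPU gave the task has run out, so the task is forced to give up the CPU.

nagios check_mk monitors context switches by default. It reads the ctxt line from /proc/stat twice and takes the difference between the two samples.

[root@student02 ~]# cat /proc/stat|grep ctxt
ctxt 397480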
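The same two-sample approach can be sketched in a few lines of shell. This is not check_mk's actual code, only an illustration of the idea; read_ctxt and ctxt_rate are hypothetical helper names, and the second sample value below is invented.

```shell
# Read the cumulative context switch counter from a /proc/stat-style file.
read_ctxt() {
    awk '$1 == "ctxt" {print $2}' "$1"
}

# Context switches per second between two samples taken `interval` seconds apart.
ctxt_rate() {
    local first=$1 second=$2 interval=$3
    echo $(( ($second - $first) / $interval ))
}

# 397480 is the value shown above; 397569 is an invented second sample.
ctxt_rate 397480 397569 1   # prints 89

# On a live host one might sample for real:
#   before=$(read_ctxt /proc/stat); sleep 1; after=$(read_ctxt /proc/stat)
#   ctxt_rate "$before" "$after" 1
```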

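7. Run queues

Each CPU has two run queues, an active queue and an expired queue. When a running task is preempted by a higher-priority task, it moves to the expired queue. Once the tasks in the active queue have all run, the expired queue becomes the active queue and processing continues; the two queues keep swapping roles

nice values range from -20 to 19; the smaller the number, the higher the priority
There are two kinds of priority, static and dynamic. nice is a dynamic priority: its range of -20 to 19 corresponds to priorities 100-139, and the static priority of all of them is 0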
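The mapping from nice values to the 100-139 range is a fixed offset: priority = 120 + nice. A minimal sketch, with nice_to_priority as a hypothetical helper name:

```shell
# Map a nice value (-20..19) onto the 100..139 priority range: priority = 120 + nice.
nice_to_priority() {
    local nice=$1
    if [ "$nice" -lt -20 ] || [ "$nice" -gt 19 ]; then
        echo "nice value out of range: $nice" >&2
        return 1
    fi
    echo $(( 120 + $nice ))
}

nice_to_priority -20   # prints 100
nice_to_priority 0     # prints 120
nice_to_priority 19    # prints 139
```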

Change the nice priority

[root@student02 ~]# renice -20 1805
1805 (process ID) old priority 0, new priority -20
[root@student02 ~]# ps axo pid,comm,%cpu,nice |grep vsftpd
1805 vsftpd           0.0 -20

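There are three scheduling policies for processes; f takes precedence over r, which takes precedence over o
SCHED_FIFO: first in, first out. A task runs to completion before the next task gets its turn. When a higher-priority task arrives it is handled first, and the interrupted task resumes once that one is done
SCHED_RR: round-robin. Tasks take turns with equal time slices, but higher-priority tasks get more time slices
SCHED_OTHER: the remaining, non-real-time policy, adjusted with nice and renice

Set the FIFO policy for a process; chrt can also be followed by a command to run

[root@student02 ~]# chrt -f -p 20 1805

Set the round-robin policy for a process

[root@student02 ~]# chrt -r -p 20 1805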

VIII. TCP/IP and socket parameters

All TCP/IP parameters live under /proc/sys/net. Changes made directly under /proc/sys/net are temporary; to make them permanent, write them to sysctl.conf

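Parameter (path + file)	Description	Default
/proc/sys/net/core/rmem_default	Default receive buffer (window) size for sockets (bytes)	229376
/proc/sys/net/core/rmem_max	Maximum receive buffer (window) size for sockets (bytes)	131071
/proc/sys/net/core/wmem_default	Default send buffer (window) size for sockets (bytes)	229376
/proc/sys/net/core/wmem_max	Maximum send buffer (window) size for sockets (bytes)	131071
/proc/sys/net/core/netdev_max_backlog	Maximum number of packets allowed to queue when a network interface receives packets faster than the kernel can process them	1000
/proc/sys/net/core/somaxconn	Maximum length of the listen queue for each port on the system; this is a global parameter	128
/proc/sys/net/core/optmem_max	Maximum size of the ancillary (option) buffer allowed per socket	20480
/proc/sys/net/ipv4/tcp_mem	Determines how the TCP stack responds to memory usage; each value is in memory pages (usually 4 KB). The first value is the lower bound of memory usage; the second is the threshold at which memory-pressure mode starts applying pressure to buffer usage; the third is the upper bound, at which packets may be dropped to reduce memory usage. For a large BDP these values can be increased (note that the unit is pages, not bytes)	94011 125351 188022
/proc/sys/net/ipv4/tcp_rmem	Defines the memory a socket uses for autotuning. The first value is the minimum number of bytes allocated for the socket receive buffer; the second is the default (for TCP sockets it overrides rmem_default), to which the buffer can grow when the system is not heavily loaded; the third is the maximum number of bytes for the receive buffer (it does not override rmem_max)	4096 87380 4011232
/proc/sys/net/ipv4/tcp_wmem	Defines the memory a socket uses for autotuning. The first value is the minimum number of bytes allocated for the socket send buffer; the second is the default (for TCP sockets it overrides wmem_default), to which the buffer can grow when the system is not heavily loaded; the third is the maximum number of bytes for the send buffer (it does not override wmem_max)	4096 16384 4011232
/proc/sys/net/ipv4/tcp_keepalive_time	How long a connection must be idle before TCP starts sending keepalive probes to check whether it is still valid (seconds)	7200
/proc/sys/net/ipv4/tcp_keepalive_intvl	Interval between repeated probes when a probe gets no response (seconds)	75
/proc/sys/net/ipv4/tcp_keepalive_probes	Maximum number of keepalive probes sent before the TCP connection is considered dead	9
/proc/sys/net/ipv4/tcp_sack	Enables selective acknowledgment (1 = enabled). Acknowledging out-of-order segments selectively improves performance by letting the sender retransmit only the missing segments. Should be enabled for WAN traffic, at the cost of more CPU usage.	1
/proc/sys/net/ipv4/tcp_fack	Enables forward acknowledgment, which works together with selective acknowledgment (SACK) to reduce congestion; this option should also be enabled.	1
/proc/sys/net/ipv4/tcp_timestamps	TCP timestamps (adds 12 bytes to the TCP header). Enables RTT calculation in a way that is more accurate than retransmission timeouts (see RFC 1323); should be enabled for better performance.	1
/proc/sys/net/ipv4/tcp_window_scaling	Enables window scaling as defined in RFC 1323. Required to support TCP windows larger than 64 KB (1 = enabled); the window can be up to 1 GB. Only takes effect when both ends of the connection enable it.	1
/proc/sys/net/ipv4/tcp_syncookies	Whether TCP SYN cookies are turned on; the kernel must be compiled with CONFIG_SYN_COOKIES. SYN cookies keep a socket from being overloaded when too many connection attempts arrive.	1
/proc/sys/net/ipv4/tcp_tw_reuse	Whether sockets in the TIME-WAIT state (TIME-WAIT ports) may be reused for new TCP connections.	0
/proc/sys/net/ipv4/tcp_tw_recycle	Recycles TIME-WAIT sockets faster.	0
/proc/sys/net/ipv4/tcp_fin_timeout	For a socket closed by the local end, how long TCP stays in the FIN-WAIT-2 state (seconds). The peer may close the connection, never finish closing it, or die unexpectedly.	60
/proc/sys/net/ipv4/ip_local_port_range	Range of local port numbers that TCP/UDP may use	32768 61000
/proc/sys/net/ipv4/tcp_max_syn_backlog	Maximum number of connection requests kept in the queue that have not yet been acknowledged by the peer. If the server is often overloaded, try increasing this number.	2048
/proc/sys/net/ipv4/tcp_low_latency	Lets the TCP/IP stack favor low latency over high throughput; this option should be disabled.	0
/proc/sys/net/ipv4/tcp_westwood	Enables a sender-side congestion control algorithm that maintains an estimate of throughput and tries to optimize overall bandwidth utilization; should be enabled for WAN traffic.	0
/proc/sys/net/ipv4/tcp_bic	Enables Binary Increase Congestion control for fast long-distance networks, making better use of links operating at gigabit speeds; should be enabled for WAN traffic.	1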