HBase flush and compaction parameters

HBase 2.2.6

3 HMasters, 400+ RegionServers, about 800,000 regions

Total HFiles: 20 PB compressed, 80 PB uncompressed

Per node: 40 cores, 187.4 GB memory, 150 TB disk

Writes are mainly bulkload and put, about 150 TB written per day

RegionServer heap: 60 GB

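1. How HBase flush and compaction affect writes

In short:

Writes go to the memstore first.

When the memstore fills up, or another flush trigger fires, the memstore is flushed to a new HFile.

If a flush finds that the store holds more files than hbase.hstore.blockingStoreFiles (default 16), the RegionServer logs "too many store files" and delays the flush for up to hbase.hstore.blockingWaitTime (default 90000 ms = 90 s).

Taking the memstore snapshot during a flush also blocks writes briefly.

When the number of HFiles in a store reaches hbase.hstore.compaction.min (default 3), or another compaction trigger fires, a compaction is started.

Bulkload also triggers a flush, so store files accumulate faster, blockingStoreFiles is reached sooner, and flushes get delayed.

The flush and compaction triggers are covered in the sections below.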

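1.1 Appendix: reference parameters for reducing write blocking

These values are for reference only. Tune them for your own cluster based on system resource usage, logs, data volume, and so on.

memstore flush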

>hbase.hregion.compacting.memstore.type=BASIC
>hbase.hstore.flusher.count=5
>hbase.hregion.memstore.flush.size=268435456=256M
>hbase.regionserver.optionalcacheflushinterval=18000000=5hr
>hbase.hregion.memstore.block.multiplier=8
>hbase.regionserver.global.memstore.size.lower.limit=0.95f
>hbase.regionserver.global.memstore.upperLimit=0.4
>hbase.regionserver.global.memstore.size=0.4
>hbase.hstore.blockingWaitTime=-1
>hbase.hstore.blockingStoreFiles=200

compact

>hbase.regionserver.thread.compaction.small=10
>hbase.regionserver.thread.compaction.large=10
>hbase.hstore.compaction.kv.max=20
>hbase.hstore.compaction.min=6 // start compacting once a store has 6 files
>hbase.hstore.compaction.max=10 // merge at most 10 files per compaction
>hbase.hstore.compaction.max.size=10737418240=10G // files larger than this are skipped by minor compaction; can be lowered
>hbase.hstore.compaction.min.size=134217728=128M // defaults to hbase.hregion.memstore.flush.size
>hbase.server.compactchecker.interval.multiplier=500 // if iostat shows spare I/O capacity, shorten the compaction check period: 10*1000*500 ms = 83 min
>hbase.hstore.compaction.throughput.higher.bound=209715200 200M

hadoop

># datanodes 400+
>datanode heapsize=12.25g 
>dfs.datanode.max.transfer.threads=318524 
>dfs.datanode.handler.count=400
>dfs.namenode.blockreport.queue.size=40960
>dfs.blockreport.incremental.intervalMsec=1000
>dfs.datanode.socket.write.timeout=7200000
>ipc.maximum.data.length=134217728
>dfs.namenode.fs-limits.max-directory-items=6400000
>dfs.namenode.replication.max-streams=20
>dfs.namenode.replication.max-streams-hard-limit=20
>dfs.namenode.handler.count=400
>dfs.datanode.du.reserved=16106127360
>dfs.datanode.xceiver.stop.timeout.millis=181000
>dfs.client-write-packet-size=262144 
>dfs.client.hedged.read.threadpool.size=20
>dfs.client.hedged.read.threshold.millis=10
>dfs.client.read.striped.threadpool.size=30
>dfs.client.socket-timeout=180000
>dfs.datanode.du.reserved.pct=1 // default 0
>dfs.namenode.file.close.num-committed-allowed=1 // default 0
>ipc.client.rpc-timeout.ms=180000 // default 0

Others

>hbase.bucketcache.size=10240
>hbase.regionserver.handler.count=200
>hbase.ipc.server.read.threadpool.size=15
>hbase.mapreduce.bulkload.assign.sequenceNumbers=false // with the source change from section 1.2, disables the flush before bulkload
>hbase.hregion.max.filesize=10737418240 // splits are disabled
>hbase.regionserver.maxlogs=300
>hbase.rpc.timeout=120000 // default 60000; production S3 currently uses 90000 everywhere
>hbase.ipc.server.listen.queue.size=10000 // default 128

1.2 A less obvious optimization: patch the source so that bulkload does not flush

Most of our data arrives through bulkload in large volumes, and the frequent flushes it causes produce many small files, which hurts read and write performance.

// When region.bulkLoadHFiles is called, the second argument (assignSeqId) is hard-coded to true in the source.
// Take it from the request instead, so that hbase.mapreduce.bulkload.assign.sequenceNumbers=false is honored and no flush happens.
org.apache.hadoop.hbase.regionserver.SecureBulkLoadManager#secureBulkLoadHFiles
// Before
      return region.bulkLoadHFiles(familyPaths, true,
          new SecureBulkLoadListener(fs, bulkToken, conf), request.getCopyFile(),
          clusterIds, request.getReplicate());
// After
      return region.bulkLoadHFiles(familyPaths, request.getAssignSeqNum(),
          new SecureBulkLoadListener(fs, bulkToken, conf), request.getCopyFile(),
          clusterIds, request.getReplicate());

// When assignSeqId=true, the region is flushed first
org.apache.hadoop.hbase.regionserver.HRegion#bulkLoadHFiles(java.util.Collection<org.apache.hadoop.hbase.util.Pair<byte[],java.lang.String>>, boolean, org.apache.hadoop.hbase.regionserver.HRegion.BulkLoadListener, boolean, java.util.List<java.lang.String>, boolean)
      // We need to assign a sequential ID that's in between two memstores in order to preserve
      // the guarantee that all the edits lower than the highest sequential ID from all the
      // HFiles are flushed on disk. See HBASE-10958.  The sequence id returned when we flush is
      // guaranteed to be one beyond the file made when we flushed (or if nothing to flush, it is
      // a sequence id that we can be sure is beyond the last hfile written).
      if (assignSeqId) {
        FlushResult fs = flushcache(true, false, FlushLifeCycleTracker.DUMMY);
        if (fs.isFlushSucceeded()) {
          seqId = ((FlushResultImpl)fs).flushSequenceId;
        } else if (fs.getResult() == FlushResult.Result.CANNOT_FLUSH_MEMSTORE_EMPTY) {
          seqId = ((FlushResultImpl)fs).flushSequenceId;
        } else if (fs.getResult() == FlushResult.Result.CANNOT_FLUSH) {
          // CANNOT_FLUSH may mean that a flush is already on-going
          // we need to wait for that flush to complete
          waitForFlushes();
        } else {
          throw new IOException("Could not bulk load with an assigned sequential ID because the "+
            "flush didn't run. Reason for not flushing: " + ((FlushResultImpl)fs).failureReason);
        }
      }

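1.3 Client-side tuning

Ask the client owners to check that their side is configured sensibly as well,

for example whether puts are batched, the batch size, the interval between batches, and how bulkload is used.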

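2. memstore flush

2.1 memstore flush parameters

hbase.hstore.flusher.count, default 2

Number of memstore flush threads. Raising it speeds up flushing.
Do not set it above 50% of the number of disks.

hbase.hregion.memstore.flush.size, default 128M

A flush is triggered when any single memstore in a region reaches this size.
Note that a memstore can grow beyond this value.

For example, if a region has only one memstore, then with hbase.hregion.memstore.flush.size=256M and hbase.hregion.memstore.block.multiplier=8 that memstore can grow to as much as 2G.

hbase.hregion.memstore.block.multiplier, default 4

Region-level blocking factor. When the total size of all memstores in a region reaches the limit blockingMemStoreSize, writes to the region are blocked and a flush is forced.
blockingMemStoreSize = hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size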
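
The relationship between the flush size and the blocking size can be checked with a few lines of arithmetic. The sketch below is not HBase code: the class and method names are made up for illustration, and the comparisons are a simplified model of the behaviour described above.

```java
/** Illustrative helper, not part of HBase: models the per-region memstore thresholds. */
final class MemStoreLimits {
    private MemStoreLimits() {}

    /** blockingMemStoreSize = block.multiplier * flush.size, in bytes. */
    static long blockingMemStoreSize(long flushSizeBytes, long blockMultiplier) {
        return Math.multiplyExact(flushSizeBytes, blockMultiplier);
    }

    /** A flush is requested once a memstore grows past the flush size. */
    static boolean shouldRequestFlush(long memStoreBytes, long flushSizeBytes) {
        return memStoreBytes > flushSizeBytes;
    }

    /** Writes are rejected once the region's memstores grow past the blocking size. */
    static boolean shouldBlockWrites(long regionMemStoreBytes, long flushSizeBytes,
                                     long blockMultiplier) {
        return regionMemStoreBytes > blockingMemStoreSize(flushSizeBytes, blockMultiplier);
    }
}
```

With a 256M flush size and a multiplier of 8, `blockingMemStoreSize(268435456L, 8)` is 2147483648 bytes (2G), so a memstore of about 1.8G is already past the flush size but still below the blocking size.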

hbase.regionserver.global.memstore.size.lower.limit, default 0.95f

When the total memstore size on a RegionServer exceeds the low watermark, the RegionServer starts forcing flushes: first the region with the largest memstore, then the next largest, and so on.
Low watermark = hbase.regionserver.global.memstore.size.lower.limit * hbase.regionserver.global.memstore.size

hbase.regionserver.global.memstore.lowerLimit

Old name; replaced by hbase.regionserver.global.memstore.size.lower.limit.

hbase.regionserver.global.memstore.size, default 0.4f

Fraction of the heap. When the total memstore size exceeds this high watermark, the RegionServer blocks updates and forces flushes until the total drops back to the low watermark.
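
To make the two watermarks concrete, the following sketch converts the fractions into bytes for a given heap. It is only an illustration under simplifying assumptions: the class name is hypothetical, it uses double arithmetic, and it ignores any off-heap memstore configuration.

```java
/** Illustrative helper, not part of HBase: RegionServer-wide memstore watermarks in bytes. */
final class GlobalMemStoreWatermarks {
    final long highBytes;
    final long lowBytes;

    GlobalMemStoreWatermarks(long heapBytes, double globalMemStoreSize, double lowerLimit) {
        // High watermark: fraction of the heap that all memstores together may use.
        this.highBytes = (long) (heapBytes * globalMemStoreSize);
        // Low watermark: fraction of the high watermark.
        this.lowBytes = (long) (this.highBytes * lowerLimit);
    }

    /** Above the low watermark the RegionServer starts forcing flushes, largest region first. */
    boolean forcesFlush(long totalMemStoreBytes) {
        return totalMemStoreBytes > lowBytes;
    }

    /** Above the high watermark updates are blocked until usage drops back to the low watermark. */
    boolean blocksUpdates(long totalMemStoreBytes) {
        return totalMemStoreBytes > highBytes;
    }
}
```

For a 60G heap with hbase.regionserver.global.memstore.size=0.4 and a lower limit of 0.95, the high watermark comes out at about 24G and the low watermark at about 22.8G.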

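hbase.regionserver.global.memstore.upperLimit

Old name; replaced by hbase.regionserver.global.memstore.size.

hbase.regionserver.optionalcacheflushinterval, default 3600000 = 1 hr

Interval of HBase's periodic memstore flush, which ensures that no memstore stays unpersisted for too long.
To keep all memstores from flushing at the same moment, periodic flushes are delayed by a random amount.

hbase.hstore.blockingWaitTime, default 90000

When the number of HFiles in a store exceeds hbase.hstore.blockingStoreFiles, the flush is delayed for up to hbase.hstore.blockingWaitTime and then proceeds.
With a value <= 0 the flush is never delayed, but if compaction still cannot keep up, the number of files in the store will keep growing.

hbase.hstore.blockingStoreFiles, default 16

When the number of HFiles in a store exceeds this value, the flush is delayed for up to hbase.hstore.blockingWaitTime and then proceeds.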
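
The interaction between hbase.hstore.blockingStoreFiles and hbase.hstore.blockingWaitTime can be summarised as a small decision function. This is a simplified model of the behaviour described above, not the RegionServer's actual code; the class, enum, and parameter names are hypothetical.

```java
/** Illustrative model, not part of HBase: should a pending flush run now or be delayed? */
final class StoreFileBlocking {
    enum Decision { FLUSH_NOW, DELAY_AND_WAIT_FOR_COMPACTION }

    private StoreFileBlocking() {}

    static Decision decide(int storeFileCount, int blockingStoreFiles,
                           long blockingWaitTimeMs, long alreadyWaitedMs) {
        boolean tooManyFiles = storeFileCount > blockingStoreFiles;
        // A wait time <= 0 disables the delay entirely.
        if (!tooManyFiles || blockingWaitTimeMs <= 0) {
            return Decision.FLUSH_NOW;
        }
        // The delay is bounded: once the wait time has elapsed the flush proceeds anyway.
        if (alreadyWaitedMs >= blockingWaitTimeMs) {
            return Decision.FLUSH_NOW;
        }
        return Decision.DELAY_AND_WAIT_FOR_COMPACTION;
    }
}
```

With the defaults (16 files, 90000 ms), a store holding 17 files delays its flush until either a compaction reduces the file count or 90 s have passed. With blockingWaitTime=-1, as in the reference list in section 1.1, the flush is never delayed regardless of the file count.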

hbase.regionserver.maxlogs, deprecated

When the number of HLogs (WAL files) on a RegionServer reaches this value, the regions referenced by the oldest HLog are flushed.

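2.2 When a memstore flush is triggered

MemStore-level limit
- A flush is triggered when any single memstore in a region reaches the limit.
- Limit = hbase.hregion.memstore.flush.size, default 128 MB
Region-level limit
- A flush is triggered when the total size of all memstores in a region reaches the limit.
- Limit = hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
RegionServer-level limit
- When the total memstore size on a RegionServer exceeds the low watermark, the RegionServer starts forcing flushes: first the region with the largest memstore, then the next largest, and so on. Low watermark = hbase.regionserver.global.memstore.size.lower.limit * hbase.regionserver.global.memstore.size
- If write throughput stays high and the total exceeds the high watermark hbase.regionserver.global.memstore.size, the RegionServer blocks updates and forces flushes until the total drops back to the low watermark.
HLog count limit
- When the number of HLogs on a RegionServer reaches the limit (hbase.regionserver.maxlogs), the regions referenced by the oldest HLog are flushed.
Periodic flush
- The default period is 1 hour (hbase.regionserver.optionalcacheflushinterval), which ensures that no memstore stays unpersisted for too long. To keep all memstores from flushing at the same moment, periodic flushes are delayed by a random amount.
Manual flush
- Run flush 'tablename' or flush 'regionname' in the shell to flush a table or a single region.
Flush before bulkload
- Importing HFiles with bulkload also triggers a flush.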

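2.3 How a memstore flush executes

To reduce the impact of a flush on reads and writes, HBase uses an approach similar to two-phase commit and splits the flush into three phases.

prepare phase: iterate over all memstores in the region, take a snapshot of each memstore's current data set, a CellSkipListSet (implemented on top of ConcurrentSkipListMap), and create a new CellSkipListSet to receive incoming writes. This phase takes updateLock, which blocks write requests, and releases it when done. Nothing expensive happens here, so the lock is held only briefly.
flush phase: iterate over all memstores and persist the snapshots from the prepare phase as temporary files, all placed under the .tmp directory. This involves disk I/O and is therefore relatively slow.
commit phase: iterate over all memstores, move the temporary files from the flush phase into the corresponding ColumnFamily directory, create a storefile and Reader for each HFile, add the storefile to the store's storefiles list, and finally clear the snapshot taken in the prepare phase.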
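
The three phases above can be sketched with a toy in-memory store. This is a minimal illustration under simplifying assumptions, not HBase's implementation: `ToyMemStore` and its methods are hypothetical, keys and values are plain strings, and the "temporary file" is just a sorted copy held in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of the prepare / flush / commit phases; not HBase code. */
final class ToyMemStore {
    private final ReentrantReadWriteLock updateLock = new ReentrantReadWriteLock();
    private volatile ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private ConcurrentSkipListMap<String, String> snapshot = new ConcurrentSkipListMap<>();
    private final List<Map<String, String>> storeFiles = new ArrayList<>();

    /** Writers share the read lock, so they only wait while a snapshot is being taken. */
    void put(String key, String value) {
        updateLock.readLock().lock();
        try {
            active.put(key, value);
        } finally {
            updateLock.readLock().unlock();
        }
    }

    synchronized void flush() {
        // prepare: swap in a fresh active set under the write lock (short critical section)
        updateLock.writeLock().lock();
        try {
            snapshot = active;
            active = new ConcurrentSkipListMap<>();
        } finally {
            updateLock.writeLock().unlock();
        }
        // flush: persist the snapshot; the slow part runs without blocking writers
        Map<String, String> tmpFile = new TreeMap<>(snapshot);
        // commit: publish the new store file and drop the snapshot
        storeFiles.add(tmpFile);
        snapshot = new ConcurrentSkipListMap<>();
    }

    int activeSize() { return active.size(); }

    synchronized int storeFileCount() { return storeFiles.size(); }
}
```

The point of the design is that writers are only blocked during the swap in the prepare step; the expensive copy happens after the write lock has been released.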

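3. Compaction

3.1 Basic flow

A compaction only runs when specific trigger conditions are met, for example after some flushes complete or when the periodic compaction check fires. Once triggered, HBase executes it through a fixed sequence of steps.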

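3.2 Compaction parameters

hbase.hstore.compaction.min, default 3
- Compaction starts once the number of store files in a store reaches this value.
- The minimum is 2. If unset, HBase falls back to hbase.hstore.compactionThreshold, and if that is unset too, to the default.
- Raising it moderately reduces how often the same file is recompacted, but too large a value leaves too many files in the store and hurts read performance.
hbase.hstore.compactionThreshold
- Old name; replaced by hbase.hstore.compaction.min.
hbase.hstore.compaction.max, default 10
- At most 10 files are selected for a single compaction.
- Serves much the same purpose as hbase.hstore.compaction.max.size: keeping a single compaction from running too long.
hbase.hstore.compaction.kv.max, default 10
- Maximum number of KVs read or written in one batch during flush or compaction.
hbase.hstore.compaction.max.size, default Long.MAX_VALUE
- An HFile larger than this value is not selected by minor compaction; only a major compaction will include it.
- This keeps large HFiles out of compactions. With major compaction disabled, a store may then hold several HFiles that are never merged into one, without much impact on read performance.
hbase.hstore.compaction.min.size, default = hbase.hregion.memstore.flush.size
- Store files smaller than this value are always included in a minor compaction.
hbase.regionserver.thread.compaction.small, default 1
- Number of threads in the small compaction pool.
- Do not set it above 50% of the number of disks.
- Keep the small pool at least as large as the large pool.
hbase.regionserver.thread.compaction.large, default 1
- Number of threads in the large compaction pool.
- Do not set it above 50% of the number of disks.
hbase.server.thread.wakefrequency, default 10 * 1000 ms = 10 s
- Sleep interval used by service threads; the default is fine.
hbase.server.compactchecker.interval.multiplier, default 1000
- Compaction is normally triggered by flushes and similar events. To avoid going too long without compacting, each store is also checked periodically.
- Check period = hbase.server.compactchecker.interval.multiplier * hbase.server.thread.wakefrequency. When there is spare I/O capacity, the period can be shortened.
hbase.regionserver.compaction.check.period, default = hbase.server.thread.wakefrequency
- Interval of the periodic compaction check.
hbase.hregion.majorcompaction, default 1000 * 60 * 60 * 24 * 7 ms = 7 days
hbase.hregion.majorcompaction.jitter, default 0.50F
hbase.regionserver.thread.compaction.throttle
- Default = 2 * hbase.hstore.compaction.max * hbase.hregion.memstore.flush.size
- If the total size of the files in one compaction exceeds this value, the compaction goes to the large compaction pool; otherwise it goes to the small compaction pool.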
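
As a quick check of the check-period formula above (multiplier times wake frequency), the sketch below computes the period for the default multiplier and for the value 500 used in section 1.1. The class name is hypothetical and the code only does the arithmetic; it does not read any HBase configuration.

```java
import java.time.Duration;

/** Illustrative helper, not part of HBase: effective period of the per-store compaction check. */
final class CompactionCheckPeriod {
    private CompactionCheckPeriod() {}

    /** period = hbase.server.thread.wakefrequency * hbase.server.compactchecker.interval.multiplier */
    static Duration period(long wakeFrequencyMs, long intervalMultiplier) {
        return Duration.ofMillis(Math.multiplyExact(wakeFrequencyMs, intervalMultiplier));
    }
}
```

With the default wake frequency of 10000 ms, a multiplier of 1000 gives 10,000,000 ms (about 2 hours 47 minutes) and a multiplier of 500 gives 5,000,000 ms (about 83 minutes).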

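3.3 When compaction is triggered

MemStore flush
- Compaction ultimately originates from flushes: every memstore flush produces an HFile, and as files pile up they need to be merged. After every flush the number of files in the store is therefore checked, and once it exceeds hbase.hstore.compactionThreshold (new name hbase.hstore.compaction.min) a compaction is triggered.
- Note that a compaction always works on a single store, but after a flush every store in the region is checked, so a region may run several compactions within a short time.
Periodic background check
- The RegionServer runs a background thread, CompactionChecker, that periodically checks whether each store needs compaction. The check period is hbase.server.thread.wakefrequency * hbase.server.compactchecker.interval.multiplier. Unlike the flush path, this thread first checks whether the number of files in the store exceeds hbase.hstore.compactionThreshold and triggers a compaction if so; otherwise it checks whether the major compaction condition holds. Put simply, if the oldest HFile in the store was last updated before a certain time mcTime, a major compaction is triggered. mcTime floats within [7 - 7 × 0.5, 7 + 7 × 0.5] days by default, where 7 is hbase.hregion.majorcompaction and 0.5 is hbase.hregion.majorcompaction.jitter, so by default a major compaction runs about every 7 days, somewhere between 3.5 and 10.5. To disable automatic major compaction, set hbase.hregion.majorcompaction to 0.
Manual trigger

Manual compactions are mostly major compactions. There are usually three reasons to trigger one by hand:

First, many teams worry that automatic major compaction will hurt read and write performance, so they trigger it manually during off-peak hours.

Second, after an alter operation, users want the change to take effect immediately and trigger a major compaction.

Third, when disk space runs low, an HBase administrator triggers a major compaction to purge large amounts of expired data.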
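
The major compaction window used by the periodic check can be computed from the two settings hbase.hregion.majorcompaction and hbase.hregion.majorcompaction.jitter. The sketch below assumes the 7-day period and the 0.50 jitter listed as defaults in section 3.2; the class name is hypothetical and the code only reproduces the interval arithmetic, not the random choice HBase makes inside that interval.

```java
/** Illustrative helper, not part of HBase: bounds of the major compaction interval. */
final class MajorCompactionWindow {
    private MajorCompactionWindow() {}

    /** Returns {earliest, latest} in milliseconds: period -/+ period * jitter. */
    static long[] windowMillis(long periodMs, double jitterFraction) {
        long jitterMs = Math.round(periodMs * jitterFraction);
        return new long[] { periodMs - jitterMs, periodMs + jitterMs };
    }
}
```

For a period of 604800000 ms (7 days) and a jitter of 0.5, the window runs from 302400000 ms to 907200000 ms, i.e. from 3.5 to 10.5 days.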

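3.4 Choosing the thread pool

HBase has a dedicated class, CompactSplitThread, that receives compaction and split requests. To handle them independently it creates several thread pools internally: largeCompactions, smallCompactions, splits, and others. The splits pool handles all split requests, largeCompactions handles large compactions, and smallCompactions handles small compactions.

Three points are worth making explicit:

- The purpose of this design is to process requests independently and improve throughput.
- A large compaction is not a major compaction, and a small compaction is not a minor compaction. HBase defines a threshold, hbase.regionserver.thread.compaction.throttle: if the total size of the files being compacted exceeds it, the compaction counts as large, otherwise as small. Large compactions are assigned to the largeCompactions pool and small ones to the smallCompactions pool.
- Both pools have one thread by default; configure them with hbase.regionserver.thread.compaction.large and hbase.regionserver.thread.compaction.small.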
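
The pool choice described above reduces to one comparison against the throttle threshold. The following sketch shows that comparison together with the default threshold formula from section 3.2; the class and enum names are hypothetical and this is a simplified model, not the RegionServer's code.

```java
/** Illustrative model, not part of HBase: which pool handles a compaction of a given size. */
final class CompactionPoolSelector {
    enum Pool { SMALL, LARGE }

    private CompactionPoolSelector() {}

    /** Default throttle = 2 * hbase.hstore.compaction.max * hbase.hregion.memstore.flush.size. */
    static long defaultThrottleBytes(int compactionMaxFiles, long flushSizeBytes) {
        return 2L * compactionMaxFiles * flushSizeBytes;
    }

    /** Compactions whose selected files total more than the throttle go to the large pool. */
    static Pool select(long totalSelectedBytes, long throttleBytes) {
        return totalSelectedBytes > throttleBytes ? Pool.LARGE : Pool.SMALL;
    }
}
```

With the defaults (10 files, 128M flush size) the threshold is 2684354560 bytes, i.e. 2.5G. A minor compaction that merges 3G of files would therefore run in the large pool, which is why "large" and "major" should not be confused.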

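Errors encountered

RegionTooBusyException

hbase.hregion.memstore.flush.size=268435456=256M

hbase.hregion.memstore.block.multiplier=8

blockingMemStoreSize = hbase.hregion.memstore.flush.size * hbase.hregion.memstore.block.multiplier = 2G

We throw RegionTooBusyException if above memstore limit, and expect client to retry using some kind of backoff

This error is reported when the region's memstore size exceeds blockingMemStoreSize; in other words, a memstore is allowed to grow beyond hbase.hregion.memstore.flush.size.

In practice we saw a table with a single region whose memstore reached 1.8G without flushing: the region-level flush was stuck. That was an abnormal case; after the RegionServer was restarted and the region was assigned to another machine, it flushed normally.

In the end, raising the memstore size to 512M made the RegionTooBusyException go away.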
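
The source comment quoted above expects clients to retry with some kind of backoff. The sketch below illustrates one common form, capped exponential backoff, in plain Java. It is an assumption-laden illustration: `BusyException` is a stand-in for RegionTooBusyException so that the example compiles without HBase on the classpath, all names and parameters are hypothetical, and the HBase client has its own retry and pause settings that this code does not replace.

```java
import java.util.concurrent.Callable;

/** Illustrative retry helper with capped exponential backoff; not HBase client code. */
final class BackoffRetry {
    /** Stand-in for RegionTooBusyException. */
    static final class BusyException extends Exception {
        BusyException(String message) { super(message); }
    }

    private BackoffRetry() {}

    /** Delay before the next attempt: baseMs * 2^attempt, capped at maxMs. */
    static long delayMs(int attempt, long baseMs, long maxMs) {
        long delay = baseMs << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    /** Runs op, retrying on BusyException up to maxAttempts times in total. */
    static <T> T call(Callable<T> op, int maxAttempts, long baseMs, long maxMs) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (BusyException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // give up and surface the last failure
                }
                Thread.sleep(delayMs(attempt, baseMs, maxMs));
            }
        }
    }
}
```

Spacing retries out this way gives the RegionServer time to finish the flush that will bring the memstore back under blockingMemStoreSize, instead of adding load while it is already over the limit.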

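Too many open files

The usual fix is to raise the max open files limit, but in our case that had no effect until the Ambari agent was restarted.

With Kerberos enabled, each service runs as two processes: a parent owned by root and a child owned by ocdp. The services are started through Ambari, so after changing root's ulimit you must restart the Ambari agent first and then start the services through it.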

https://community.cloudera.com/t5/Support-Questions/Too-many-open-files-in-region-server-logs/m-p/124185#M86929

References

https://docs.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hbase_service.html

https://utf7.github.io/docs/hbaseconasia2019track31430-190819225905.pdf
