HBase Resources
Production HBase
Master status page
Note: You must be on the VPN and you must update your hosts file (see below) for these links to work.
To load each regionserver status page, scroll down and click each link.
Production HBase hosts
You will need to add these entries to your /etc/hosts file:
10.211.12.12 LKSBNMASTER02
10.211.12.22 LKSBNDATANODE02
10.211.12.23 LKSBNDATANODE03
10.211.12.24 LKSBNDATANODE04
10.211.12.25 LKSBNDATANODE05
10.211.12.26 LKSBNDATANODE06
10.211.12.27 LKSBNDATANODE07
10.211.12.28 LKSBNDATANODE08
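Rather than hand-editing, the entries above can be added with a short idempotent helper. This is a sketch: ensure_host is a hypothetical function, and it is demonstrated against a temp file so it does not touch the real /etc/hosts (point HOSTS_FILE at /etc/hosts and run with sudo for real use):

```shell
#!/bin/sh
# Idempotently append "IP HOSTNAME" entries to a hosts file.
# Demonstrated against a temp file; set HOSTS_FILE=/etc/hosts (with sudo) for real use.
HOSTS_FILE="$(mktemp)"

ensure_host() {
  ip="$1"; name="$2"
  # Only append if the hostname is not already listed.
  grep -q "[[:space:]]${name}\$" "$HOSTS_FILE" || printf '%s %s\n' "$ip" "$name" >> "$HOSTS_FILE"
}

ensure_host 10.211.12.12 LKSBNMASTER02
ensure_host 10.211.12.22 LKSBNDATANODE02
ensure_host 10.211.12.22 LKSBNDATANODE02   # duplicate call is a no-op
# ...repeat for the remaining data nodes...
cat "$HOSTS_FILE"
```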
Staging and UAT HBase
Both the staging and thunderdome clusters are currently deployed as standalone HBase, which means that the regionserver and master server reside on the same node.
Staging HBase - hbase0-lks.staging-2.banno-internal.com
Staging master status
Staging region server status:
UAT HBase - http://hbase0-lks.uat-2.banno-internal.com
UAT master status
UAT region server status:
Restarting HBase
$ ssh hbase0-lks.staging-2.banno-internal.com
adam@staginghbase1:~$ sudo -s
[sudo] password for adam:
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-master stop
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-regionserver stop
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-master start
root@staginghbase1:/home/adam# /etc/init.d/hadoop-hbase-regionserver start
If the HBase services do not stop cleanly with the commands above, you can kill them by user:
root@staginghbase1:/home/adam# sudo pkill -u hbase
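If you do this often, the stop-then-kill fallback can be scripted. A minimal sketch, assuming you are already root; stop_or_kill is a hypothetical helper, not a script that exists on the nodes:

```shell
#!/bin/sh
# Try a service's normal stop command; fall back to killing the service
# user's processes if the clean stop fails. Hypothetical helper.
stop_or_kill() {
  stop_cmd="$1"   # e.g. "/etc/init.d/hadoop-hbase-master stop"
  user="$2"       # e.g. "hbase"
  if sh -c "$stop_cmd"; then
    echo "stopped cleanly"
  else
    echo "clean stop failed, killing processes owned by $user"
    pkill -u "$user" || true
  fi
}

# Real use (as root):
#   stop_or_kill "/etc/init.d/hadoop-hbase-master stop" hbase
#   stop_or_kill "/etc/init.d/hadoop-hbase-regionserver stop" hbase
```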
Restarting an HBase regionserver gracefully
On the node:
- Run /usr/lib/hbase/bin/graceful_stop.sh `hostname`
- If a region gets stuck during the graceful stop, manually assign it in an hbase shell (in a different tab, leaving graceful_stop running): assign 'banno_transaction| .....|01234558.0123abcdf8'
- Ctrl-C when graceful_stop tries to ssh after assigning regions and stopping
- Start the regionserver if necessary
General HBase links
Troubleshooting Checklist
- Check the Scala errors that we are seeing; HBase-related errors are most prevalent in siphon and api but can occur in Che as well. They will usually show up as either NotServingRegionExceptions or timeouts when trying to make HBase calls.
- Check the Grafana graphs to ensure that the HBase cluster is receiving requests and that the compaction times are OK; you can also see from these graphs whether the regions need rebalancing. If you are looking at the data-services graphs, spikes in the Account Context Timing are most often the result of issues with HBase calls. You will also see a corresponding increase in siphon errors.
- Check the master status page for dead regions and/or regions stuck in transition. Each has its own section on the status page. Note: the Regions in Transition section will occasionally list regions while the cluster is working properly, but a region should not stay in transition for long.
- Check the regionserver status pages to ensure that the regionservers are alive and do not have any issues.
- Check for HBase inconsistencies on any node in the cluster with
hbase hbck
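For monitoring, the hbck report can be reduced to a pass/fail check. A sketch, assuming the report ends with a "Status: OK" / "Status: INCONSISTENT" summary line as recent hbck versions print (verify against your version's output); hbck_status is a hypothetical helper:

```shell
#!/bin/sh
# Reduce an `hbase hbck` report on stdin to its final one-word status.
# Hypothetical helper; assumes the report ends with "Status: OK" or
# "Status: INCONSISTENT" (check the output format of your HBase version).
hbck_status() {
  grep '^Status:' | tail -n 1 | awk '{print $2}'
}

# Real use (on a cluster node):
#   hbase hbck 2>/dev/null | hbck_status
```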
Common Issues
IN THE PRODUCTION ENVIRONMENT THESE FIXES MUST BE DONE BY THE OPS/INFRA FIREFIGHTER
Regions Stuck in Transition
Occasionally a region will get stuck when transitioning between regionservers. You can see this on the master status page: at the bottom there will be Regions in Transition that do not go away within a couple of minutes. You will also see an error like this in the logs for apps which use HBase (api, siphon, history):
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: pending_transaction,c6c46790-7338-11e3-8ae1-005056a30036
|9223370645101975807|InstitutionPending |4f6293e0-900a-11e3-831b-005056a30032,1431623112470.ba71c4fb13dfbd5d687edf20cbf92431.
To fix stuck regions:
- SSH onto an HBase node (see above)
- Enter the shell with hbase shell
- Manually assign the region to a regionserver by running assign '<full-region-name>' in the shell. An example of the full region name is shown in the log line above; it is everything after the NotServingRegionException. It should look something like:
id_to_hbasekey,080184b3afa71516c5e62d246cd23981e79bb6eb,1443190671719.a82dc39a1e28084653f4617e0c9c147c.
- You can also get the region name from Right Click -> "View Source" on the HBase master status page
- Repeat the process until all the stuck regions have been assigned to a regionserver.
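Copying the full region name out of a log line by hand is error-prone. Here is a sketch that pulls out everything after the last "NotServingRegionException: " and wraps it in an assign command ready to paste into the hbase shell; assign_cmd_from_log is a hypothetical helper:

```shell
#!/bin/sh
# Build an `assign` command from a NotServingRegionException log line on stdin.
# Hypothetical helper: strips everything up to the last exception prefix
# and wraps the remaining region name in assign '...'.
assign_cmd_from_log() {
  sed 's/.*NotServingRegionException: //' | sed "s/.*/assign '&'/"
}

printf '%s\n' "org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: id_to_hbasekey,080184b3afa71516c5e62d246cd23981e79bb6eb,1443190671719.a82dc39a1e28084653f4617e0c9c147c." | assign_cmd_from_log
# prints: assign 'id_to_hbasekey,080184b3afa71516c5e62d246cd23981e79bb6eb,1443190671719.a82dc39a1e28084653f4617e0c9c147c.'
```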
Regionserver not responding to queries
On occasion a regionserver will enter a deadlocked state where it appears to be alive to the master node but does not respond to any of the messages sent to it. In this state the regionserver queues up all the db transactions sent to it in the write-ahead log so that nothing is lost, but queries for data on these nodes do not return.
This problem will show itself by causing the regionserver status pages to not load properly.
Any of the following commands on the HBase node require sudo. If you see an error like the following when running sudo <command>, run sudo -s first, then run the command on its own:
+======================================================================+
| Error: JAVA_HOME is not set and Java could not be found |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site |
| > http://java.sun.com/javase/downloads/ < |
| |
| HBase requires Java 1.6 or later. |
| NOTE: This script will find Sun Java whether you install using the |
| binary or the RPM based installer. |
+======================================================================+
To fix the deadlocked regionserver:
- Restart the regionserver process on the offending node: /etc/init.d/hadoop-hbase-regionserver restart. Note: you may need to force the process to stop and then start it with /etc/init.d/hadoop-hbase-regionserver start if it does not respond to the restart command.
- Rebalance the cluster after the server is up and has completed its startup process.
Dead Region Server
You will be able to see that a regionserver is dead by checking the master's status page. Toward the bottom of the page there is a section labeled Dead Region Servers, which will list any regionservers that are dead.
This scenario is the most likely issue if you are looking into staging hbase issues.
To fix the dead regionserver you will have to:
- Restart the process by running /etc/init.d/hadoop-hbase-regionserver start on the server that has died.
- Rebalance the cluster after the server is no longer in the dead servers list and has completed its startup process.
Rebalancing the cluster
A “cluster rebalance” is needed when the number of regions on each regionserver is not roughly even. A rebalance will usually have to be done after a service has been restarted in order to bring that node fully back into the cluster.
To Rebalance The Cluster:
- Enter the hbase shell on any server in the cluster with hbase shell
- Run balancer.
- If the regions do not start balancing, or the balancer command returns false, you will need to turn balancing on. To do this, run balance_switch true in the hbase shell, and then repeat the balancer command.
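These steps can also be run non-interactively by piping commands into the hbase shell. A sketch; rebalance_cmds is a hypothetical helper that just emits the command sequence:

```shell
#!/bin/sh
# Emit the hbase shell commands for a rebalance. Hypothetical helper.
rebalance_cmds() {
  echo "balance_switch true"  # ensure balancing is enabled
  echo "balancer"
  echo "exit"
}

# Real use, on any cluster node:
#   rebalance_cmds | hbase shell
rebalance_cmds
```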
Truncating an HBase table
Often we truncate a few mobile data-services tables because they contain ephemeral data. To do this, follow these steps:
On 10.211.12.12
$ sudo -i
$ su - hbase
$ hbase shell
> disable 'pending_id_to_hbasekey'
> drop 'pending_id_to_hbasekey'
> disable 'pending_transaction'
> drop 'pending_transaction'
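The same disable/drop sequence can be generated for any list of tables and piped into the hbase shell. A sketch; drop_cmds is a hypothetical helper:

```shell
#!/bin/sh
# Emit disable/drop hbase shell commands for each table given. Hypothetical helper.
drop_cmds() {
  for t in "$@"; do
    echo "disable '$t'"
    echo "drop '$t'"
  done
}

# Real use, as the hbase user on the master:
#   drop_cmds pending_id_to_hbasekey pending_transaction | hbase shell
drop_cmds pending_id_to_hbasekey pending_transaction
```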
Siphon will migrate/create the tables on boot, so on siphon0-lks.production-2.banno-internal.com:
$ sudo -i
$ sv restart banno-siphon-beta