Jama Administration: server logs

  • 1.  Jama Administration: server logs

    Posted 04-05-2017 07:45
    Hello,

    Yesterday, our production Jama server suddenly and unexpectedly became unavailable.  I logged into the Jama Admin Console to discover that it was in a stopped state.

    I looked at all the logs timestamped yesterday under the various log locations listed in the Jama User Guide, but I didn't see anything that would give any clues as to what could have caused the server to 'fail/stop'.  These are the log files I looked at:
    • catalina.2017-04-04.log
    • contour-access.log
    • contour-api.log
    • contour-cache.log
    • contour-dwr.log
    • contour.log
    • error.log
    • gc.log.0.current
    • jama-job.log
    • JamaElasticSearch.log
    • jamasearch-2017-04-04.0.log
    • localhost.2017-04-04.log
    • localhost_access_log.2017-04-04.txt
    • sentry.log
    Are there other logs that I missed that could contain relevant forensic information that one should look at in such circumstances?

    On a related but separate note, I read that the dbo.evententry table contains logs of many different types of user actions.  Is there a similar table that captures the various configuration actions of users with the ability to manage/admin Jama projects, as well as activities done by Jama admins and even the root user?  (For example, logging of user group membership changes, item type modifications, project relationship configurations, etc.)

    Thank you.

    ------------------------------
    victor
    ------------------------------


  • 2.  RE: Jama Administration: server logs

    Posted 04-05-2017 08:46
    Hi Victor,

    These logs are all good places to start. However, with Jama 8 that's not enough; you also need to look into the Replicated logs and OS logs.
    You can find Replicated log details here: https://www.replicated.com/docs/packaging-an-application/support-bundle/
    From these logs, you should check:
    • daemon/commands/dmesg for possible OS issues like Out Of Memory
    • daemon/docker/docker_info.json to see if you can tell what container died first

     I checked with Shabnam (who knows way more about Replicated!) and she suggests checking that Replicated doc out and being creative—look at related logs to see if there is some sort of network issue or something wrong with any Replicated service.
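
    If it helps, here's a rough sketch in Python of how you might scan the dmesg capture for OOM-killer messages. The bundle path below is an assumption on my part and may differ depending on your Replicated version:

    import re

    # Assumed location inside an extracted support bundle; the layout
    # varies between Replicated versions, so adjust the path as needed.
    DMESG_PATH = "daemon/commands/dmesg"

    # The kernel OOM killer logs lines such as "Out of memory: Kill process ..."
    # and "Killed process <pid> (<name>) ...", so a case-insensitive search for
    # those phrases is usually enough to spot an OOM event.
    oom_pattern = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

    with open(DMESG_PATH, errors="replace") as f:
        for line in f:
            if oom_pattern.search(line):
                print(line.rstrip())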

    ------------------------------
    Kristina King
    Jama Software
    ------------------------------



  • 3.  RE: Jama Administration: server logs

    Posted 04-06-2017 05:20
    Thanks so much, Kristina.

    I will let you know.

    Thanks!

    ------------------------------
    victor
    ------------------------------



  • 4.  RE: Jama Administration: server logs

    Posted 04-06-2017 08:23
      |   view attached

    Hi Kristina,

    I looked at every single log file contained in the support bundle (which I just generated from our production Jama Admin console, which is 2 days after the incident).

    I saw the dmesg file, which appeared more than once in the support bundle, but frankly this file is light years beyond my understanding, lol.  It appears to give a current 'snapshot' of the state of the server's device drivers.  Nothing jumped out at me in this file, and I don't know how to approach correlating it with the incident.  It seemed to me to be an informational file rather than an "if-there-was-something-wrong-it-would-be-caught-here" type of file.  However, again, I do not possess the knowledge to make use of this file.

    I did find the docker_info.json file, but not in the path you mentioned; in fact, my support bundle didn't have the /daemon/docker folder at all (we're running Jama 8.3 with Replicated 1.x, so that's probably why).

    Mine was found in hosts\<long unique number>\docker\docker_info.json:

    [
    "Debug=false",
    "OomKillDisable=true",
    "KernelVersion=3.10.0-229.el7.x86_64",
    "Name=atljamapp02",
    "Labels=null",
    "ContainersRunning=4",
    "DriverStatus=[[\"Pool Name\",\"docker-253:3-50104-pool\"],[\"Pool Blocksize\",\"65.54 kB\"],[\"Base Device Size\",\"10.74 GB\"],[\"Backing Filesystem\",\"xfs\"],[\"Data file\",\"/dev/loop0\"],[\"Metadata file\",\"/dev/loop1\"],[\"Data Space Used\",\"1.719 GB\"],[\"Data Space Total\",\"107.4 GB\"],[\"Data Space Available\",\"18.17 GB\"],[\"Metadata Space Used\",\"4.178 MB\"],[\"Metadata Space Total\",\"2.147 GB\"],[\"Metadata Space Available\",\"2.143 GB\"],[\"Udev Sync Supported\",\"true\"],[\"Deferred Removal Enabled\",\"false\"],[\"Deferred Deletion Enabled\",\"false\"],[\"Deferred Deleted Device Count\",\"0\"],[\"Data loop file\",\"/var/lib/docker/devicemapper/devicemapper/data\"],[\"Metadata loop file\",\"/var/lib/docker/devicemapper/devicemapper/metadata\"],[\"Library Version\",\"1.02.107-RHEL7 (2015-12-01)\"]]",
    "Plugins={\"Authorization\":null,\"Network\":[\"bridge\",\"null\",\"host\"],\"Volume\":[\"local\"]}",
    "NCPU=4",
    "Containers=5",
    "ContainersStopped=1",
    "CpuCfsPeriod=true",
    "BridgeNfIptables=false",
    "NGoroutines=76",
    "IndexServerAddress=https://index.docker.io/v1/",
    "MemTotal=16659513344",
    "ID=Y2IA:GIZ6:XIAT:SFXY:LPCP:6NTW:NJ7X:OXJR:47CP:XB2R:H47D:NE5O",
    "Images=6",
    "Driver=devicemapper",
    "KernelMemory=true",
    "CPUSet=true",
    "CgroupDriver=cgroupfs",
    "OSType=linux",
    "ExperimentalBuild=false",
    "ContainersPaused=0",
    "BridgeNfIp6tables=false",
    "LoggingDriver=json-file",
    "NEventsListener=1",
    "Architecture=x86_64",
    "DockerRootDir=/var/lib/docker",
    "HttpsProxy=",
    "ClusterStore=",
    "SystemStatus=null",
    "ClusterAdvertise=",
    "IPv4Forwarding=true",
    "NFd=53",
    "SystemTime=2017-04-06T08:44:45.583498373-04:00",
    "ExecutionDriver=",
    "HttpProxy=",
    "CpuCfsQuota=true",
    "NoProxy=",
    "ServerVersion=1.11.1",
    "RegistryConfig={\"IndexConfigs\":{\"docker.io\":{\"Mirrors\":null,\"Name\":\"docker.io\",\"Official\":true,\"Secure\":true}},\"InsecureRegistryCIDRs\":[\"127.0.0.0/8\"],\"Mirrors\":null}",
    "SwapLimit=true",
    "CPUShares=true",
    "OperatingSystem=Red Hat Enterprise Linux",
    "MemoryLimit=true"
    ]
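
    As a side note, the container counts can be pulled out of that file with a few lines of Python (a rough sketch, assuming the "Key=value" list format shown above):

    import json

    # Adjust the path to wherever docker_info.json sits in your extracted bundle
    # (in this bundle it was under hosts\<long unique number>\docker\).
    with open("docker_info.json") as f:
        entries = json.load(f)  # a list of "Key=value" strings, as shown above

    info = dict(e.split("=", 1) for e in entries if "=" in e)
    for key in ("Containers", "ContainersRunning", "ContainersStopped", "ContainersPaused"):
        print(key, "=", info.get(key))

    For this dump that reports Containers = 5, ContainersRunning = 4, ContainersStopped = 1, i.e. one container was stopped at the time the bundle was generated.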

    The rest of the log files were either obviously not relevant (e.g. access logs), or unreadable (binary?), or appeared to contain no information helpful for determining the cause of the incident.

    For example, the various Java VM GC logs: while they did contain errors, the same errors also occurred consistently prior to the incident.

    Of the handful of logs that appeared to capture events, none seemed to capture what stopped the server.  They captured when I restarted the Jama server, but prior to that the logging seemed normal.

    For example here is an excerpt from jamacore\files\var\log\tomcat\contour\contour-cache.log:

    2016-12-01 00:11:11,683 WARN localhost-startStop-1 jamadb [net.sf.ehcache.config.CacheConfiguration] - Cache: integrationConnections has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
    2016-12-01 00:11:11,683 WARN localhost-startStop-1 jamadb [net.sf.ehcache.config.CacheConfiguration] - Cache: backingProjectLocksCache has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
    2016-12-01 00:11:11,683 WARN localhost-startStop-1 jamadb [net.sf.ehcache.config.CacheConfiguration] - Cache: backingBaselineScopeCache has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
    2017-04-04 14:51:56,584 WARN localhost-startStop-1 [net.sf.ehcache.config.CacheConfiguration] - Cache: backingDomainReferenceSetCache has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
    2017-04-04 14:51:56,709 WARN localhost-startStop-1 [net.sf.ehcache.config.CacheConfiguration] - Cache: backingNodesByIdCache has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory
    2017-04-04 14:51:56,710 WARN localhost-startStop-1 [net.sf.ehcache.config.CacheConfiguration] - Cache: backingProjectLocksCache has a maxElementsInMemory of 0. This might lead to performance degradation or OutOfMemoryError at Terracotta client.From Ehcache 2.0 onwards this has been changed to mean a store with no capacity limit. Set it to 1 if you want no elements cached in memory

    The above indicates that the last cache start/stop before the incident was on Dec 1, 2016, and the next entry was on Apr 4, 2017 at 14:51 UTC, which was when I restarted Jama.

    The most interesting logs were the contour logs (contour.log, contour.log.1, contour.log2).  All of these logs represented the day of the incident and the day before:
    • contour.log - April 4
    • contour.log1 - April 3
    • contour.log2 - April 3

    Each of the logs contained many LDAP-related exceptions, due to our LDAP server/provider having sporadic issues.  But I noticed that only on April 4th were there "System clock" INFO entries (gathered here):

    2017-04-04 13:50:51,583 INFO cached1 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:50:00.054 to 2017-04-04T13:50:49.940 since last heartbeat (+44886 ms)
    2017-04-04 13:51:27,271 INFO cached4 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:50:49.940 to 2017-04-04T13:51:27.210 since last heartbeat (+32270 ms)
    2017-04-04 13:52:06,337 INFO cached1 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:51:27.210 to 2017-04-04T13:52:06.334 since last heartbeat (+34124 ms)
    2017-04-04 13:52:43,116 INFO cached6 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:52:06.334 to 2017-04-04T13:52:40.410 since last heartbeat (+29076 ms)
    2017-04-04 13:53:19,373 INFO cached14 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:52:40.410 to 2017-04-04T13:53:18.733 since last heartbeat (+33323 ms)
    2017-04-04 13:53:44,520 INFO cached14 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:53:18.733 to 2017-04-04T13:53:44.519 since last heartbeat (+20786 ms)
    2017-04-04 13:57:27,894 INFO cached8 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:53:44.519 to 2017-04-04T13:57:27.728 since last heartbeat (+218209 ms)
    2017-04-04 13:58:47,012 INFO cached25 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T13:57:27.728 to 2017-04-04T13:58:46.926 since last heartbeat (+74198 ms)
    2017-04-04 14:01:12,106 INFO cached16 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T14:00:06.839 to 2017-04-04T14:01:10.868 since last heartbeat (+59029 ms)
    2017-04-04 14:04:46,858 INFO cached25 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T14:03:35.884 to 2017-04-04T14:04:46.853 since last heartbeat (+65969 ms)
    2017-04-04 14:07:08,736 INFO cached10 [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] System clock apparently jumped from 2017-04-04T14:06:45.576 to 2017-04-04T14:07:08.725 since last heartbeat (+18149 ms)

    Our Jama outage was first reported at 14:39, which seems to line up with the timestamps at which these "System clock" entries started to appear: the first was at 13:50 and the last at 14:07.
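
    To quantify it, here is a rough Python sketch that pulls the jump sizes out of contour.log and totals them (assuming the Hazelcast message format shown above):

    import re

    # Hazelcast logs "... since last heartbeat (+NNNNN ms)"; pull out the NNNNN values.
    jump_re = re.compile(r"System clock apparently jumped .* \(\+(\d+) ms\)")

    jumps = []
    with open("contour.log", errors="replace") as f:
        for line in f:
            m = jump_re.search(line)
            if m:
                jumps.append(int(m.group(1)))

    print("clock-jump entries:", len(jumps))
    print("largest jump (ms): ", max(jumps) if jumps else 0)
    print("total jumped (ms): ", sum(jumps))

    The largest single jump in the entries above is 218209 ms, the same value Hazelcast flags in the WARN entry quoted further down.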

    In between these "System clock" entries I picked out the following, all of which were WARN entries:

    2017-04-04 14:01:12,117 WARN C3P0PooledConnectionPoolManager[identityToken->1bqq1hg9k1ez37v2199gu43|3c1acf0c]-HelperThread-#0 temp_tenant_to_verify_session [com.mchange.v1.db.sql.StatementUtils] - Statement close FAILED.
    java.sql.SQLException: Invalid state, the Connection object is closed.
    at net.sourceforge.jtds.jdbc.TdsCore.checkOpen(TdsCore.java:482)
    at net.sourceforge.jtds.jdbc.TdsCore.clearResponseQueue(TdsCore.java:767)
    at net.sourceforge.jtds.jdbc.JtdsStatement.reset(JtdsStatement.java:755)
    at net.sourceforge.jtds.jdbc.JtdsStatement.close(JtdsStatement.java:999)
    at com.mchange.v1.db.sql.StatementUtils.attemptClose(StatementUtils.java:53)
    at com.mchange.v2.c3p0.impl.DefaultConnectionTester.activeCheckConnection(DefaultConnectionTester.java:124)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.testPooledConnection(C3P0PooledConnectionPool.java:509)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.testPooledConnection(C3P0PooledConnectionPool.java:463)
    at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool$1PooledConnectionResourcePoolManager.refurbishIdleResource(C3P0PooledConnectionPool.java:435)
    at com.mchange.v2.resourcepool.BasicResourcePool$AsyncTestIdleResourceTask.run(BasicResourcePool.java:2151)
    at com.mchange.v2.async.ThreadPoolAsynchronousRunner$PoolThread.run(ThreadPoolAsynchronousRunner.java:648)

    2017-04-04 14:51:29,885 WARN  localhost-startStop-1   [com.getsentry.raven.dsn.Dsn] - Couldn't find a suitable DSN, defaulting to a Noop one.

    2017-04-04 13:57:27,912 WARN  cached8   [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [172.17.0.4]:5701 [dev] [3.6] Resetting master confirmation timestamps because of huge system clock jump! Clock-Jump: 218209 ms, Master-Confirmation-Timeout: 350000 ms

    That's it in terms of what I could find among the support bundle logs.

    Could the "System clock" jumping cause Jama to stop?  I have attached the entire support bundle.

    Thanks,
    victor



    ------------------------------
    victor
    ------------------------------




  • 5.  RE: Jama Administration: server logs

    Posted 04-06-2017 15:33
    Victor, this is way over my head, so I'm opening a support ticket so a colleague can check it out.

    ------------------------------
    Kristina King
    Jama Software
    ------------------------------



  • 6.  RE: Jama Administration: server logs

    Posted 04-07-2017 05:11
    Thank you, Kristina!

    ------------------------------
    victor
    ------------------------------