Two problems with VMware?

Posted on August 14, 2009 by Tommy McGuire

In the last couple of weeks, we have had some perplexing VMware-related issues. The first I think we have a handle on; it appears to be a memory situation related to the balloon driver, but the second is a networking problem that has me stumped (and that's disheartening).

Can I not have a balloon?

Our production servers are currently VMware virtual machines running RHEL, configured with 8GB of memory and 2GB of swap space. On each VM, we are running an Apache httpd front-end and a SpringSource dm Server application server configured with 6GB max heap space. With the JVM's overhead, I figure that thing should use something around 7GB, leaving another GB for httpd and assorted other processes. Normally, this configuration seems to run very well. I have been monitoring the JVMs and have not seen any memory misbehavior; the heap and the GCs look exactly like what they should. (Of course, the servers are fairly lightly loaded.)

The problem is that our dm Server instances have been dying. Reviewing their log files shows nothing. The system logs, however, show that the oom-killer has been whacking our processes. After a certain amount of investigation, I think I am going to blame the VMware balloon driver and our cluster's memory overcommitment. Here's my theory: Our server is running along happily when the VMware host decides it needs to reclaim memory from our guest and inflates the balloon. That allocates memory inside the guest OS, which seems to be tripping the out-of-memory condition, and the oom-killer ultimately saves the day by shooting our app server in the head.
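To make the theory concrete, here is a back-of-the-envelope sketch of the guest's memory budget. The 8GB, 2GB, and 7GB figures come from the configuration above; the balloon sizes are purely hypothetical, since I have no idea how much the host actually reclaimed, and the "demand exceeds RAM plus swap" test is a crude stand-in for whatever the kernel really does before invoking the oom-killer.

```python
# Rough memory budget for the guest, in GB.
# Balloon sizes are hypothetical; the real reclaimed amount is unknown.

GUEST_RAM_GB = 8.0      # configured guest memory
GUEST_SWAP_GB = 2.0     # configured swap space
JVM_FOOTPRINT_GB = 7.0  # 6GB max heap plus an estimated ~1GB of JVM overhead
OTHER_PROCS_GB = 1.0    # httpd and assorted other processes (estimate)


def headroom_gb(balloon_gb: float) -> float:
    """RAM plus swap left over once the balloon pins balloon_gb of guest RAM."""
    available = GUEST_RAM_GB - balloon_gb + GUEST_SWAP_GB
    demand = JVM_FOOTPRINT_GB + OTHER_PROCS_GB
    return available - demand


def oom_likely(balloon_gb: float) -> bool:
    """Crude model: trouble starts when demand exceeds RAM plus swap."""
    return headroom_gb(balloon_gb) < 0


if __name__ == "__main__":
    for balloon in (0.0, 1.0, 2.0, 3.0):
        print(f"balloon={balloon:.0f}GB "
              f"headroom={headroom_gb(balloon):+.1f}GB "
              f"oom_likely={oom_likely(balloon)}")
```

Under these assumptions the guest has about 2GB of slack with no balloon, so a balloon of more than 2GB would leave the fully grown JVM with nowhere to go. The real threshold could be quite different, but the shape of the problem is the same.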

Theoretically, that should not happen; the VMware balloon driver is a Linux kernel module, and it should be capable of doing its job without going so far as to trigger a drastic out-of-memory situation. But I could be wrong; I have found several discussions on VMware's community forums describing similar issues.

I'm not talking to you

The other issue has me more worried (and irritated). It is a little complex, so picture me waving my hands around because I am too lazy to draw a picture.

Suppose we have two VMware hosts, Host1 and Host2. Each of these hosts is running two guest OSes: Server1 and Client1 on Host1, and Server2 and Client2 on Host2. Also, we have a load balancer outside the two hosts, configured to balance the load destined for "Server" between Server1 and Server2. (The two Clients are actually also servers, and load balanced, but they are acting as clients in this situation.)

Suppose further that Client1 attempts to make a request to Server. If the load balancer assigns the request to Server2, all is well. The request is handled, there are fluffy bunnies everywhere and rainbows and flowers. However, if the load balancer assigns the request to Server1, nothing happens. No handling, no bunnies, not even a dang daisy.
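One way to pin down which paths actually fail is to probe them from Client1. The following is a minimal sketch, not something I have run in this environment: it just attempts a TCP connection with a timeout. The function names are mine, and any hostnames or ports you feed it are placeholders for whatever "Server", Server1, and Server2 are really called.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def probe_all(targets: dict, port: int, timeout: float = 3.0) -> dict:
    """Probe each named target once and map its name to the result."""
    return {name: can_connect(host, port, timeout)
            for name, host in targets.items()}


# Hypothetical usage from Client1 (hostnames and port are placeholders):
#   probe_all({"via load balancer": "server.example",
#              "Server1 direct": "server1.example",
#              "Server2 direct": "server2.example"}, port=80)
```

If the direct connection to Server1 works but the load-balanced one intermittently hangs, that would at least be consistent with the story I was told. A single probe through the load balancer only tests whichever backend it happened to pick, so it would need repeating.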

The problem, as it has been explained to me,^[1] is that the connection from Client1 leaves Host1, visits the load balancer, is sent back to Host1, and is rejected because it is going back to the same host machine that it was sent from.

I have no real idea what is going on there, although my suspicion is that something is drastically misconfigured. Certainly, that cannot be happening.

When the request leaves Host1, it would have a source IP address of either Client1 or Host1 (depending on the VMware configuration), a source MAC address of either Client1 or Host1, a destination IP address of Server (the load-balanced address), and a destination MAC address of either the load balancer or an intermediate router. After the load balancer makes its decision, the connection request would be transmitted with a source IP address of either Client1 or Host1, a source MAC address of the load balancer, a destination IP address of Server1, and a destination MAC address of Host1 or Server1. Any of those cases should work.

Now, if I am wrong about that scenario and the post-load balancer packet has a source MAC address of Host1 and destination MAC address of Host1, then Host1 would be justified in discarding the packet. However, that means that the load balancer is acting as a switch (after doing IP-and-above based load balancing) and Host1 is configured as a router or gateway for Server1. I am not sure I believe either the former or the latter.
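The two scenarios can be written down as a toy model. This is only a sketch of my reasoning, not a description of how the load balancer or VMware actually behave: the `Frame` type, the `load_balance` function, and the symbolic addresses are all made up for illustration, and I am assuming the variant where Host1's MAC address (rather than Client1's) appears on the wire.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    """Just the four header fields discussed above (symbolic values)."""
    src_ip: str
    dst_ip: str
    src_mac: str
    dst_mac: str


def load_balance(frame: Frame, backend_ip: str, backend_mac: str,
                 lb_mac: str, rewrite_src_mac: bool = True) -> Frame:
    """Forward a frame to the chosen backend.

    rewrite_src_mac=True is the case I expect: the load balancer sends
    the frame with its own source MAC. rewrite_src_mac=False models the
    suspicious case where the original source MAC survives.
    """
    return replace(
        frame,
        dst_ip=backend_ip,
        dst_mac=backend_mac,
        src_mac=lb_mac if rewrite_src_mac else frame.src_mac,
    )


def host_would_discard(frame: Frame) -> bool:
    """A host could justifiably drop a frame that claims to come from itself."""
    return frame.src_mac == frame.dst_mac


# Outbound request from Client1, assuming Host1's MAC is used on the wire.
outbound = Frame(src_ip="ip:client1", dst_ip="ip:server",
                 src_mac="mac:host1", dst_mac="mac:lb")

expected = load_balance(outbound, "ip:server1", "mac:host1", "mac:lb")
suspicious = load_balance(outbound, "ip:server1", "mac:host1", "mac:lb",
                          rewrite_src_mac=False)

if __name__ == "__main__":
    print("expected case discarded:  ", host_would_discard(expected))
    print("suspicious case discarded:", host_would_discard(suspicious))
```

In this model only the second case, where the source MAC is left untouched, gives Host1 a reason to drop the frame, which is why I keep coming back to a misconfiguration somewhere.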

^[1] A significant chunk of the problem is that I have no access to anything. I have no visibility into VMware, the load balancer, the network, or anything other than the Servers. So I get to rely on second-hand or third-hand, and probably inaccurate, information.