Employing an optimal traffic light control policy has the potential of having a positive impact, both economic and environmental, on urban mobility. Reinforcement learning techniques have shown promising results in optimizing control policies for basic intersections and low volume traffic. This paper addresses the traffic light control problem in a complex scenario, such as a signalized roundabout with heavy traffic volumes, with the aim of maximizing throughput and avoiding traffic jams. We formulate the environment with a realistic representation of states and actions and a capacity-based reward. We enforce episode terminal conditions to avoid unwanted states, such as long queues interfering with other junctions in the vehicular network. A time-dependent baseline is proposed to reduce the variance of Policy Gradient updates in the setting of episodic conditions, thus improving the algorithm convergence to an optimal solution. We evaluate the method on real data and highly congested traffic, implementing a signalized simulated roundabout with 11 phases. The proposed method is able to avoid traffic jams and achieves higher performance than traditional time-splitting policies and standard Policy Gradient on average delay and effective capacity, while drastically decreasing the emissions.