Statistical analysis is about deriving insight from information. Information can be plentiful without being insightful. Sometimes we use statistics for exploration, sometimes for argumentation. In the former case, we just want to understand and discover relationships, and we can take a neutral interest in what we might find. In the latter, we have already decided that we understand the mechanism beneath the data, and we are working to justify it for the benefit of others. In both cases, we want to get the data to tell its story.
In the mid-1800s, a young doctor named John Snow was confronted with the problem of cholera outbreaks in London. The prevailing wisdom of the day blamed such sickness on miasma, or bad air. (For bonus points, identify a disease whose name literally means ‘bad air’.) John Snow did not accept the miasma explanation, but instead worked tirelessly to investigate and record every detail he could learn regarding every death he could trace in those London neighborhoods ravaged by cholera. Through his studies he became convinced that the disease was more likely to spread through contaminants in drinking water than through airborne means.
John Snow attempted to share his findings with the scientific authorities, and he was largely ignored. This reaction strikes us as tragic if we believe that John Snow was onto something, but it also illustrates a key dynamic that shapes our use of statistics: it is hard for people to change their minds. We should recognize, for example, that there was a cost associated with viewing cholera as a water issue. There was a cost associated with shutting down a pump in 1854 London. A change to a water supply would be a logistical complication that would potentially strain other resources, and the effect on individuals could range from inconvenience to financial hardship to serious health issues. Justification would be demanded of anyone who instituted such chaos.
Fortunately for the field of epidemiology, John Snow recognized that more compelling proof was required, and he diligently pursued the evidence and told the story. His analysis of an 1854 cholera outbreak led him to conclude that contaminated water from a particular pump on Broad Street was to blame. Next, he would need evidence. Snow reasoned that if the use of contaminated water (and not general proximity to bad air) was to blame, then there should be observably different outcomes among people who lived close to one another while relying on different water supplies, and this he was able to show.
His case for the Broad Street outbreak as a water issue included a carefully produced map with a Voronoi diagram. This showed that the physical shape of the outbreak was neatly defined by the area whose residents used the Broad Street pump. We who have consumed so many maps, charts, and infographics probably cannot fully appreciate the novelty and genius of this map, which compiled death records from ledgers and lists and transformed them into a geographical image of the house-by-house impact of a public health crisis. This map told the story.
John Snow even went so far as to examine some of the departures from his suggested association: a community within the Broad Street zone that was unaffected by cholera, and some deaths of people who lived outside the zone and ought to have had access to a different pump. In both cases, he found the data continuing to tell its story: cholera’s spread was less about the air and more about the bacteria in water. With his data mining and graphical analysis, Snow’s next trip to the local authorities was more successful, and the offending pump had its handle removed. Snow’s vindication was not complete or immediate (it is hard for people to change their minds), but it was a victory for medicine and statistical analysis.
Voronoi Diagrams
Included below are some examples of Voronoi diagrams. Their regions are defined by a basic construction in geometry, the perpendicular bisector. However, it is not simple or intuitive to visualize the regions from the starting points. The white points on these diagrams are sometimes called generators, and these correspond to the water sources in John Snow’s map. The buildings and roads in John Snow’s map imposed a non-Euclidean geometry, meaning that the shortest route between two locations was not a straight line, but a path along roads and sidewalks. That affected the shape of his Voronoi regions.
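To see how the choice of distance can change which generator counts as "nearest," here is a minimal sketch. The generator positions and the house location are made up for illustration, and the "city block" distance is only a simple stand-in for walking along streets; it is not the measurement John Snow used.

```javascript
// Hypothetical generator (pump) positions, not taken from Snow's map.
const generators = [
  { x: 0, y: 0 },
  { x: 7, y: 4 },
];

// Straight-line (Euclidean) distance.
function euclidean(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// "City block" distance: travel only along horizontal and vertical streets.
function cityBlock(a, b) {
  return Math.abs(a.x - b.x) + Math.abs(a.y - b.y);
}

// Index of the generator nearest to `point` under the given distance function.
function nearestGenerator(point, generators, distance) {
  let best = 0;
  for (let i = 1; i < generators.length; i++) {
    if (distance(point, generators[i]) < distance(point, generators[best])) {
      best = i;
    }
  }
  return best;
}

// A hypothetical house location.
const house = { x: 5, y: 0 };
console.log(nearestGenerator(house, generators, euclidean)); // 1
console.log(nearestGenerator(house, generators, cityBlock)); // 0
```

The same house belongs to a different region depending on how distance is measured, which is why a map built on walking routes has differently shaped regions than one built on straight lines.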
As a programming exercise, try to make a Voronoi diagram! Instead of working out the geometry of the region boundaries directly, we can use recursion to color the map. This works as follows:
Start with a square and some generators. Each generator belongs to a color. The nearest generator determines the color of a point. Here is the recursive rule: If the corners of a square are all the same color (they have the same nearest generator), then make the whole square that color. If they do not all have the same generator, we need to look more closely, so we divide the square into fourths and repeat the process with all four of the new (smaller) squares. Finally, we tell the program when to stop by saying something like, “if the square is smaller than one pixel, don’t do anything.”
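The recursive rule above can be sketched without any drawing code. In this version the squares are collected into an array instead of being painted; the names `nearestIndex`, `subdivide`, and `minSide`, and the two generator positions, are illustrative assumptions rather than part of the full listing at the end of this post.

```javascript
// Index of the generator closest to (x, y), by squared Euclidean distance.
function nearestIndex(x, y, generators) {
  let best = 0;
  let bestD = Infinity;
  generators.forEach((g, i) => {
    const d = (x - g.x) ** 2 + (y - g.y) ** 2;
    if (d < bestD) {
      bestD = d;
      best = i;
    }
  });
  return best;
}

// Recursively split the square whose top-left corner is (x, y).
// Pushes { x, y, side, owner } onto `out` for every square whose four
// corners share a nearest generator, and returns `out`.
function subdivide(x, y, side, generators, minSide, out) {
  const corners = [
    nearestIndex(x, y, generators),
    nearestIndex(x + side, y, generators),
    nearestIndex(x + side, y + side, generators),
    nearestIndex(x, y + side, generators),
  ];
  if (corners.every((c) => c === corners[0])) {
    out.push({ x, y, side, owner: corners[0] });
    return out;
  }
  // Corners disagree: look more closely, unless the pieces would be too small.
  const half = side / 2;
  if (half > minSide) {
    subdivide(x, y, half, generators, minSide, out);
    subdivide(x + half, y, half, generators, minSide, out);
    subdivide(x, y + half, half, generators, minSide, out);
    subdivide(x + half, y + half, half, generators, minSide, out);
  }
  return out;
}

// Two hypothetical generators, mirrored across the vertical line x = 50.
const twoGenerators = [{ x: 25, y: 50 }, { x: 75, y: 50 }];
const squares = subdivide(0, 0, 100, twoGenerators, 0.2, []);
console.log(squares.length, squares[0]);
```

Checking only the corners is enough here because, under straight-line distance, each Voronoi region is convex: if all four corners of a square lie in one region, so does everything between them. Squares that still straddle a boundary at the smallest size are simply left uncolored.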
Here are some intermediate snapshots of a construction with five generators. In these, I have included outlines to show the squares colored in each step.
Here are a few finished maps with seven random generators.
You can try to make such a picture yourself, or play around with the included code, which uses SVG (scalable vector graphics). For an extra challenge, make the generators draggable.
Further:
The clustering depicted in a Voronoi diagram involves the same geometry as the two-dimensional k-means clustering sometimes employed in unsupervised machine learning applications. https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
Florence Nightingale, a nurse in the Crimean War, was another pioneer of infographics. She used graphs and statistical analysis to advocate for improved hygiene in war hospitals in Constantinople (Not Istanbul).
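To make the k-means connection mentioned above concrete, here is a minimal sketch of the two steps that k-means repeats. The assignment step is the same nearest-generator rule used to color a Voronoi diagram, with centroids playing the role of generators. The data points, starting centroids, and iteration count are made up for illustration.

```javascript
// Squared Euclidean distance between two 2-D points.
function squaredDistance(a, b) {
  return (a.x - b.x) ** 2 + (a.y - b.y) ** 2;
}

// Assignment step: label each point with the index of its nearest centroid.
// This is the nearest-generator rule that defines a Voronoi region.
function assign(points, centroids) {
  return points.map((p) => {
    let best = 0;
    for (let i = 1; i < centroids.length; i++) {
      if (squaredDistance(p, centroids[i]) < squaredDistance(p, centroids[best])) {
        best = i;
      }
    }
    return best;
  });
}

// Update step: move each centroid to the mean of the points assigned to it.
// A centroid with no assigned points stays where it is.
function update(points, labels, centroids) {
  return centroids.map((c, i) => {
    const members = points.filter((_, j) => labels[j] === i);
    if (members.length === 0) return c;
    return {
      x: members.reduce((sum, p) => sum + p.x, 0) / members.length,
      y: members.reduce((sum, p) => sum + p.y, 0) / members.length,
    };
  });
}

// Hypothetical data: two loose groups of points and two starting centroids.
const points = [
  { x: 1, y: 1 }, { x: 2, y: 1 }, { x: 1, y: 2 },
  { x: 8, y: 8 }, { x: 9, y: 8 }, { x: 8, y: 9 },
];
let centroids = [{ x: 0, y: 0 }, { x: 10, y: 10 }];

// A fixed, small number of iterations is enough for this toy data.
for (let step = 0; step < 5; step++) {
  const labels = assign(points, centroids);
  centroids = update(points, labels, centroids);
}
console.log(assign(points, centroids), centroids);
```

Each pass redraws the Voronoi regions around the current centroids and then moves every centroid to the middle of its own region's points.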
Resources:
BBC historic figures: http://www.bbc.co.uk/history/historic_figures/snow_john.shtml
John Snow Archive and Research Companion: https://johnsnow.matrix.msu.edu/index.php
Wikipedia article on the Broad Street cholera outbreak (map source): https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak
Extra History video series: The Broad Street Pump
Part 1: https://www.youtube.com/watch?v=TLpzHHbFrHY
Part 2: https://www.youtube.com/watch?v=1jlsyucUwpo
Part 3: https://www.youtube.com/watch?v=9NVT6iZP2qg
Code:
(paste into a text editor and save as an html file)
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<svg width="600px" style="border:1px solid black; background:#fff;" viewBox="0 0 100 100" id="picture" overflow="scroll"></svg>
<script>
  const NS = "http://www.w3.org/2000/svg";
  const delay = 500;      // milliseconds between subdivision steps
  const outlines = true;  // draw a thin border around each colored square
  const precision = 0.2;  // stop subdividing once squares are this small
  const pointCoords = getPointCoords(7);
  const colors = ["#425af5", "#e6495e", "#fae01e", "#f58142", "#492669", "#26695d", "#2d3057"];

  // Place n generators at random, at least 5 units from each edge.
  function getPointCoords(n){
    let a = [];
    for(let i = 0; i<n; i++){
      a.push({"x":Math.random()*90+5,"y":Math.random()*90+5});
    }
    return a;
  }

  // Color the square if all four corners share a nearest generator;
  // otherwise split it into four smaller squares and try again.
  function colorVoronoiSquare(x, y, sideLength, pointCoords){
    let aNearest = getClosestPointIndex(x, y, pointCoords);
    let bNearest = getClosestPointIndex(x+sideLength, y, pointCoords);
    let cNearest = getClosestPointIndex(x+sideLength, y+sideLength, pointCoords);
    let dNearest = getClosestPointIndex(x, y+sideLength, pointCoords);
    if(aNearest === bNearest && bNearest === cNearest && cNearest === dNearest){
      drawRect([[x,y],[x+sideLength,y],[x+sideLength,y+sideLength],[x,y+sideLength]], colors[aNearest]);
    } else {
      // The delay makes each level of subdivision visible as its own step.
      setTimeout(()=>{
        let s = sideLength/2;
        if(s > precision){
          colorVoronoiSquare(x, y, s, pointCoords);
          colorVoronoiSquare(x, y+s, s, pointCoords);
          colorVoronoiSquare(x+s, y+s, s, pointCoords);
          colorVoronoiSquare(x+s, y, s, pointCoords);
        }
      }, delay);
    }
  }

  colorVoronoiSquare(0, 0, 100, pointCoords);

  // Return the numeric index of the generator closest to (x, y).
  // Squared distances are compared, so no square root is needed.
  function getClosestPointIndex(x, y, pointCoords){
    let closestIndex = 0;
    let minD = Math.pow((x-pointCoords[0].x), 2) + Math.pow((y-pointCoords[0].y), 2);
    for(let p = 1; p < pointCoords.length; p++){
      let d = Math.pow((x-pointCoords[p].x), 2) + Math.pow((y-pointCoords[p].y), 2);
      if(d < minD){
        closestIndex = p;
        minD = d;
      }
    }
    return closestIndex;
  }

  // Draw each generator as a small white circle.
  function drawPoints(pointCoords){
    let svg = document.getElementById("picture");
    for(const p of pointCoords){
      let c = document.createElementNS(NS, "circle");
      c.setAttribute("cx", p.x);
      c.setAttribute("cy", p.y);
      c.setAttribute("r", 1.2);
      c.setAttribute("stroke-width",.5);
      c.setAttribute("fill","#fff");
      c.setAttribute("stroke","#000");
      svg.appendChild(c);
    }
  }

  // Draw a filled polygon through the given [x, y] corner coordinates.
  function drawRect(coords, color){
    let svg = document.getElementById("picture");
    let rect = document.createElementNS(NS, "polygon");
    for(const p of coords){
      const point = svg.createSVGPoint();
      point.x = p[0];
      point.y = p[1];
      rect.points.appendItem(point);
    }
    rect.setAttribute("stroke-width", outlines ? .1 : 0);
    rect.setAttribute("fill",color);
    rect.setAttribute("stroke","#000");
    svg.appendChild(rect);
  }

  // Draw the generators last so they sit on top of the squares. With the
  // settings above, the last squares are drawn roughly 4 seconds in.
  setTimeout(()=>{
    drawPoints(pointCoords);
  },5000);
</script>
</body>
</html>