Our new Satan: artificial idiocy and big data mining

This is what we are increasingly trusting our future to:

[Embedded tweet from Cory Doctorow showing a post flagged by Tumblr's porn filter]

Notice the parenthetical. It would seem that Tumblr's artificial idiocy porn filter hasn't improved much since December 2018, when Louise Matsakis explained that

computers . . . detect whether groups of pixels appear similar to things they’ve seen in the past. Tumblr’s automated content moderation system might be detecting patterns the company isn’t aware of or doesn’t understand. “Machine learning excels at identifying patterns in raw data, but a common failure is that the algorithms pick up accidental biases, which can result in fragile predictions,” says Carl Vondrick, a computer vision and machine learning professor at Columbia Engineering. For example, a poorly trained AI for detecting pictures of food might erroneously rely on whether a plate is present rather than the food itself.[1]
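
Matsakis's food-and-plate example is easy to reproduce in miniature. Here's a minimal sketch, with synthetic data and numbers invented purely for illustration, of a classifier latching onto a confounding feature: because the "plate" feature nearly always co-occurs with the "food" label in the training data, the model learns the plate, not the food.

```python
# A toy "food detector": in the training data, food almost always sits on a
# plate, so the plate becomes the feature the model actually learns.
# All data here is synthetic, invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
is_food = rng.integers(0, 2, n)                # the label we want to predict
has_plate = is_food.copy()                     # plate co-occurs with food...
flip = rng.random(n) < 0.05                    # ...about 95 percent of the time
has_plate[flip] = 1 - has_plate[flip]
food_signal = is_food + rng.normal(0, 2.0, n)  # genuine but noisy cue for food

X = np.column_stack([has_plate, food_signal])
model = LogisticRegression().fit(X, is_food)
print(model.coef_)  # the plate coefficient dwarfs the genuine food signal
```

Show such a model food without a plate, or a plate without food, and its predictions collapse: exactly the "fragile predictions" Vondrick describes.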

A word you’ll see often with artificial idiocy is “probabilistic.” This means that the systems are relying on statistical methods—correlation—to reach their conclusions. Matsakis again:

WIRED tried running several Tumblr posts that were reportedly flagged as adult content through Matroid’s NSFW natural imagery classifier, including a picture of chocolate ghosts, a photo of Joe Biden, and one of Burstein’s patents, this time for LED light-up jeans. The classifier correctly identified each one as SFW, though it thought there was a 21 percent chance the chocolate ghosts might be NSFW. The test demonstrates there’s nothing inherently adult about these images—what matters is how different classifiers look at them.[2]
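
That 21 percent figure is worth pausing on. A probabilistic classifier never answers "porn" or "not porn"; it emits a score, and the platform picks a cutoff. A minimal sketch, with threshold values that are my own assumptions rather than Tumblr's actual settings:

```python
# A classifier emits a probability; the platform's threshold turns it
# into a flag. The cutoffs below are assumptions for illustration only.
def flag_as_nsfw(p_nsfw: float, threshold: float = 0.5) -> bool:
    return p_nsfw >= threshold

print(flag_as_nsfw(0.21))        # False: the chocolate ghosts pass at a 0.5 cutoff
print(flag_as_nsfw(0.21, 0.2))   # True: an aggressive 0.2 cutoff flags them
```

Same image, same score; whether it counts as "adult content" depends entirely on where the operator draws the line.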

The differing results likely stem from the different data sets used to "train" the porn filters,[3] which is to say we're right back to what we said when I was a computer programmer in the late 1970s and early 1980s: "Garbage In, Garbage Out."

But the first thing they teach you in any statistics class is “correlation does not prove causation.” One of my statistics texts cites an example where “weekly flu medication sales and weekly sweater sales for an area with extreme seasons would exhibit a positive association because both tend to go up in the winter and down in the summer.”[4] The text continues:

The problem is that the explanation for an observed relationship usually isn’t so obvious as it is in the medication and sweater sales example. Suppose the finding in an observational study is that people who use vitamin supplements get fewer colds than people who don’t use vitamin supplements. It may be that the use of vitamin supplements causes a reduced risk of a cold, but it is also easy to think of other explanations for the observed association. Perhaps those who use supplements also sleep more and it’s actually the sleep difference that causes the difference in the frequency of colds. Or, the users of vitamin supplements may be worried about good health and will take several different actions designed to reduce the risk of getting a cold.[5]

The problem here appears in both examples. It isn't that sweater sales cause people to buy flu medicine or vice versa. It's that people want to stay warm in the winter, in part to reduce the risk of falling ill, and that they buy flu medication when they fall ill anyway. The risk and the response both occur at roughly the same time, so a correlation appears due to a third factor, cold weather, which prompts actions intended to keep warm and is also presumed to increase susceptibility to the flu.
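
The sweater example is easy to simulate. In the sketch below, with all numbers invented for illustration, both sales series are driven by the same seasonal cycle and nothing else, yet they correlate almost perfectly:

```python
# Two series with no causal link, both driven by a third factor: the season.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
weeks = np.arange(104)                               # two years of weekly data
cold = np.cos(2 * np.pi * weeks / 52)                # proxy for winter weather
sweaters = 100 + 40 * cold + rng.normal(0, 5, 104)   # weekly sweater sales
flu_meds = 80 + 30 * cold + rng.normal(0, 5, 104)    # weekly flu-medication sales

print(np.corrcoef(sweaters, flu_meds)[0, 1])  # ~0.95, with no causal link between them
```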

Similarly, vitamins might, or might not, reduce the risk of catching a cold. Other healthy habits, getting enough sleep, for example, might also help.

In neither case has a causal relationship been shown between the correlated variables. In one case, the causal variable is (presumably) winter. In the other, it might be extra sleep or other preventative measures. But artificial idiocy relies on correlation anyway, even as statisticians jump up and down and scream, "Don't do that!"

The computer scientists' response is to rely on "big data." A larger sample size narrows the confidence interval, allegedly reducing the probability of error. But the problem lies in the method itself, not in the confidence interval, as we see with Tumblr's flagging of the image in Cory Doctorow's tweet quoted above.
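
Here's a minimal sketch of why more data doesn't rescue a flawed method, with the true value, bias, and scale all invented for illustration: if the measurement itself is biased, a larger sample only tightens the confidence interval around the wrong answer.

```python
# The true quantity is 0, but the method adds a constant bias of 1.
# Growing n shrinks the 95 percent interval; it never removes the bias.
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=1.0, scale=5.0, size=n)      # biased readings of a true 0
    half_width = 1.96 * sample.std(ddof=1) / np.sqrt(n)  # 95 percent CI half-width
    print(f"n={n:>9,}: {sample.mean():+.3f} +/- {half_width:.3f}")
```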

Long before artificial idiocy, the professor in my first methods class, Valerie Sue, warned against "data mining," which she explained as searching large datasets for correlations. Such correlations might be entirely spurious, she cautioned, because they may not correspond to actual causal variables at all. "Big data" is simply "big data mining," the very thing she warned against, just on a massive scale.
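
Professor Sue's warning is easy to demonstrate. In the sketch below, built on pure noise invented for illustration, none of the ten thousand candidate variables has anything to do with the target, yet a naive search still "discovers" a couple dozen strong-looking correlations:

```python
# Data mining pure noise: with enough candidate variables, chance alone
# produces "findings." All data is random, invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=100)                # 100 observations of the outcome
candidates = rng.normal(size=(10_000, 100))  # 10,000 unrelated variables

r = np.array([np.corrcoef(target, c)[0, 1] for c in candidates])
print(int((np.abs(r) > 0.3).sum()))  # roughly two dozen spurious "strong" correlations
```

Big data makes this worse, not better: the more variables you mine, the more spurious "findings" chance alone will hand you.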

Tumblr seems to have chosen its "big data" poorly for identifying porn, and it seems to have failed to fix a problem that is now over a year old. But the fundamental fallacy remains the same: correlation, by itself, proves absolutely nothing.

One of the reasons I had to leave the San Francisco Bay Area was that, judging from the billboards nearly everywhere I drove, artificial idiocy is nonetheless the new god there. It is worshipped. It is trusted. It is taken as infallible.

From a human science perspective, this idolatry is appalling. Statistical data are quantitative and therefore superficial. I can’t emphasize this strongly enough: These correlations are thus between superficial variables. Our new priesthood consists of those who “operationalize” rich reality, reducing it to mere quantities. Our new prophets are those who report the results.

This is even more appalling from a systems theory perspective, in which linear causation is the exception rather than the rule: A rarely causes B; rather, A and B arise jointly, influencing each other, constraining each other in some ways, enhancing each other in others. Except it isn't just A and B. It's a multitude of factors in a relationship of mutual causality, creating an ecosystem and producing emergent properties: results that cannot be forecast from, and cannot be reduced to, the components of the system.[6]

Artificial idiocy is not merely a false god. It is a modern Satan, tempting us with correlations, leading us astray with superficial data, and taken as authoritative simply because it processes data beyond the capacity of mere humans.

[1] Louise Matsakis, "Tumblr's Porn-Detecting AI Has One Job—and It's Bad at It," Wired, December 5, 2018, https://www.wired.com/story/tumblr-porn-ai-adult-content/
[2] Louise Matsakis, "Tumblr's Porn-Detecting AI Has One Job—and It's Bad at It," Wired, December 5, 2018, https://www.wired.com/story/tumblr-porn-ai-adult-content/
[3] Louise Matsakis, "Tumblr's Porn-Detecting AI Has One Job—and It's Bad at It," Wired, December 5, 2018, https://www.wired.com/story/tumblr-porn-ai-adult-content/
[4] Jessica M. Utts and Robert F. Heckard, Mind on Statistics, 2nd ed. (Belmont, CA: Brooks/Cole, 2004), 157.
[5] Jessica M. Utts and Robert F. Heckard, Mind on Statistics, 2nd ed. (Belmont, CA: Brooks/Cole, 2004), 157.
[6] Fritjof Capra, The Web of Life: A New Scientific Understanding of Living Systems (New York: Anchor, 1996); Joanna Macy, Mutual Causality in Buddhism and General Systems Theory (Delhi, India: Sri Satguru, 1995).

