I recently read Russell’s book Human Compatible, which proposes the following three laws as a solution-in-principle to the AI alignment problem:

1. The sole objective of the AI is to maximize human preferences.
2. The AI is initially uncertain about what those preferences are.
3. Human behavior is the primary source of information about human preferences.

Russell then spends a considerable portion of the book discussing what this would look like in practice: how such an AI would deal with the various ways humans fail to conform to the mathematical ideal of rationality, and how a consequentialist approach to ethics applies to these AI. While he provides (or at least gestures toward) technical solutions to many of the problems he raises, it’s clear the approach as a whole is still aspirational; this is not (yet) a cookbook, though Russell is hopeful that applicable recipes can be invented and mathematical proofs of guaranteed benefit can be composed.

After some consideration, two problems stick in my mind. I would greatly appreciate any discussion of these two problems, but especially discussion that proposes plausible solutions.

1: AI must be made good before it is safe to make it smart, but it must be smart to be good.

Russell describes in one example an official, Harriet the human, who takes bribes to fund her children’s education, as she cannot afford college on her meager salary as a public servant. Her provably beneficial robot Robbie, Russell claims, will not take up the task of helping her extract bribes more effectively, but will instead find other ways to help get the kids to college. Russell doesn’t provide details, but one might imagine Robbie tutoring the kids to boost their academics, identifying relevant scholarships and helping them apply, or finding Harriet a higher-paying job.

My problem here is that Robbie may need better-than-human-average theory of mind and general intelligence to frame the problem in such a manner and find an even halfway effective solution, on top of decent “morality”. Robbie must see past Harriet’s instrumental goals (taking bribes, making money) to her terminal goals (getting the kids to college, giving them better future prospects), possibly without Harriet ever explicitly admitting her goals or methods. He must decide that the terminal goals are the important ones, and invent ways to satisfy them without harming other humans. If he tutors the kids, he needs to understand all their schoolwork (which most parents struggle with) and be able to explain it well (which many teachers struggle with). To get scholarships or a job, he needs to navigate many complex human structures and processes to identify good opportunities, then step back and coach the family through winning the opportunity themselves, rather than applying on their behalf.

In short, to come up with this ‘good’ (‘provably beneficial’) solution, Robbie needs to be smart. But anyone familiar with the alignment problem knows it is not safe to build superintelligent AI (which I will loosely define as ‘AI smarter than its user’) until the alignment problem is thoroughly solved; in other words, it has to be ‘good’ before we can allow it to be smart. That’s a circular problem: neither property can safely come before the other.

2: A clearly identified type of ‘irrationality’ can be worked around, but how do we tell the types apart?

Suppose Robbie has worked for Harriet for a while, and has drawn conclusions about her dietary preferences.
Then, one day, she refuses food he thinks she would like. How does Robbie handle it? The unacceptably glib answer is “Robbie updates his model of Harriet’s preferences.” In actual practice, a severe mismatch between preference model and behavior can happen for a variety of reasons, which should be handled with different (sometimes opposing) strategies. Here are several real-world examples of how a mismatch might happen:

1. Harriet’s preferences are more complex than Robbie’s model can describe. (E.g., she prefers one meal on workdays and another when not working, but Robbie expects a single consistent favorite meal.)
2. Harriet’s preferences have changed. (E.g., a recent illness changed the physical mechanisms by which she tastes food.)
3. Harriet does not know, or is uncertain about, her preferences. (Harriet has never tried durian. Robbie knows Harriet’s genetic profile means she’ll probably enjoy durian, but Harriet has only heard it described by people who hate it and so is hesitant to risk it.)
4. Harriet’s preferences are based on a false model of the world. (Harriet thinks acai berries are a cure-all, but they are not.)
5. Harriet is almost completely irrational. (Harriet is two years old, or experiencing a psychotic break, or a compulsive liar, or…)

Solutions to each of these scenarios are proposed in the book. Some are solutions-in-principle that need further work to fill out; others seem to have real solutions already in use. Regardless, my worry is not solving these cases individually; it is how you can tell the cases apart, since their solutions are very different. For instance, case 1 requires Robbie to invent new parameters for his model, case 2 is best handled by resetting Robbie’s priors about Harriet’s tastes in food (while leaving other preference categories untouched), and case 5 requires that Robbie mostly ignore Harriet’s stated preferences.

Now, a self-reflective and communicative Harriet working with an insightful and communicative Robbie could probably work out between them which case applies (though, again, we have the problem that Robbie must already be smart to achieve this). But what if communicating with the user isn’t possible? Maybe Harriet is terrible at self-reflection and self-expression. Or maybe Robbie is serving not the individual Harriet but the nation of Hungary (population ~10 million). It is unlikely to be practical to communicate with each citizen at length, and unlikelier still that the zeitgeist of the nation will hold conversations with Robbie about why, all of a sudden, there is a shift in public opinion on a previously well-decided matter. How, then, does Robbie determine the cause of the sudden change, and thus the correct strategy for responding?
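To make the difficulty concrete, here is a toy sketch of “telling the cases apart” framed as Bayesian model comparison. This is my own construction, not anything from Russell’s book; the hypotheses, priors, probabilities, and meal data are all invented for illustration:

```python
# Toy illustration: Robbie scores competing explanations for a run of refused
# meals. Observations are (accepted?, workday?) pairs. All numbers are made up.
from math import prod

# Ten accepted meals on workdays, then three refusals on non-workdays.
observations = [(1, True)] * 10 + [(0, False)] * 3

def likelihood_stable(obs):
    """The current model is right; refusals are just noise (P(accept) = 0.9)."""
    return prod(0.9 if a else 0.1 for a, _ in obs)

def likelihood_changed(obs, change_point=10):
    """Case 2: preferences changed partway through (accept 0.9, then 0.2)."""
    return prod(
        (0.9 if a else 0.1) if i < change_point else (0.2 if a else 0.8)
        for i, (a, _) in enumerate(obs)
    )

def likelihood_richer(obs):
    """Case 1: the model is missing a parameter; preference depends on workday."""
    return prod(
        (0.9 if a else 0.1) if workday else (0.2 if a else 0.8)
        for a, workday in obs
    )

hypotheses = {
    "model is fine (noise)":        (0.50, likelihood_stable),
    "preferences changed (case 2)": (0.25, likelihood_changed),
    "model too simple (case 1)":    (0.25, likelihood_richer),
}

# Posterior over hypotheses: prior * likelihood, normalized.
unnormalized = {name: prior * fn(observations) for name, (prior, fn) in hypotheses.items()}
total = sum(unnormalized.values())

for name, weight in sorted(unnormalized.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} posterior ~ {weight / total:.3f}")
```

With this particular run of refusals, the “preferences changed” and “model too simple” hypotheses assign identical likelihoods to the data, so they come out tied at roughly 0.5 each, and those two cases demand opposite repairs (reset the food priors vs. add a workday parameter). A short stretch of behavior alone doesn’t tell Robbie which repair to make; he needs either more targeted observations or exactly the kind of conversation that may not be available.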
Originally posted by u/ElephantWithAnxiety on r/ArtificialInteligence
