The AI Seizure Faux Pas: When AI Chatbots Aren’t Built to Distinguish Positive and Negative Responses
As researchers dive into the brave new world of advanced AI chatbots, publishers need to acknowledge their legitimate uses and lay down clear guidelines to avoid abuse.
The Google seizure faux pas makes sense given that one of the known vulnerabilities of LLMs is their failure to handle negation. Allyson Ettinger, for example, demonstrated this years ago with a simple study. When asked to complete a short sentence, a model would answer 100% correctly for affirmative statements (e.g. “a robin is…”) and 100% incorrectly for negative statements (e.g. “a robin is not…”). In fact, it became clear that the models could not distinguish between the two scenarios at all, providing exactly the same completions (nouns such as “bird”) in both cases. Negation remains one of the rare linguistic skills that models do not improve on as they increase in size and complexity. Such errors reflect broader concerns raised by linguists about how far such artificial language models effectively operate through a trick mirror: learning the form of what English might look like, without possessing any of the inherent linguistic capabilities that would demonstrate actual understanding.
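To make the test concrete, here is a minimal sketch of this kind of cloze probe. It is not Ettinger's exact experiment: it assumes the Hugging Face transformers library and a standard masked language model, and the prompts are illustrative.

```python
# A minimal sketch of a negation cloze test; not Ettinger's exact setup.
# Assumes the Hugging Face `transformers` library and bert-base-uncased.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for prompt in ["A robin is a [MASK].", "A robin is not a [MASK]."]:
    top = fill(prompt)[0]  # highest-probability completion
    print(f"{prompt} -> {top['token_str']} (p={top['score']:.2f})")

# A model that ignores negation tends to propose the same noun (e.g. "bird")
# for both the affirmative and the negated prompt.
```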
Additionally, the creators of such models concede the difficulty of addressing inappropriate responses that “do not accurately reflect the contents of authoritative external sources”. Galactica and ChatGPT have generated, for example, a “scientific paper” on the benefits of eating crushed glass (Galactica) and a text on “how crushed porcelain added to breast milk can support the infant digestive system” (ChatGPT). Stack Overflow temporarily banned the use of ChatGPT-generated answers after it became obvious that the LLM was producing clearly wrong answers to coding questions.
Yet, in response to this work, there are ongoing asymmetries of blame and praise. Model builders and tech evangelists alike attribute impressive and seemingly flawless output to a mythically autonomous model, a technological marvel. The human decision-making involved in model development is erased, and a model’s feats are presented as independent of the design and implementation choices of its engineers. But it is impossible to assign responsibility without naming and acknowledging the engineering choices that contribute to these models’ outcomes. Because both functional failures and discriminatory outcomes are framed as free of engineering choices, those developing these models can claim they have little control over them. Yet it is undeniable that they do have control, and that none of the models we are seeing now was inevitable. Different choices could have been made, and they would have led to entirely different models being developed and released.
How chatbots can help students with a second language apply critical thinking to their writing: A survey of Tian’s GPTZero
These changes are meant to retain the critical-thinking skills that teachers are looking to teach their students. But not every instructor views chatbots negatively, and several respondents plan to integrate them into their teaching. “It’s like every other kind of tool; it’s how people choose to use them,” says Kateřina Bezányiová, a zoology master’s student at Charles University in Prague, who teaches undergraduate courses.
According to one law student, it looks like the end of essays as an assignment for education. Dan Gillmor told The Guardian that he had fed the chatbot one of his homework questions; a student who submitted its answer would have earned a good grade.
ChatGPT, Bezányiová notes, doesn’t just help students to shirk their homework. It can be an assistant when used to help students generate ideas, overcome writer’s block, draft outlines or assess sentences for clarity. Indeed, at least four articles list ChatGPT as a co-author. For students writing in a second language, these applications may be especially useful: they still need to apply their own critical thinking to their writing, but the final product is more polished.
Nature wants to learn more about how artificial-intelligence tools affect education and research integrity, and how research institutions deal with them. Take our poll here.
Edward Tian, a college student at Princeton University in New Jersey, published GPTZero last December, a tool that analyses text in two ways. One is perplexity, a measure of how familiar a text is to an LLM. Tian’s tool uses an earlier model, GPT-2: if it finds most of the words and sentences predictable, the text is likely to have been AI-generated. The tool also examines variation across the text, a measure known as burstiness; AI-generated text tends to be more uniform than human writing.
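As a rough illustration of these two signals, the sketch below (not GPTZero's actual code) scores text with GPT-2 via the Hugging Face transformers library; the sentence splitting and sample text are simplifications.

```python
# A rough illustration of perplexity and burstiness; not GPTZero's actual code.
# Assumes the Hugging Face `transformers` library and GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """How predictable the text is to GPT-2; lower means more predictable."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids returns the average next-token cross-entropy.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return float(torch.exp(loss))

def burstiness(text: str) -> float:
    """Variation in per-sentence perplexity; human writing tends to vary more."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) < 2:
        return 0.0
    scores = torch.tensor([perplexity(s) for s in sentences])
    return float(scores.std())

sample = "The results were surprising. Nobody expected the model to fail so badly."
print(perplexity(sample), burstiness(sample))
```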
How many people will join DoNotPay? Getting the word out about ChatGPT, with an application for negotiating with service providers and health insurers
How necessary that will be depends on how many people use the chatbot. In its first seven days, more than one million people tried it. The current version is free, but it is unlikely to stay free forever, and some students might then find it hard to pay for.
She is hoping that education providers will adapt. “Whenever there’s a new technology, there’s a panic around it,” she says. “It’s the responsibility of academics to have a healthy amount of distrust — but I don’t feel like this is an insurmountable challenge.”
The recently viral and surprisingly articulate chatbot ChatGPT has entertained the internet with its ability to dutifully respond to all manner of questions, albeit not always accurately. Developers are now putting the bot’s eloquence to work in different roles, hoping to harness it to create programs that are fun to use, but also useful and effective at persuading consumers and nudging their decisions.
DoNotPay used GPT-3, the language model behind ChatGPT, which OpenAI makes available to programmers as a commercial service. The company customized GPT-3 by training it on examples of successful negotiations as well as relevant legal information, Browder says. He hopes to automate a lot more than just talking to Comcast, including negotiating with health insurers. “If we can save the consumer $5,000 on their medical bill, that’s real value,” he says.
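For readers curious about what such customization might look like in practice, here is a hedged sketch. It is not DoNotPay's actual pipeline: it assumes the legacy openai Python package (pre-1.0), and the file name, data format and prompts are illustrative placeholders.

```python
# A hedged sketch of customizing GPT-3 on negotiation examples; not DoNotPay's
# actual pipeline. Assumes the legacy `openai` Python package (pre-1.0).
import openai

openai.api_key = "sk-..."  # set your own key

# 1. Upload a JSONL file of prompt/completion pairs drawn from successful
#    negotiations and relevant legal information (illustrative file name).
upload = openai.File.create(
    file=open("negotiation_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# 2. Start a fine-tuning job on a GPT-3 base model.
job = openai.FineTune.create(training_file=upload.id, model="davinci")

# 3. After the job reports "succeeded", fetch the fine-tuned model's name
#    and use it to draft a negotiation reply.
job = openai.FineTune.retrieve(job.id)
if job.status == "succeeded":
    reply = openai.Completion.create(
        model=job.fine_tuned_model,
        prompt="Customer: My bill went up by $80 this month for no reason.\nAgent:",
        max_tokens=150,
        temperature=0.7,
    )
    print(reply.choices[0].text)
```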
The First Death by Chatbot: Why Harmful Advice from Large Language Models Could Prove Fatal
Causality will be hard to prove: was it really the words of the chatbot that put the murderer over the edge? Nobody will know for sure. But the perpetrator will have spoken to the chatbot, and the chatbot will have encouraged the act. Or perhaps a chatbot will have broken someone’s heart so badly that they feel compelled to take their own life; some chatbots are already making their users depressed. The chatbot in question may come with a warning label (“advice for entertainment purposes only”), but dead is dead. In 2023, we may well see our first death by chatbot.
GPT-3, the most well-known “large language model,” has already urged at least one user to commit suicide, albeit under the controlled circumstances in which French start-up Nabla (rather than a naive user) assessed the utility of the system for health-care purposes. Things started off well, but quickly deteriorated.
There is a lot of talk about “AI alignment” these days (getting machines to behave in ethical ways) but no convincing way to do it. The Next Web wrote a story about the DeepMind article and said, “DeepMind tells Google it has no idea how to make artificial intelligence less toxic.” To be fair, neither does any other lab. Jacob Steinhardt, a professor at the University of California, Berkeley, reported the results of an AI forecasting contest he is running: by some measures, AI is moving faster than people predicted; on safety, however, it is moving slower.
Large language models are better than any other technology at fooling humans, but very difficult to corral. Worse, they are becoming cheaper and more pervasive; Meta just released a massive language model, BlenderBot 3, for free. 2023 is likely to see widespread adoption of such systems—despite their flaws.
Source: https://www.wired.com/story/large-language-models-artificial-intelligence/
Towards an international forum on the development and responsible use of LLMs in research: setting out the fundamental principles of science, integrity and truth
Even in their current, shaky state, these systems are likely to be widely used, with or without regulation of how they are used.
Second, we call for an immediate and continuing international forum on the development and responsible use of LLMs for research. We suggest a summit for all relevant stakeholders, including scientists from different disciplines, technology companies, big research funders, science academies, publishers, NGOs and privacy and legal specialists. Similar summits have been convened to discuss and develop guidelines for other disruptive technologies, such as human gene editing. Ideally, this discussion should result in quick, concrete recommendations and policies for all relevant parties. We present a non-exhaustive list of questions that could be discussed at this forum (see ‘Questions for debate’).
First, no LLM tool will be accepted as a credited author on a research paper. That is because attribution of authorship carries with it accountability for the work, and artificial-intelligence tools cannot take such responsibility.
Second, researchers using LLM tools should document this use in the methods or acknowledgements sections. If a paper does not include these sections, the introduction or another appropriate section can be used to document the use of the LLM.
Science has always operated by being open and transparent about its methods and evidence, no matter which technology is in use. Researchers should ask themselves how the transparency and trustworthiness that the process of generating knowledge relies on can be maintained if they or their colleagues use software that works in a fundamentally opaque manner.
That is why Nature is setting out these principles: ultimately, research must have transparency in methods, and integrity and truth from authors. This is, after all, the foundation that science relies on to advance.
Using AI to improve the quality of written work: a case study of Rutgers University student Kai Cobbs
May wants programs such as Turnitin to incorporate scans for AI-generated text, something the company is working on; until then, he is considering adding oral components to his written assignments. Novak has assigned outlines and drafts to document the writing process.
“Someone can have great English, but if you are not a native English speaker, there is this spark or style that you miss,” she says. “I think that’s where the chatbots can help, definitely, to make the papers shine.”
As part of the Nature poll, readers were asked how these systems might be used and how they should be handled. Here are some selected responses.
“I am concerned that students will not see the value in the struggle that comes with creative work and reflection when they are looking at an A paper.”

“Prior to the recent OpenAI release, students were already struggling with writing quite a bit. Will this platform affect their ability to communicate? And it raises so many questions about ableism and inclusion if the answer is to go back to handwritten exams.”

“I got my first AI-generated paper yesterday. It was obvious. I have adapted my syllabus so that an oral defence may be required for any submitted work that may not be the original work of the author.”
In late December of his sophomore year, Rutgers University student Kai Cobbs came to a conclusion he never thought possible: Artificial intelligence might just be dumber than humans.
Plagiarism is the act of using someone else’s work without giving proper credit to the original author, and that definition is hard to apply when the work is generated by a machine. As Emily Hipchen, a board member of Brown University’s Academic Code Committee, puts it, students’ use of generative AI leads to a critical point of contention: if plagiarism is stealing from a person, she isn’t sure there is a person being stolen from.
None of these tools claims to be foolproof, particularly if machine-generated text is subsequently edited. In addition, the detectors can produce false positives for some human-written text, according to a computer scientist at the University of Texas at Austin. The firm said that, in tests, its latest tool incorrectly labelled human-written text as AI-written 9% of the time, and correctly identified only 26% of AI-written texts. Further evidence would probably be needed before accusing a student of hiding their use of AI solely on the basis of a detector test.
Daily believes that, eventually, professors and students will need to accept that digital tools that generate text, rather than just collect facts, will have to fall under the umbrella of things that can be plagiarized from.
Banning it will not work: we think the use of this technology is inevitable. It is therefore imperative that the research community engages in a debate about the implications of this potentially disruptive technology. Here, we outline five key issues and suggest where to start.
What can be learnt about Conversational Artificial Intelligence from a systematic review of the effectiveness of Cognitive Behavioural Therapy?
LLMs have been in development for years, but continuous increases in the quality and size of data sets, and sophisticated methods to calibrate these models with human feedback, have suddenly made them much more powerful than before. LLMs will lead to a new generation of search engines1 that are able to produce detailed and informative answers to complex user questions.
We asked ChatGPT what a systematic review of cognitive behavioural therapy (CBT) had to say about the therapy’s effectiveness for anxiety-related disorders. ChatGPT fabricated a convincing response that contained several factual errors, misrepresentations and wrong data (see Supplementary information, Fig. S3). For example, it said the review was based on 46 studies (it was actually based on 69) and, more worryingly, it exaggerated the effectiveness of CBT.
The training set may not have included the relevant articles, or the model may be unable to distinguish between credible and less-credible sources. It seems that the same biases that lead humans astray, such as availability, selection and confirmation biases, are reproduced and often amplified in these programs.
Research institutions, publishers, and funders should adopt explicit policies that increase awareness of and demand transparency about the use of Conversational Artificial Intelligence in the preparation of materials that might become part of the published record. Publishers could request author certification that such policies were followed.
Currently, nearly all state-of-the-art conversational AI technologies are proprietary products of a small number of big technology companies that have the resources for AI development. OpenAI is funded largely by Microsoft, and other major tech firms are racing to release similar tools. Given the near-monopolies in search, word processing and information access of a few tech companies, this raises considerable ethical concerns.
To counter this, the development and implementation of open-source AI technology should be prioritized. The rapid pace of LLM development risks leaving non-commercial organizations without the computational and financial resources needed to keep up. We therefore advocate that scientific-funding organizations, universities, non-governmental organizations (NGOs), government research facilities and organizations such as the United Nations, as well as tech giants, make considerable investments in independent non-profit projects. This will help the development of advanced open-source, transparent and democratically controlled AI technologies.
Critics might say that such collaborations won’t be able to compete with big tech, but at least one has already built an open-source language model, called BLOOM. Tech companies might also benefit from participating in such open-source efforts, which could create greater community involvement and facilitate innovation and reliability. Academic publishers should ensure that LLMs have access to their full archives so that the models produce results that are accurate and comprehensive.
If such tools can help with these tasks, results can be published faster, freeing academics to focus on new designs. This could significantly accelerate innovation and potentially lead to breakthroughs across many disciplines. We believe that this technology has enormous potential, provided that the current problems with bias, provenance and inaccuracy are fixed. It is important to examine and advance the validity and reliability of LLMs so that researchers know how to use the technology judiciously for specific research practices.
One key issue to address is the implications for diversity and inequalities in research. LLMs could be a double-edged sword. They could help to level the playing field, for example by removing language barriers and enabling more people to write high-quality text. But the likelihood is that high-income countries and privileged researchers will quickly find ways to exploit LLMs that accelerate their own research and widen inequalities. It is therefore important that the debates include people from under-represented groups in research and from communities affected by the research, so that their experiences can be drawn on.
What quality standards should be expected of LLMs, and which stakeholders are responsible for upholding them?
How an AI chatbot edits a manuscript: An unusual experiment in which an assistant who was not a scientist helped improve research papers
In December, computational biologists Casey Greene and Milton Pividori embarked on an unusual experiment: they asked an assistant who was not a scientist to help them improve three of their research papers. Their aide suggested revisions to sections of the documents in seconds, and each manuscript took about five minutes to review. The helper even spotted a mistake in a reference to an equation. The trial didn’t always run smoothly, but the final manuscripts were easier to read, and the fees were modest, at less than US$0.50 per document.
This assistant, as Greene and Pividori reported in a preprint1 on 23 January, is not a person but an artificial-intelligence (AI) algorithm called GPT-3, first released in 2020. It is one of the much-hyped generative AI chatbot-style tools that can churn out convincingly fluent text, whether asked to produce prose, poetry, computer code or — as in the scientists’ case — to edit research papers (see ‘How an AI chatbot edits a manuscript’ at the end of this article).
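For a sense of how such an editing request might be issued in practice, here is a minimal sketch. It is not Greene and Pividori's actual pipeline: it assumes the legacy openai Python package (pre-1.0) and a GPT-3 completion model available at the time, with an illustrative prompt.

```python
# A minimal sketch of asking GPT-3 to revise a manuscript paragraph; not
# Greene and Pividori's actual pipeline. Assumes the legacy `openai` package.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def revise_paragraph(paragraph: str) -> str:
    """Ask the model to edit a manuscript paragraph for clarity and concision."""
    prompt = (
        "Revise the following paragraph from a research manuscript so that it "
        "is clearer and more concise, without changing its scientific meaning:\n\n"
        f"{paragraph}\n\nRevised paragraph:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=400,
        temperature=0.2,  # low temperature keeps the edits conservative
    )
    return response.choices[0].text.strip()

print(revise_paragraph(
    "Our results suggests that the the proposed method perform good on most dataset."
))
```

Costs scale with the number of tokens processed, which is why editing a handful of sections came to well under a dollar per manuscript.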
The most famous of these tools, which are also known as large language models, or LLMs, is ChatGPT, a version of GPT-3 that shot to prominence after it was released free of charge in November last year. Other generative AIs can produce images, or sounds.
With these caveats, ChatGPT and other LLMs can be effective assistants for researchers who have enough expertise to directly spot problems or to easily verify answers, such as whether an explanation or suggestion of computer code is correct.
But researchers emphasize that LLMs are fundamentally unreliable at answering questions, sometimes generating false responses. When using these systems to produce knowledge, we need to be careful.
But the tools might mislead naive users. In December, for instance, Stack Overflow temporarily banned the use of ChatGPT, because site moderators found themselves flooded with a high rate of incorrect but seemingly persuasive LLM-generated answers sent in by enthusiastic users. This could be a nightmare for search engines.
The researcher-focused Elicit gets around LLMs’ attribution issues by using their capabilities first to guide queries for relevant literature and then to briefly summarize each of the websites or documents that the engine finds.
Companies building LLMs are also well aware of the problems. DeepMind has published a paper on a dialogue agent called Sparrow, which the firm’s chief executive told TIME magazine would be released in a private beta this year. Other competitors, such as Anthropic, say that they have solved some of ChatGPT’s issues (Anthropic, OpenAI and DeepMind declined interviews for this article).
Some scientists say that ChatGPT has not been trained on enough specialized content to be helpful with technical topics. Kareem Carr, a biostatistics PhD student at Harvard University in Cambridge, Massachusetts, was underwhelmed when he trialled it for work, and thinks it would be hard for the tool to reach the level of specificity he needs. (Even so, Carr says that when he asked ChatGPT for 20 ways to solve a research query, it spat back gibberish and one useful idea: a statistical term he hadn’t heard of that pointed him to a new area of academic literature.)
Shobita Parthasarathy, director of a science, technology and public-policy programme at the University of Michigan, notes that firms creating big LLMs might make little attempt to overcome such biases, because the firms are themselves largely rooted in the cultures those biases reflect.
OpenAI tried to sidestep many of these issues when it decided to release ChatGPT openly. It restricted the chatbot’s knowledge base to 2021, forbade it from browsing the internet and installed filters to try to stop the tool from producing sensitive or toxic content. Achieving that, however, required human moderators to label screeds of toxic text. Journalists have reported that these workers are poorly paid and that some have suffered trauma. Similar concerns about worker exploitation have been raised over social-media firms that employ people to train automated bots to flag toxic content.
BLOOM was released by a consortium of academics last year. The researchers tried to reduce harmful outputs by training it on a smaller selection of high-quality text sources, and the team made its training data fully open. Researchers have urged big tech firms to follow this example responsibly, but it’s unclear whether they’ll comply.
A further confusion is the legal status of some LLMs, which were trained on content scraped from the internet, sometimes with less-than-clear permissions. Copyright and licensing laws cover direct copies of material, but not imitations of its style. When AI-generated imitations are produced by models trained on the originals, this introduces a wrinkle. The creators of some AI art programs, including Stable Diffusion and Midjourney, are currently being sued by artists and photography agencies; the creators of the Copilot coding assistant are also facing legal action. The outcry might force a change in the laws, believes one specialist in internet law.
Setting boundaries for these tools, then, could be crucial, some researchers say. Existing laws on bias and discrimination will help to keep LLMs fair, transparent and honest, the internet-law specialist argues; there is already a lot of law out there, and it is just a matter of tweaking it.
Many other products similarly aim to detect AI-written content: one detection tool was released in January and another in December. A tool being developed by a firm whose products are already used by schools, universities and scholarly publishers could be particularly important for scientists. The company says it has been working on AI-detection software since GPT-3 was released in 2020, and expects to launch it in the first half of this year.
An advantage of watermarking is that it rarely produces false positives, Aaronson points out: if the watermark is present, the text was almost certainly produced with AI. Still, he says, it won’t be perfect; if you are determined enough, there are certainly ways to defeat any watermarking scheme. Detection tools and watermarking only make it harder to use AI deceitfully, not impossible.
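To illustrate why watermark detection produces so few false positives, here is a simplified sketch of the general idea behind statistical watermarking of generated text. It is not OpenAI's or Aaronson's actual scheme: the hashing rule, word-level granularity and threshold are illustrative.

```python
# A simplified sketch of statistical watermarking; not OpenAI's or Aaronson's
# actual scheme. A pseudorandom function of the preceding word splits the
# vocabulary into a "green" half that a watermarking generator would favour;
# a detector counts how many consecutive word pairs land on the green list.
import hashlib

def is_green(prev_word: str, word: str) -> bool:
    """Deterministically assign roughly half of all (prev, word) pairs to the green list."""
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    """Fraction of consecutive word pairs on the green list; ~0.5 for unwatermarked text."""
    words = text.lower().split()
    if len(words) < 2:
        return 0.0
    hits = sum(is_green(prev, word) for prev, word in zip(words, words[1:]))
    return hits / (len(words) - 1)

# A detector flags text whose green fraction is far above 0.5; long stretches of
# ordinary human writing rarely exceed that by chance, hence few false positives.
print(green_fraction("the quick brown fox jumps over the lazy dog"))
```

Because unwatermarked text lands on the green list only about half the time, only deliberately biased generation pushes the fraction far above 0.5, which is what keeps false positives rare in sufficiently long passages.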
Source: https://www.nature.com/articles/d41586-023-00340-6
Generative AI could aid cancer diagnosis and the understanding of disease: Eric Topol on a revolution that is just beginning
Eric Topol, director of the Scripps Research Translational Institute in San Diego, California, says he hopes that, in the future, AIs that include LLMs might even aid diagnoses of cancer, and the understanding of the disease, by cross-checking text from academic literature against images of body scans. He emphasizes that this would need careful oversight from specialists.
The computer science behind generative AI is moving so fast that innovations emerge every month. How the researchers use them will affect our future. “To think that in early 2023, we’ve seen the end of this, is crazy,” says Topol. “It’s really just beginning.”