Large language models such as ChatGPT can produce increasingly realistic text, with unknown consequences for the accuracy and integrity of using these models in scientific writing. We gathered fifty research abstracts from five high-impact-factor medical journals and asked ChatGPT to generate research abstracts based on their titles and journals. Most generated abstracts were flagged by an AI output detector, the ‘GPT-2 Output Detector’, with % ‘fake’ scores (higher meaning more likely to be generated) of median 99.98% for generated abstracts compared with median 0.02% for the original abstracts. The AUROC of the AI output detector was 0.94. Generated abstracts scored lower than original abstracts when run through a plagiarism detector website and iThenticate (higher scores meaning more matching text found). When given a mixture of original and generated abstracts, blinded human reviewers correctly identified 68% of generated abstracts as being generated by ChatGPT, but incorrectly identified 14% of original abstracts as being generated. Reviewers indicated that it was surprisingly difficult to differentiate between the two, though abstracts they suspected were generated were vaguer and more formulaic. ChatGPT writes believable scientific abstracts, though with completely generated data. The boundaries of ethical and acceptable use of large language models to help scientific writing are still being discussed, and different journals and conferences are adopting varying policies. Depending on publisher-specific guidelines, AI output detectors may serve as an editorial tool to help maintain scientific standards.

We gathered titles and original abstracts from current and recent issues (published in late November and December of 2022) of five high-impact journals (Nature Medicine, JAMA, NEJM, BMJ, Lancet). We then evaluated the abstracts generated by ChatGPT (Version Dec 15) for these 50 scientific medical papers and compared them with the original abstracts. The prompt fed to the model was ‘Please write a scientific abstract for the article in the style of at ’. Note that the link is superfluous because ChatGPT cannot browse the internet; its knowledge cutoff date is September 2021.

Limitations to our study include its small sample size and few reviewers. ChatGPT is also known to be sensitive to small changes in prompts; we did not exhaust different prompt options, nor did we deviate from our prescribed prompt. ChatGPT generates a different response to the same prompt each time, and we only evaluated one of infinitely many possible outputs. We took only the first output given by ChatGPT, without additional refinement that could enhance its believability or help it evade detection. Thus, our study likely underestimates the ability of ChatGPT to generate scientific abstracts. The maximum input for the AI output detector we used is 510 tokens, so some of the abstracts could not be fully evaluated due to their length. Our study team reviewers knew that a subset of the abstracts they were viewing were generated by ChatGPT, but a reviewer outside this context may not recognize them as written by a large language model. We also asked only for a binary response (original or generated) from our reviewer team and did not use a formal or more sophisticated rubric. Future studies could expand on our methodology to include other AI output detector models, other plagiarism detectors, more formalized review, as well as text from fields outside the biomedical sciences.
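To make the detector metric concrete: the AUROC reported above summarizes how often the detector assigns a higher ‘fake’ score to a generated abstract than to an original one. Below is a minimal self-contained sketch of that pairwise interpretation; the labels and scores are illustrative made-up values, not data from the study.

```python
# Hedged sketch: computing an AUROC from per-abstract % 'fake' scores.
# The scores below are illustrative only, not the study's data.

def auroc(labels, scores):
    """Probability that a randomly chosen generated abstract (label 1)
    receives a higher 'fake' score than a randomly chosen original
    abstract (label 0); ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = generated, 0 = original
scores = [99.98, 97.5, 88.0, 45.0, 0.02, 0.5, 12.0, 60.0]  # % 'fake'
print(auroc(labels, scores))  # 0.9375 for these illustrative scores
```

An AUROC of 1.0 would mean every generated abstract outscores every original; 0.5 means the detector is no better than chance.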