Author’s Note: Aimee, you’re writing about AI AGAIN?! Yes. Yes, I am. Sorry, but Twitter is just such a terrible place for nuance, and I work in software, and it is making me wild over here seeing all the confusion about the tech. That said, this blog really doesn’t encompass even a small fraction of the data ethics issues at play currently, even in publishing. It also doesn’t address many of the AI Models out there; again, only the ones relevant to the exact conversation at hand, which is Generative AI and LLM data-scraping practices and the harm they’re causing in publishing (to artists and authors).
Work Disclaimer: These opinions are my own and are not intended to represent the views or opinions of my employer.
AI, oh my! It’s all over everything everywhere. And everyone has an opinion about it. Usually love it or hate it. Not often anywhere in between.
Shocking no one, I’m the someone in between.
As a VP of Compliance at a software company who spends my days in the trenches of the AI debate, I get it. Some people want to poke at it with a fifty-foot pole while squinting. Some people can’t gobble the tech up fast enough before demanding more.
Almost no one wants to take time to understand the nuance of the intersecting fields of AI. Definitely no one wants to hear about the people behind them.
You know I’m not going to let that slide, right?
Brief History of AI and AI Ethics

As we can see, AI has been around a while. And while the federal government (in the United States, anyway) hasn’t been doing much, some states have (California most notably, but also Colorado, Virginia, Utah, and Connecticut, with more to come, no doubt), as have other countries and regions, the most notable example being the European Union’s GDPR.
Interestingly, writers have been debating AI and robotics ethics far longer than legislators. Isaac Asimov (yep, the science fiction writer) penned an ethical code for AI in 1942 in his short story “Runaround.” That ethical code is still used as a real framework for some AI ethics concepts to this day. However, until the early 2000s, pretty much the only people concerned about AI ethics or safety were, uh… well, writers. Some scientists and engineers did discuss technological safety, and even published papers on it, but until the new millennium it remained primarily a speculative element portrayed in fiction and movies.
In 2000, the Singularity Institute for Artificial Intelligence (now the Machine Intelligence Research Institute) was created with the purpose of identifying and managing potential risks from artificial intelligence. From there, the AI Ethics movement took off.
Over the next decade, more research institutes focused on AI ethics were founded. Some to prop up the tech companies sponsoring them. Some to do real work. Time sorted some of them out, others still have to be sorted. What the government failed to do, consumers began to take charge of.
In short, capitalism did what capitalism does best: rule by purse.
Now, larger tech corporations hire for positions like “Ethical AI Architects,” “AI Ethics Program Managers,” and “Chief AI Ethics Officers,” and develop responsible AI frameworks and ethical AI policies. Major universities like Northeastern, Berkeley, and Duke offer postdoctoral research programs for AI Ethicists and Ethical AI. Conferences are held across industries to discuss AI and what to do with it. Who can go a day without seeing an email about the latest in AI developments or scandals? All right, that might be just my job, but I’m pretty sure every industry is feeling and seeing the impacts of not only the AI boom but the Responsible AI Revolution.
Great. Tech is teching. Researchers are researching. Capitalists are capitalizing. Legislators are procrastinating. But what the hell does it all mean?

All the Techbro Lingo (Definitions First, Stay with Me)
We use this sweeping phrase “AI,” as in artificial intelligence, to encompass a huge amount of technology, much of which we use every day and have been using our entire lives without ever caring about it. Whether we should’ve cared about it is another question, but let’s set that aside for right now. Some definitions. I swear I’ll apply them right after this; I just want to be clear that AI isn’t AI isn’t AI. There are loads of different kinds and honestly, some of it isn’t that fancy (some of the least fancy stuff I haven’t even included here).
Artificial Intelligence (“AI”): A field of study within computer science concerned with developing computer systems that perform specific functions or tasks that would normally require human intelligence. Remember, the first AI was literally a program that played checkers. Not fancy. I mean, it was to the computer developer who made it, but not to the average human who has a brain that intrinsically knows how to learn stuff like how to play checkers.
AI Model: The use or creation, training, and deployment of machine learning algorithms that emulate logical decision-making based on available data. What needs to be understood about any given system is which AI Model it uses, not simply that it’s AI. Not all possible AI Models are listed here, just the ones relevant to what’s discussed in this blog.
Foundation Model: A large AI Model, Pre-Trained (often on many different data formats like text, images, video, speech, numbers, etc.) which can then be used as the basis for many different types of tasks and operations.
Generative AI: An AI Model that can be used to create new content reflective of the general characteristics of the training data without duplicating the training data.
Large Language Model (“LLM”): An AI Model that analyzes massive volumes of data, learning from them to create sophisticated output. This is the modeling system used in Generative AI chatbots such as ChatGPT.
Pre-Trained: Also called Transfer Learning, this machine learning method reuses a model previously trained for one task as the starting point for another, reducing cost and training time; the reused model is then fine-tuned for the new task.
Web scraping: Also referred to as data scraping, this is the process of using bots to extract content and data from a website so it can be replicated elsewhere.
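Since data scraping sits at the heart of this whole conversation, here’s a minimal sketch of what a scraping bot actually does, assuming the Python requests and BeautifulSoup libraries and a hypothetical URL; real scraping operations run something like this across millions of pages and pipe the results straight into training datasets.

```python
# Minimal illustration of web scraping: fetch one page and pull out its text.
# Real pipelines do this across millions of URLs, then clean and aggregate
# the text into training datasets.
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url: str) -> str:
    """Download a web page and return its human-readable text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style tags so only visible content remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

if __name__ == "__main__":
    # Hypothetical URL: an author's sample chapter, a blog post, anything public.
    text = scrape_page_text("https://example.com/sample-chapter")
    print(text[:500])  # this is what would go "into the dataset"
```

That’s it. No login, no permission check, no notification to whoever wrote the page.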
Turning this Nonsense into Useful Information: Examples
There are two things to keep in mind when we’re talking about AI: (1) knowing which system(s) are being used; and (2) following the training data.
Example: Generative AI Using Only Your Data
Generative AI is bad.
Well, not necessarily. If I’m an artist, or an author, and I build my own Generative AI system and keep it closed (i.e., the only data ever used is mine, there is nothing Pre-Trained and no Foundation Model), then all it’s ever replicating is my work. If it spits out some derivative based on my work, and I edit it, tweak it, alter it further, make it just so, is that really bad, or is it a new way to do art? Is it any different than a computer speeding up the typing process, or Photoshop (also a form of AI, by the way) helping a photographer touch up their photo, or a digital notepad for a graphic designer?
Note: This is not the same thing as training on a “closed system.” When you hear someone in software say they’re using a closed system, they mean the opposite of open source, in that the specifications are kept secret and can’t be modified or used by third parties. These systems still often use open source or purchased Foundation Models or Pre-Trained bases of some sort (which, remember, themselves needed data to create). The company then inputs its own data into those bases for specific customization.
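To make the “only my data” idea concrete, here’s a toy sketch in Python using a word-level Markov chain as a stand-in for a real Generative AI system (a real one would be a neural network, and the file names here are hypothetical). The point is simply that every word this thing can ever produce came out of my own manuscripts.

```python
# Toy "closed" generative model: a word-level Markov chain built only from my
# own manuscripts. A stand-in for a real Generative AI system; the point is
# that the only data it ever sees (and can ever echo back) is mine.
import random
from collections import defaultdict
from pathlib import Path

def train(paths):
    """Build a next-word lookup table from my own text files only."""
    chain = defaultdict(list)
    for path in paths:
        words = Path(path).read_text(encoding="utf-8").split()
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
    return chain

def generate(chain, start_word, length=50):
    """Spit out new text that only recombines words from my own work."""
    word, output = start_word, [start_word]
    for _ in range(length):
        if word not in chain:
            break
        word = random.choice(chain[word])
        output.append(word)
    return " ".join(output)

if __name__ == "__main__":
    # Hypothetical file names: my manuscripts, my drafts, nothing scraped.
    model = train(["my_novel.txt", "my_short_stories.txt"])
    print(generate(model, start_word="The"))
```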

Example: Generative AI Use in the Same or Similar Industry
In my opinion, the clearest “do not do” issue in publishing right now is the use of Generative AI in the same or similar industries. What does this mean?
Well, in the publishing industry right now, there are software products that produce cover art using Generative AI trained on data scraped from artists’ and graphic designers’ websites and social media pages without their consent (and often without their knowledge). Some of these artists do, could, or would make cover art given the opportunity. Using software that does it instead of hiring a real artist causes economic harm to real artists. For authors to use this kind of art is, in my opinion, an ethical issue: authors’ work can be similarly data scraped and used to train software to create books, so authors should show the same concern and care to protect their fellow creators within the same industry.
Reminder: This can be complicated in traditional publishing because authors often don’t have much control over their covers.
Exception: Bria AI, an Israel-based AI image generator, has partnered with Getty Images (to much applause from data ethicists) to create AI models, including Generative AI models, trained only on licensed content. In addition to only training AI on images licensed for that use, Bria AI is implementing technology that will compensate the original creators when the AI platform generates images based on their photos.

Example: Generative AI in a Non-Adjacent Industry
Finally, we get to the squishiest ethical loop of them all. What do we do about… everything else?
We don’t know what is being data scraped or for what purpose, really. Assume everything. Then we have no idea to what extent data is being used to generate Foundation Models for all sorts of systems the original creators never intended their work to find its way into. And because it’s being taken without the creators’ knowledge, aggregated with billions of other bits of data, then spit out in a new form in systems all over the world, the creator has zero control over their work.
This blog, for example, could be getting data scraped right now. Bots might be taking this content and aggregating it as part of the dataset for an LLM. Let’s say they are. I don’t really care about that, because this blog isn’t something I want to sell. Doesn’t matter what I care about, though. Into the dataset it goes.
Let’s say instead, I have the first chapter of my copyrighted, published book posted on my website, which is something many self-published authors (and some traditionally published authors depending on contract) do to entice readers to buy their books. I definitely don’t want that going just anywhere. That’s my product. I’m actively trying to sell it. Still doesn’t matter. Into the dataset it goes.
Maybe that dataset is part of an LLM that’s then used to generate a book. I think you can probably see why an author might have an issue with that even if you want to argue it’s not illegal. That falls under the example I discussed above, really.
What if it’s used not to create a book, though, but to create a Foundation Model or a Pre-Trained base for something in a totally different industry? What if all the developer wants is to grab your writing to train its AI to sound human, so a company in, say, the pharmaceutical industry can then input its own drug data into it to have a chatbot educate doctors about its products? Have you ever read a pharmaceutical insert? Training a chatbot on that information alone wouldn’t make it very, well… chatty. To have chatbots that can successfully interact with humans, companies need AI trained on something that’s read dialogue. Lots of it.
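If you want to picture how that works mechanically, here’s a hedged sketch using the Hugging Face transformers and datasets libraries: start from a small model pre-trained on human writing (much of it scraped from the open web) and fine-tune it on the company’s own documents. The model choice and file name are illustrative, not any real company’s pipeline.

```python
# Sketch of the "non-adjacent industry" pattern: take a small model pre-trained
# on human writing, then fine-tune it on a company's own domain documents.
# Model name ("gpt2") and data file are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# The pre-trained base already "sounds human": that ability came from its
# original training data, much of it scraped from the open web.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The company's own data (a hypothetical file of drug documentation) goes on top.
dataset = load_dataset("text", data_files={"train": "drug_documentation.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pharma-chatbot", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the result: a bot that chats like a human but talks about drugs
```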
Now, this hypothetical pharmaceutical company will make money off the chatbot, have no doubt. It’s selling their product at decreased cost and increased margins!
But what about the people who taught the bot to speak? The people like me who have expensive degrees in English and qualifications and literary agents and copyrighted or copyrightable material that’s been scraped without consent? Who have read thousands of books to learn how to do what they do? Who have written millions of words to get where they are? Who have labored over their craft for hours for no money? Should we get paid now? Does this “count” as someone making a profit off our labor? Our expertise? Our intellectual property? Were we inadvertent consultants to the bot? Should we be paid for our time? But how do you quantify that when it’s split into billions? What if some of it was work we intended to sell while some of it wasn’t?
Then what about the people not at all like me? The people who have none of those things but still created original content with zero intent to sell it? They still, arguably, had intellectual property data scraped without their consent and then commercialized. What about the rapid-fire tweeters who harbor dreams of going viral? Social media mom groups who would love to have sponsorship? Who doesn’t have a secret dream of putting something random on the internet and getting rich somehow? And who gets to say whose work matters more, is worth more? When does it become “stealing?” When does it become “sellable?”
The honest answer for me is I’m not sure. I don’t know if the solution is to ban data scraping entirely, or if we even can. I suspect it’s uncontrollable, though putting control back into the hands of creators is probably the ideal solution. We focus on PII (personally identifiable information) and PHI (protected health information) in all these laws, but not on all of our content. Getting control over anything and everything we don’t want fed to mystery machines is the utopian solution, in my opinion. Now, whether that’s feasible or practical, what impact that might have on slowing down what tech is doing positively in so many industries, and whether I have a moral issue with slowing that down if my work feeding a chatbot for pharma means they have more resources for cancer research… I have no idea. I suspect time will tell for us all.
And I didn’t even get to bias, or black boxes, or bad data, or hallucinations, or what happens if your scraped content, once aggregated, ends up generating something you’re morally opposed to…
I guess it’s like our parents always said: Be careful what you put on the internet, kids.