data scraping – Aimee Davis

What’s the Deal with the Data?

Published on August 8, 2023 by aimeedavisauthor

Legal Disclaimer: I’m not an attorney, so nothing in this post should be construed as legal advice. If you believe your intellectual property rights have been infringed upon, please contact your agent, your publisher, and/or your (or their) legal counsel. The Authors Guild is also available for many traditionally published (or agented but not yet published) authors.

Work Disclaimer: These opinions are my own and are not intended to represent the views or opinions of my employer.

Yes, I work in software compliance. Not in the arts, but in healthcare. Still, the ethical constructs remain the same. My job is to guard data with, to be frank, utter ferocity. I take this job extremely seriously. Because at the end of the day, data isn’t simply data.

It’s also my job to remind people of that.

Data is protected health information. Emphasis on protected. It’s that diagnosis you’ve been hiding from your family because you don’t want to worry them. It’s the abortion you had when you were young held between you and your doctor. It’s your birth sex that no one has a right to know except you and your OBGYN. It’s your weight. Your mental health history. Your struggles and triumphs. It’s precious. Sacred.

Data is personal information. It’s the social security number you fought to earn over a course of years as you worked toward citizenship. It’s the driver’s license you won back after you won your sobriety. It’s the credit score you battled for on your way out of poverty. It’s the zip code you’re hiding from your abusive ex.

Data is dreams. It’s decades of callused fingers holding a pick to strings. Frustration as you mixed and remixed colors trying to capture the exact shade of pink in that sunset over your grandmother’s funeral. It’s the cool grass beneath your head, a book in your hands, reading about a female knight for the first time, realizing you might be able to write stories like this too if you tried. It’s burning, bloodshot eyes staring at draft after draft after draft. It’s bitten down pens and pencils and charcoal stained hands. It’s college tuition money you’ll never see back in your lifetime, and arguments with parents that echo in your ears as you chase a dream so far out of reach but worth chasing all the same. It’s years of jobs you hate, trudging home exhausted, trying to find time for the only thing that quenches the ache in your soul. It’s a first commission. A demo tape slipped into the right hands at the right time. An advance split into four payments that dwindle to nearly nothing but not quite. The not quite is important because it’s something after years of nothing, nothing, nothing.

Image of three monitors behind which are binary code with a city skyline.
Image sourced via Pixabay.

Yes. I take data seriously. Data ethics, too.

So imagine my surprise and dismay when I signed off my work computer after giving an hour and a half long presentation on ethical AI and secure coding to my compliance and data security teams only to find ethics had once again been breached in relation to my dream: publishing.

For those who don’t know, yesterday a company incorporated in Oregon and doing business under the name Shaxpir went viral in the Twitter writing community after it was revealed that a project called Prosecraft (operating under the Shaxpir name) had collected thousands of books to be fed into its algorithms without authors’ or publishers’ knowledge or consent. At this time, it’s unknown how many books or authors were affected, but Prosecraft boasted of having a database of more than 25,000 books. Authors like Angie Thomas, Victoria Aveyard, and Kate Elliott addressed the issue head on, stating consent was not given for their books to be listed there (yet there they were). Dozens of other authors confirmed the same. Some of them friends. Debuts. People I know who have clawed their way through impossible odds to arrive at… this.

Shaxpir founder Benjamin Smith took the Prosecraft website down after public cries of outrage and issued an… apology-looking thing, but made no mention of the data. Not where he got it, or if he was keeping it, or if it was indeed fueling Shaxpir, the software-as-a-service business model billed at $7.99 a month. According to the Shaxpir site, though, the Prosecraft data is indeed part of the paid model.

Screenshot from https://shaxpir.com/pricing showing Prosecraft: Linguistics for Literature as a paid feature. — Note that Shaxpir also boasts a “Concept Art” feature. Just pointing that out.

Taking the website down but not deleting the data is a big deal. It means there are authors out there who don’t know if their books were part of this because they didn’t get a chance to search the website before it was shut down. I myself was in the middle of searching the website for friends’ books (and actually my own once upon a time self-published books) when it was taken down. I found one final friend’s book before it went dark. I never got to check on my own.

There’s a larger picture, here, however. It’s one I talk about frequently in my day job and one that hovers at the front of my mind almost constantly. It goes beyond one man running a two-person startup, and shit apologies, and the fury burning through my veins when I see my friends’ books blatantly stolen as robot food.

It’s a picture about the larger picture.

Technology isn’t inherently evil. We can very easily make it so, though. Because we make it in our image. And when we make technology without considering the global picture, we recreate ourselves, only worse. The decisions come faster, are often unexplainable and undetectable (even to their makers), and in being so, are often indefensible. This is sometimes called the “black box” problem. To avoid it, AI and algorithms (deep learning in particular, which to be clear, Shaxpir does not appear to have been using) have to be created with purpose, transparency, ethics, and a global framework in mind. They cannot be created simply because wow, wouldn’t that be cool?

Wouldn’t that be cool? The question that started a thousand dystopian novels. (That some techbro went and data scraped for their LLM lolz.)

Image of a figure in a suit with a respirator amidst a destroyed city on fire.
Image via Pixabay — This was titled “AI generated dystopia.” Whether that means it was generated by AI or the dystopia was generated by AI is unclear. Perhaps both. Both seems appropriate. I try to avoid AI generated art in these posts (where I can; I’m no artist and sometimes can’t tell) because many of them are using LLM to create their generative art, which is stealing art the same way LLM text features are stealing books, but the transparency with the title gave me enough pause to be equally transparent with the use case here. Because transparency is what I’m about to talk about.

The fact is AI isn’t going away. It’s been here a while and it will continue to be. It’s doing amazing things in a lot of places. It’s also hurting people. Whole industries, actually. Like the people who act as its gods, it creates and destroys in their image. It can be biased and prejudiced and innovative and beautiful and ethical and transparent and honest. It can learn and develop and change and evolve. It can become worse. Or better. It depends on the guide.

Software developers are creators, too. We just speak a different sort of language. Are they all going to listen? I’d be naive if I thought so. But if enough of them do, we’ll be a hell of a lot better off.

So that’s my goal every day. To help people in tech understand these aren’t just points of data fed to a machine. To encourage them to slow down for five minutes so they might better understand that the base of that LLM learned to speak human by STEALING from a human. From someone like you. Like me. From someone who had a dream.

A dream just like theirs, really.

Am I angry? Yes. Today more than other days. Honestly, I started this blog to be a seething commentary about Shaxpir and AI and all the shit tech keeps stealing from us. But as I wrote, I realized I only feel sad. And tired. Maybe a little scared, too. I’ve fought so hard for my dream and I’m not ready to give it up.

From that springs hope. Hope for ethical, responsible AI. Hope that we can find common ground. Hope that we’ll be able to understand one another if tech can slow down and maybe we can all sit down and work this out together.

Before we destroy all the data. Or all the dreams it holds.

Image of two faces staring at one another behind a binary code of data.
Image via Pixabay.

AI and Art

Published on July 19, 2023 by aimeedavisauthor

Author’s Note: Full disclosure, my full time job is in software (healthcare sector). I’m the Vice President of Compliance, meaning I’m highly involved in data security and data sourcing. I live and breathe data issues not only in my publishing life but in my 8-8 (ha!) as well.

Disclaimer: I am not an attorney and nothing in this post should be construed as legal advice. Please consult an attorney in your jurisdiction should you require legal advice. These opinions are my own and are not intended to represent my employer.

The [Copyright] Act “reflects a balance of competing claims upon the public interest: Creative work is to be encouraged and rewarded, but private motivation must ultimately serve the cause of promoting broad public availability of literature, music, and the other arts.”
Twentieth Century Music Corp. v. Aiken, 422 U.S. 151 (1975)

Right now, a bot is scraping this for my words to train a machine to sound like me. Well, not like me specifically, because I’m a nobody, but like a human who is well-read and well-studied. A human who happened to get an 800 on the English portion of the SAT. Who has a degree in English literature from a highly ranked university. Who has written sixteen or more books. Who has a literary agent. Who has spent seemingly endless waking moments since she was four chasing the dream of becoming a published author. Who has sacrificed other dreams, other lives, other paths in that pursuit. Who has cried, screamed, bled, sweat, studied, pulled all-nighters, read millions of words, written millions of words, all pushing toward that singular goal.

Thirty-one years of language to eat. Steal. Regurgitate for a profit I’ll never see.

No big deal.

Bonus, it will also get a real legal citation it didn’t just make up. You’re welcome, ChatGPT.

It’s hard to figure out how to come at this topic, honestly. There’s a legal angle. What are the four factors that make a copyrighted work “fair use?” A technical angle. What is data scraping? A large language model (“LLM”)? An emotional angle. Why are writers and artists and actors so pissed? A philosophical angle. What does it even mean for something to be art?

I know them all. Each one pumps through my rapidly beating heart, coursing through my veins, itching to be freed through my fingers. Tabs fly open as I try to discern what angle I’ll take. On my right screen, tabs upon tabs upon tabs of Westlaw copyright cases. On my left screen, emails and articles about LLM and NLP (natural language processing) and the differences between the two. Techopedia ready to go, to explain. All the while, thoughts of that horror movie M3GAN flash through my mind.

Does AI write itself as the villain, I wonder?

Perhaps that’s how we know it truly is starting to come alive…

Something shudders through me. An echo. A whisper against the back of my neck. Somewhere, a ripple.

A droplet of water starts a ripple on top of a book.
Image sourced via Pixabay.

My ADHD flies into overdrive. Spurred on by the unknown. The unseen. Desperate. Trying to outpace a thing I know I cannot outrun.

On the right screen, I open the Authors Guild’s open letter to generative AI leaders: OpenAI; Alphabet; Meta; Stability AI; IBM; Microsoft—God there are so many already, since yesterday it seems—begging them to stop this madness, to pay writers their fair share. Another tab. The NPR article about median writers’ income for 2022 being $23,000. Poverty levels for the US for 2022. $13,590 for an individual. $18,310 for two. $23,030 for three. There it is. Poverty comes quickly. A single child and a spouse not working for whatever reason, there are so many reasons these days. A single parent and two kids. Options there, too. Never mind I don’t know anyone who can live off $23,000 on their own in Philadelphia and publishing doesn’t pay for an author’s healthcare.

Our dreams. Our dinner. Our lives. Our livelihoods. Is there nothing they can’t have?

They’re the new-age Ursula, stealing our voices and our princes and our happily ever afters. There’s probably a book there somewhere if the bots don’t scrape it first.

My neck aches. I press my fingers into the place where my skull meets my spine, molding my skin like clay. Skin. Clay. Sleep. My stomach growls, reminding me of my humanity. I ignore it. Move forward.

I’ve written this before. Literally and metaphorically. I’ve been drafting it in my mind. But a draft I spent hours on also disappeared. I thought about giving up. It doesn’t matter anyway. I can’t keep up. But it has to come out. I’m angry enough to write it again. And again. And again. Our dreams are being fed to the machine while we aren’t being paid enough to feed ourselves.

Virginia Woolf comes to mind. Money and a room of one’s own is needed to write fiction. Art is for the economically privileged. It always has been. Does it surprise us that art was the first to fall victim to Silicon Valley?

The fair use doctrine permits courts to avoid rigid application of the copyright statute when, on occasion, it would stifle the very creativity which that law is designed to foster.
Stewart v. Abend, 495 U.S. 207 (1990)

So far, there are concrete answers. What is fair use. What is LLM and NLP. Why are authors and artists and actors mad.

What is literature, though. That has me frozen.

Except… maybe there aren’t concrete answers for everything.

I turn to the case the Authors Guild has cited, Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 143 S.Ct. 1258 (May 18, 2023). In it, the court cites to Andy Warhol’s famous Campbell’s Soup Can series.

Painting of Campbell's Black Bean Soup done by Andy Warhol. — From the case: Figure 7. A print based on the Campbell’s soup can, one of Warhol’s works that replicates a copyrighted advertising logo. Image from Westlaw. © Andy Warhol.

The court notes the purpose of the Campbell’s logo and label was commercial: to advertise soup. Warhol’s purpose in reproducing the image was the opposite: to comment on consumerism. Therefore, the use was fair.

Controversial statement but… I wonder if the designer of the label, someone who was presumably a real human who didn’t profit off the can label nearly as much as Warhol profited off the reproduction of the soup label, finds that particularly fair.

I wonder if that person cares they are unknown for their creation while Warhol is known for its reproduction.

I wonder if I am also afraid of AI replacing me into anonymity.

Art is my only potential legacy, after all.

Pertinent to this discussion, I googled “Who designed the Campbell’s soup logo” and Google highlighted Andy Warhol despite his name not making an appearance in my search because hi, bots. Also, people apparently ask if Andy Warhol designed it. Just saying. But for the record, Dr. John T. Dorrance created the logo in 1897. In 1898, Herberton L. Williams swapped the orange and blue (yikes) colors out for red and white because he saw Cornell’s colors at a football game and liked them better. Herberton later became the company’s treasurer, comptroller, and assistant general manager, so probably we don’t have to cry too hard for that guy.

Is it the intent of our art that makes it art, then?

Because as someone who has spent a lot of time in writing workshops listening to snobbery about the bastardization of literature due to genre fiction’s pandering to commercialization; who has also spent a lot of time listening to programmers talk about programming rules that sound a lot like intent, let me tell you about how that is a slippery slope.

The court goes on.

“The Court of Appeals noted, correctly, that ‘whether a work is transformative cannot turn merely on the stated or perceived intent of the artist or the meaning or impression that a critic—or for that matter, a judge—draws from the work. [O]therwise, the law may well recogniz[e] any alteration as transformative.’”
Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 143 S.Ct. 1258 (May 18, 2023) (Internal Citations Omitted)

Not intent then. Or at least not entirely.

Relief floods me.

Still, legality and literature have entangled in my mind. Day job and dream job comingling again.

What makes something art.

Life.

The answer is life.

It’s a falsity that you must suffer to create art. But you must live. My art requires suffering because that’s my lived experience. All that’s truly required, however, is a lived experience. Write what you know. You.

Art is about the individual life experience. The individual voice. The individual expression. Not a little of me and a little of you and a little of him and a little of her and a little of them strewn together to create one voice, one story, one experience. Art requires many singular voices and stories and experiences. Canon but more importantly, culture, then becomes that body of singular works. The thousand, single stories. Then a thousand more. Art isn’t a single story put together by a thousand voices. That’s what creates the danger Chimamanda Ngozi Adichie warns of in that TED talk. The danger of the monolith. AI puts together experiences that cannot be unified or reconciled. That’s exactly what marginalized voices have been fighting against except so much worse. It’s the stripping of language of all its nuance. All its individuality.

We have forgotten.

Words have power.

We have to pay people appropriately, so they can wield them well.

Maybe then we will remember.

And wake.