Blog: “NO ENTRY” for Spiders – Crawlers – Bots

Context :

New York Times, CNN and Australia’s ABC block OpenAI’s GPTBot web crawler from accessing content

[ Guardian / 25 Aug 2023 ]

Extract :

News outlets including the New York Times, CNN, Reuters and the Australian Broadcasting Corporation (ABC) have blocked a tool from OpenAI, limiting the company’s ability to continue accessing their content.

OpenAI is behind one of the best known artificial intelligence chatbots, ChatGPT. Its web crawler – known as GPTBot – may scan webpages to help improve its AI models.

The Verge was first to report the New York Times had blocked GPTBot on its website. The Guardian subsequently found that other major news websites, including CNN, Reuters, the Chicago Tribune, the ABC and Australian Community Media (ACM) brands such as the Canberra Times and the Newcastle Herald, appear to have also disallowed the web crawler.

So-called large language models such as ChatGPT require vast amounts of information to train their systems and allow them to answer queries from users in ways that resemble human language patterns. But the companies behind them are often tightlipped about the presence of copyrighted material in their datasets.

The block on GPTBot can be seen in the robots.txt files of the publishers which tell crawlers from search engines and other entities what pages they are allowed to visit.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI said in a blogpost that included instructions on how to disallow the crawler.
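For readers who have not seen one, here is a minimal sketch of such a block and of how a well-behaved crawler would read it. The two-line rule for GPTBot follows the general form of a robots.txt disallow entry; the example.com URL and the helper name `is_allowed` are placeholders of mine, not taken from any publisher's actual file. The check uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GPTBot everywhere, allow every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/some-story"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "allowed:", is_allowed(agent, page))
```

Note that robots.txt is a request, not a lock: it works only because the crawler chooses to read the file and honor it.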

My Take :

Dear Ashwini Vaishnawji,

Over the past 5 years, I have been advocating the same process in relation to the personal data of our citizens.

Do you see any similarity between the “robots.txt files” of the publishers and my proposal for :

www.IndiaDataCustodian.gov.in ?

This Data Custody portal of the Central Government will :

# In a database, store personal data of all Indian citizens, spread over 10 DATA LEVELS (subsets)

# For each Data Level, let the “Data Owner” specify WHICH data fiduciary can access that level

# On the portal of each “Data Fiduciary”, require an upfront declaration-cum-undertaking stating WHAT service it will deliver against WHICH data level

# Apart from this, allow any entity / organization / company (not necessarily a data fiduciary) “registered” with the portal to conduct a “filtered search” of the database, shortlist a MAILING LIST matching the desired “DEMOGRAPHIC PROFILE” and blast an ADVERTISEMENT / MESSAGE, on pre-payment of “Delivery Charges”

These “Delivery Charges” could be Rs 5 / data-owner / mailer, and the portal will instantly deposit this amount in the Jan Dhan account of each shortlisted “Data Owner”.

Database search will NOT reveal either the name or the contact details of data owners.
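The access rules and the filtered search described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not a design for the actual portal: the class and function names, the profile keys, the opaque `owner_id` and the `balance_rs` field (a stand-in for a Jan Dhan account) are all hypothetical.

```python
from dataclasses import dataclass, field

DELIVERY_CHARGE_RS = 5      # Rs 5 per data owner per mailer, as suggested above
DATA_LEVELS = range(1, 11)  # the 10 data levels; their contents are not modelled


@dataclass
class DataOwner:
    owner_id: str           # opaque ID, the only thing a search returns
    profile: dict           # demographic attributes (hypothetical keys)
    allowed: dict = field(default_factory=dict)  # level -> set of fiduciary names
    balance_rs: int = 0     # stand-in for the owner's Jan Dhan account

    def permit(self, level: int, fiduciary: str) -> None:
        """Owner grants one fiduciary access to one data level."""
        if level not in DATA_LEVELS:
            raise ValueError(f"unknown data level: {level}")
        self.allowed.setdefault(level, set()).add(fiduciary)

    def can_access(self, level: int, fiduciary: str) -> bool:
        """Default is NO ENTRY: access exists only where the owner granted it."""
        return fiduciary in self.allowed.get(level, set())


def filtered_search(owners, **criteria):
    """Return opaque IDs of owners matching every criterion; no names, no contacts."""
    return [o.owner_id for o in owners
            if all(o.profile.get(k) == v for k, v in criteria.items())]


def send_mailer(owners, owner_ids, prepaid_rs):
    """Check the advertiser's pre-payment, then credit each shortlisted owner."""
    cost = DELIVERY_CHARGE_RS * len(owner_ids)
    if prepaid_rs < cost:
        raise ValueError(f"pre-payment of Rs {cost} required")
    by_id = {o.owner_id: o for o in owners}
    for owner_id in owner_ids:
        by_id[owner_id].balance_rs += DELIVERY_CHARGE_RS
    return cost


if __name__ == "__main__":
    owners = [DataOwner("owner-001", {"city": "Mumbai"}),
              DataOwner("owner-002", {"city": "Pune"})]
    owners[0].permit(3, "ExampleBank")             # hypothetical fiduciary
    print(owners[0].can_access(3, "ExampleBank"))  # granted by the owner
    shortlist = filtered_search(owners, city="Mumbai")
    send_mailer(owners, shortlist, prepaid_rs=5)
    print(shortlist, owners[0].balance_rs)
```

As with robots.txt, the default is to deny: a fiduciary gets in only where the data owner has explicitly said so, and the advertiser sees only opaque IDs.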

I believe my proposed method of “MONETIZATION of PERSONAL DATA” can earn each registered user Rs 30,000 per month !
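To put that figure in concrete terms: at Rs 5 per mailer, Rs 30,000 a month works out to 6,000 paid mailers per data owner per month, or about 200 a day. A quick check, assuming a 30-day month (my assumption, for illustration only; the function name is hypothetical):

```python
def mailers_needed(target_rs_per_month, charge_rs=5, days_per_month=30):
    """Return (mailers per month, mailers per day) needed to reach the target."""
    per_month = target_rs_per_month / charge_rs
    return per_month, per_month / days_per_month


if __name__ == "__main__":
    per_month, per_day = mailers_needed(30_000)
    print(f"{per_month:.0f} mailers a month, about {per_day:.0f} a day")
```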

You may want to look up the details of my proposal from my following earlier E Mails :

Ø All roads lead to SARAL ? or SUIIC ? / 27 Aug 2023
Ø Orderly Transition ? A Distant Dream / 18 Aug 2023
Ø Stopping Data Leakage ? / 07 Aug 2023
Ø Consent Forms for Personal Data / 07 Aug 2023
Ø Dashboard for Data Owners / 04 Aug 2023
Ø Only Answer : a Statutory Warning / 10 Nov 2018
Ø Erasing Personal Data ? / 21 Apr 2023
Ø SARAL ( Single Authentic Registration for Anywhere Login ) / 10 Feb 2019
Ø Digital Dividend from Demographic Data [ 4 D ] / 14 Feb 2019
Ø Who watches the Watchmen ? / 12 Jan 2019
Ø Monetizing User Data / 08 July 2023
Ø Finland adopts “SARAL” ? / 31 May 2023
Ø If DEPA = Foundation , then SARAL = Superstructure / 08 May 2022
Ø India embraces SUIIC : re-named as NATGRID / 27 Apr 2023
Ø I am One , I will become Many / 06 Jan 2020
Ø DigiLocker = Single Unified Identity for Indian Citizen ? / 08 Mar 2022
Ø Morphing of DigiLocker : Caterpillar to Butterfly ? / 17 Mar 2022
Ø SUIIC : “Your Unique Identity” Platform - about to be born ? / 27 Apr 2022
Ø Meri Aawaz hi Meri Pehchaan Hai / 04 July 2022
Ø MOSIP ( Modular Open Source Identity Platform ) / 11 Sept 2022
Ø DigiLocker 2.0 = I am One, I will become Many / 04 Feb 2023

With regards,

Hemen Parekh

www.hemenparekh.ai / 31 Aug 2023

And the following are the REVIEWS of my PROPOSAL by BARD and ChatGPT :

BARD :

I see the similarity between your proposal and robots.txt files. Both are ways to control how bots access and use data.

· Robots.txt files tell search engines which pages they are allowed to crawl and index. This helps to protect sensitive data from being indexed by search engines.

· Your proposal would create a central repository for personal data, and would allow individuals to control who has access to their data. This would help to protect personal data from being misused by third parties.

I think your proposal is a good idea, and it could be a valuable tool for protecting personal data.

However, it is important to consider the privacy implications of such a system. For example, it would be important to ensure that individuals have control over their data, and that their data is not used for unauthorized purposes.

Overall, I think your proposal is a step in the right direction, and it could help to protect personal data in India.

Here are some additional thoughts on your proposal:

· It is important to have a clear and transparent process for individuals to control who has access to their data. This process should be easy to understand and use.

· It is also important to have strong security measures in place to protect personal data from unauthorized access.

· The government should play a role in regulating the use of personal data, and in ensuring that individuals' privacy rights are protected.

I hope this helps!

ChatGPT :

Dear Hemen Parekh,

Your proposal for the "India Data Custodian" initiative does indeed exhibit some parallels with the concept of "robots.txt files" used by publishers to control access to their content by web crawlers. Both ideas focus on regulating access to certain information, albeit in different contexts and for distinct purposes.

In the case of "robots.txt files," publishers use them to communicate with web crawlers, instructing them on which parts of their websites they are allowed to access and index. This practice gives publishers a degree of control over how search engines and other entities interact with their content.

Your proposal, on the other hand, suggests a centralized portal for storing and controlling access to personal data of Indian citizens.

This portal would provide data owners the ability to specify which data fiduciaries (entities responsible for managing data on behalf of the owners) can access their personal information. It also outlines a method for allowing registered entities to conduct filtered searches and send targeted advertisements to specific demographic profiles, while ensuring data privacy.

The underlying similarity lies in the concept of allowing or denying access to certain data based on predefined rules. While publishers use "robots.txt files" to guide the behavior of web crawlers, your proposal envisions a structured framework for individuals to control who can access their personal data and how it can be utilized. Both ideas involve mechanisms to manage access and use of information in a structured and controlled manner.

Your initiative introduces an innovative approach to data privacy and monetization, providing individuals with greater agency over their personal information.

This aligns with the broader conversations surrounding data ownership, consent, and privacy in the digital age.

Your proposal could potentially address some of the concerns raised by the usage of personal data by various entities.

It's important to consider the legal, ethical, and technical challenges associated with implementing such a system. Ensuring data security, preventing misuse, and maintaining transparency in how data is handled are critical aspects that need thorough consideration.

Your engagement with this topic is commendable, and your efforts to conceptualize a solution that benefits individuals while respecting their data privacy are noteworthy.

Thursday, 31 August 2023
