
Blog scraping: Difference between revisions

From Wikipedia, the free encyclopedia
‎ reffed WP:FEB24 #149
Merged content to Web scraping, following AfD-merge; see Wikipedia:Articles for deletion/Blog scraping (easy-merge)
Tag: New redirect
{{Afd-merge to|Web scraping|Blog scraping|27 October 2023}}
#REDIRECT [[Web scraping]]
'''Blog scraping''' is the process of scanning a large number of [[blog]]s, usually with automated software, to find and copy content. The software, and the individuals who run it, are sometimes referred to as blog scrapers.


{{R from merge}}
Blog scraping is copying a blog, or blog content, that is not owned by the individual initiating the scraping process. If the material is copyrighted, this is considered [[copyright infringement]] unless a license relaxes the copyright or the country's law provides a fair-use or private-use exception. The scraped content is often used on spam blogs, or [[splog]]s; such sites are called [[Scraper site|scraper sites]].

==History==
After the birth of the [[History of the World Wide Web|'''World Wide Web''']] in 1989, the first web robot,<ref>{{cite web |url=http://www.searchenginehistory.com/ |title=Search Engine History.com |website=Search Engine History |access-date=November 26, 2019}}</ref> '''[[World Wide Web Wanderer]]''', was created in June 1993. It was intended only to measure the size of the web.

==Issues==
A blog scraper who gathers [[copyright]]ed material can be considered in violation of the law, depending on the case, the use made of the data, and the country. Blog scraping can create problems for the individual or business that owns the blog, and it is particularly worrisome for business owners and business bloggers. Scrapers can copy an entire post from an independent or business blog. The duplicated content will then include the author's tag and a link back to the author's site (if that link appears in the author's tag). However, most blog scrapers copy only the portion of the content that is keyword-relevant to their splog topic. Doing so increases the keyword relevancy of the scraper's site. In addition, because the entire post is not scraped, outbound links are eliminated, which means the scraper site's [[search engine]] ranking is not reduced.

Additionally, scraped content can appear on any type of splog or [[RSS (file format)|RSS]]-fed spam site. An unsuspecting individual could therefore find their creative or copyrighted material copied onto a site promoting pornography or similar content that may be offensive to the original author and their audience, which may damage the original author's reputation.
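
The partial copying described in this section, in which only keyword-relevant portions of a post are kept, can be illustrated with a minimal sketch. It does not describe any particular scraping tool: the function name <code>select_keyword_paragraphs</code>, the sample post, and the address <code>example.com</code> are hypothetical, and paragraph-level keyword matching is assumed only for illustration.

```python
import re


def select_keyword_paragraphs(post_text, keywords):
    """Return only the paragraphs that mention at least one keyword.

    Hypothetical helper: paragraphs are assumed to be separated by blank lines.
    """
    wanted = {k.lower() for k in keywords}
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", post_text) if p.strip()]
    selected = []
    for paragraph in paragraphs:
        # Compare whole lowercase words so substrings do not match by accident.
        words = set(re.findall(r"[a-z0-9']+", paragraph.lower()))
        if words & wanted:
            selected.append(paragraph)
    return selected


# Invented sample post: body text, an author's tag with a link, more body text.
post = (
    "Our bakery now ships sourdough starter kits nationwide.\n\n"
    "Posted by Jane Doe, http://example.com/about\n\n"
    "Each sourdough kit includes flour and instructions."
)

excerpt = select_keyword_paragraphs(post, ["sourdough"])
for paragraph in excerpt:
    print(paragraph)
```

In this sample the author's tag contains no keyword, so it is dropped along with its link, which matches the observation above that partial copies lose the link back to the original author.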

==References==
<references />

== External links ==
*[http://tech.gaeatimes.com/index.php/archive/wordpress-plugin-to-automatically-add-copyright-message-to-your-rss-atom-feeds/2/ WordPress Feed Copyrighter Plugin]
*[https://web.archive.org/web/20061031104439/http://www.advancedbusinessblogging.com/businessblog/?p=194 Six Steps to Prevent Content Theft and Combat Copyright Infringement on Your Business Blog]
*[http://www.seonotepad.com/copywriting/defending-your-site-from-scrapers/ Defending your site from blog scrapers]
*Your website is exposed to the public if it is in [https://gsaserlist.net/ GSA SER lists]

[[Category:Blog software]]
[[Category:Web scraping]]

Revision as of 16:23, 21 February 2024

Redirect to:

  • From a merge: This is a redirect from a page that was merged into another page. This redirect was kept in order to preserve the edit history of this page after its content was merged into the content of the target page. Please do not remove the tag that generates this text (unless the need to recreate content on this page has been demonstrated) or delete this page.