Aaron Swartz Excerpts - Patrick Harris

Aaron Swartz Excerpts

Aaron Swartz played an important role in shaping the internet. He helped develop the web feed format RSS, contributed to the early technical architecture of Creative Commons, created the web development framework web.py, and was an early partner in Reddit, all before taking his own life at 26 years old.

If you are familiar with Aaron, you likely know that he committed suicide while facing up to 35 years in prison for downloading 4.7 million academic research articles from JSTOR, a digital library of academic journals and books.

A brief background on Aaron:

 

 

 

 

The Script Aaron Swartz Used to Download 4,700,000 Articles from JSTOR’s Archive

 

The downloading Aaron was prosecuted for was performed by a 23-line Python script called keepgrabbing.py.

Because MIT had free access to JSTOR’s database, Aaron used the university’s network to reach the digital library. He broke into a closet on MIT’s campus, connected to the network there, and ran the script:

 

 

 

The legal documents related to the case can be found here.

 

 

In an email exchange found in the legal documents, a JSTOR employee described how Aaron circumvented the “sessions by IP” rule and was able to download millions of documents without being stopped. Here is what was said:

 

 

“By clearing their cookies and starting new session they effectively dodge the abuse tools in Literatum…. The # of sessions per IP rule did not fire because it is on a server by server basis and the user was load balanced across more than few servers. 8500 sessions would only need two servers to dodge the rule. We can ratchet the # of sessions down but am requesting data to find an effective level that would have caught incident without disrupting normal users elsewhere. With our MDC and number of servers there may be no sweet spot that accomplishes both.”
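
The arithmetic behind that email can be sketched in a few lines. The limit used below is purely hypothetical (the email does not state the real value), and the function names are made up for this example. The point is only that a limit enforced separately on each server sees just a fraction of the sessions once a load balancer spreads them out.

```python
# Hypothetical per-server limit; the real value is not given in the email.
SESSIONS_PER_IP_LIMIT = 5000

def sessions_seen_per_server(total_sessions, num_servers):
    """Spread sessions round-robin and return the count each server sees."""
    counts = [0] * num_servers
    for i in range(total_sessions):
        counts[i % num_servers] += 1
    return counts

def rule_fires(total_sessions, num_servers, limit=SESSIONS_PER_IP_LIMIT):
    """The rule fires only if a single server sees more than `limit` sessions."""
    counts = sessions_seen_per_server(total_sessions, num_servers)
    return any(count > limit for count in counts)

print(rule_fires(8500, 1))  # True: one server sees all 8500 sessions
print(rule_fires(8500, 2))  # False: each server sees only 4250
```

Under this assumed limit, lowering the threshold far enough to catch the second case would also start flagging ordinary users behind a shared IP, which is the trade-off the email describes.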

 

 

Aaron knew that rules were in place to restrict session activity per IP and prevent mass downloads like this. He worked around them by sending a fresh, randomly generated cookie value with every request, which he built inside a Python lambda function.

 

 

In the opening lines, Aaron defined a function, getblocks(), that used the urllib module to fetch a page from a redacted website, stored the response in a variable, and split its contents to get a list of PDF files to download.
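
As a rough illustration of the splitting step, here is a minimal sketch that works on an already-fetched response body rather than touching the network. The function name parse_blocks and the sample text are made up for this example. The original URL is redacted, and the response is assumed here to be a whitespace-separated list of identifiers.

```python
def parse_blocks(body):
    """Split a response body into a list of document identifiers.

    Assumes (hypothetically) that the identifiers are separated by
    whitespace, for example one per line.
    """
    return body.split()

sample_body = "1001\n1002\n1003\n"  # made-up identifiers
print(parse_blocks(sample_body))  # ['1001', '1002', '1003']
```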

 

 

In his lambda function, Aaron generated a random number, converted it to a string, and dropped its first three characters (the leading “0.” and the first digit), using the remaining digits as the cookie value.

 

line = lambda x: ['curl'] + prefix + ['-H', "Cookie: TENACIOUS=" + str(random.random())[3:], '-o', 'pdfs/' + str(x) + '.pdf', "http://www.jstor.org/stable/pdfplus/" + str(x) + ".pdf?acceptTC=true"]
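
To make the slicing concrete, the sketch below restates the same idea as named Python 3 functions. The names random_cookie_value and build_curl_args, the placeholder URL on example.org, and the fixed sample value are assumptions for illustration. Nothing here runs curl; it only shows the argument list that would be produced.

```python
import random

def random_cookie_value(value=None):
    """Return a random float's digits after dropping the first three characters.

    str(0.123456) is '0.123456'; slicing with [3:] removes '0.' and the
    first digit, leaving '23456'.
    """
    if value is None:
        value = random.random()
    return str(value)[3:]

def build_curl_args(doc_id, prefix=(), value=None):
    """Build (but do not run) a curl argument list with a fresh cookie header."""
    return (
        ["curl"]
        + list(prefix)
        + ["-H", "Cookie: TENACIOUS=" + random_cookie_value(value)]
        + ["-o", "pdfs/" + str(doc_id) + ".pdf"]
        + ["http://example.org/" + str(doc_id) + ".pdf"]  # placeholder URL
    )

print(random_cookie_value(0.123456))  # 23456
print(build_curl_args(42, value=0.123456))
```

Because the cookie value is regenerated on every call, each request looks like it belongs to a new session.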

 

Then, inside the loop, he called subprocess.Popen on the output of the line lambda function.

 

subprocess.Popen(line(block)).wait()

 

The lambda function produces a list of command-line arguments, which subprocess.Popen uses to invoke the command-line utility curl. It is curl that performs the actual download.
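
The list-of-arguments pattern is easy to try with a harmless command. This sketch swaps curl for the Python interpreter itself (via sys.executable) so that it runs without network access. The helper name build_args is hypothetical.

```python
import subprocess
import sys

def build_args(message):
    """Return an argument list: the program first, then its arguments."""
    return [sys.executable, "-c", "print(" + repr(message) + ")"]

# Popen starts the process; wait() blocks until it exits and
# returns the exit code (0 on success).
exit_code = subprocess.Popen(build_args("hello")).wait()
print(exit_code)
```

Passing a list rather than a single string means no shell is involved, and each list element reaches the program as exactly one argument.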

Rest easy.
