Zoklet.net

#1
07-19-2009, 09:19 PM
zuperxtreme
Destroyer of worlds
 
Join Date: Jan 2009
Location: Buenos Aires, Argentina.
Thanks: 1,369
Thanked 1,990 Times in 1,128 Posts
Crawling/harvesting links

How would I go about crawling one or more websites and harvesting URLs?

How would I then organize them? Say, by the domain/subdomain each was taken from, and maybe by reading the title of the page too?

Just curious.
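
One way to approach the question above is with Python's standard library, sketched here. The class name `LinkAndTitleParser` and the function `harvest` are made up for illustration, and the sketch assumes the page's HTML has already been downloaded into a string.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTitleParser(HTMLParser):
    """Collects every <a href> target and the text inside <title>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def harvest(html, base_url):
    """Return (page title, {host: sorted unique URLs}) for one page."""
    parser = LinkAndTitleParser(base_url)
    parser.feed(html)
    by_host = defaultdict(set)
    for url in parser.links:
        host = urlparse(url).netloc
        if host:  # skip mailto:, javascript: and similar links
            by_host[host].add(url)
    return parser.title.strip(), {h: sorted(u) for h, u in by_host.items()}
```

To turn this into a crawler, the URLs returned for the starting host could be fetched in turn (for example with `urllib.request.urlopen`) and passed back through `harvest`, keeping a set of already visited URLs so that no page is fetched twice.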

#2
07-19-2009, 10:22 PM
eSparq
Member
 
Join Date: Feb 2009
Thanks: 2
Thanked 23 Times in 14 Posts
Re: Crawling/harvesting links

HTTrack/WinHTTrack (http://www.httrack.com/)?

It crawls websites and downloads them, organizing them (optionally) in one of several possible structures, and creates an index page of sites you've crawled. It can also be configured to ignore robots.txt files, BTW.

If that's not quite what you're wanting, you can modify the source code since it's GPL (assuming you know how to program in C).

#3
07-19-2009, 10:35 PM
zuperxtreme
Re: Crawling/harvesting links

Hah, I just finished installing that to download a web page that I need for offline browsing.

No, but what I mean is crawling a website, like, say:

http://taringa.net/posts/juegos/2993...ENG_2009).html

And gathering all the RapidShare, Megaupload, etc. links, then saving them in a text file (or whatever) titled "Hunting Unlimited 2010" (taken from the page's title).

Then crawling another website, say:
http://taringa.net/posts/juegos/1554...D%5BRS%5D.html
or
http://taringa.net/posts/juegos/2993...ited-2010.html

Which are both about the same thing, but might have different mirrors.

I would then like to merge them into the same text file (because they share a title) and sort the links.

Make sense? One text file about the same thing, with links from the 3 pages, all of them different mirrors.

EDIT: No knowledge of C, unfortunately...
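
The merge step described in this post could look roughly like the Python sketch below. Everything named here (`FILE_HOSTS`, `title_key`, `merge_pages`, `write_lists`) is hypothetical, and the title matching is an assumption: it only ignores case and punctuation, so pages whose titles differ by more than that would need fuzzier matching.

```python
import re
from collections import defaultdict
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical list of file hosts to keep; extend as needed.
FILE_HOSTS = ("rapidshare.com", "megaupload.com")


def is_mirror_link(url):
    """True if the URL's host is one of FILE_HOSTS or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    return any(host == h or host.endswith("." + h) for h in FILE_HOSTS)


def title_key(title):
    """Crude normalisation so titles differing only in case or punctuation match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip() or "untitled"


def merge_pages(pages):
    """pages: iterable of (title, list of URLs).

    Returns {normalised title: sorted unique mirror links}.
    """
    merged = defaultdict(set)
    for title, urls in pages:
        merged[title_key(title)].update(u for u in urls if is_mirror_link(u))
    return {key: sorted(links) for key, links in merged.items()}


def write_lists(merged, out_dir):
    """Write one text file per title, one link per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for key, links in merged.items():
        path = out_dir / (key.replace(" ", "_") + ".txt")
        path.write_text("\n".join(links) + "\n", encoding="utf-8")
```

Using a set per title means a mirror that appears on several pages is written only once, and sorting happens a single time at the end.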

#4
07-19-2009, 11:02 PM
Dr Gonzo
Wealthy Merchant
 
Join Date: Jan 2009
Thanks: 37
Thanked 21 Times in 20 Posts
Re: Crawling/harvesting links

If you can't find one, PM me. I'm pretty sure I could modify HTTrack for that purpose, and I'm interested in building something like this myself.

It should be possible. The links are in the page's source code, so as long as you specify the actual URL, I imagine this could be accomplished with some regex functions.

I was going to build an app for this, but got sidetracked with other projects...
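
For what it's worth, the regex idea could be sketched in Python like this. The patterns and the function name `scrape` are assumptions for illustration: they only catch absolute http(s) URLs on the two hosts named in the thread, and a regex is a rough tool for HTML, so unusual markup may slip through.

```python
import re

# Assumed pattern: absolute http(s) URLs on rapidshare.com or megaupload.com,
# including subdomains such as www.
MIRROR_RE = re.compile(
    r"""https?://(?:[\w-]+\.)*(?:rapidshare|megaupload)\.com/[^\s"'<>]*""",
    re.IGNORECASE,
)
# Grabs whatever sits between <title> and </title>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def scrape(source):
    """Return (title, sorted unique mirror links) found in raw page source."""
    match = TITLE_RE.search(source)
    title = match.group(1).strip() if match else ""
    links = sorted(set(MIRROR_RE.findall(source)))
    return title, links
```

Because the pattern runs over the raw source, it also picks up links pasted as plain text rather than wrapped in an anchor tag, which is common on pages that list download mirrors.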

#5
07-20-2009, 10:58 PM
zuperxtreme
Re: Crawling/harvesting links

Hmm, so far no luck in finding anything already made.

I came across Xenu's Link Sleuth, which does one hell of a job of finding links and checking whether they are broken. It can export its results as text, which could then be searched for the links I want. I guess.

#6
07-21-2009, 08:36 AM
Dr Gonzo
Re: Crawling/harvesting links

http://www.worldminer.com/webcustomer.htm

You tried this?

#7
07-21-2009, 09:18 PM
zuperxtreme
Re: Crawling/harvesting links

Looks promising. I'll check it out a bit, thanks.