We’ve been asked a number of times how our website works. Trying to explain it usually gets a blank face. Instead, we’re going to do a post on how web crawlers can be used for open source research (ok, and how our website works).
Health Warning: Very technical. Requires knowledge of PHP and Python. If you don’t have it, keep reading and you may still get an idea of how search engines work (or ours at least). We strongly recommend you read our previous posts first (Does a websites know you?, Can you find my hidden email address? and New Search engine on the blog). They will give you a grounding in how the manual techniques work, so when you start reading our code, it will click!
What is a Web-Crawler?
First, what is a web crawler? (as per Wikipedia https://en.wikipedia.org/wiki/Web_crawler).
“A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter.”
Types of Crawlers?
This explanation is in no way official, but from how our site works, we believe there are three distinct types of crawler bot. First, there are crawlers which are extremely sophisticated and crawl, index and remember where they’ve been (Google, for example). Second, there are directional crawlers. These are more specific and have a particular task that needs completing. They are still very sophisticated, but designed for a handful of specific, targeted results. Usersearch.org has been built mostly on directional crawlers. We’ve built the crawlers from the ground up, starting from a blank page. Our web crawlers have been specifically designed to find user-names, pseudo names, email addresses, phone numbers and website social stats across approximately 500 social networks and forums. And finally, there are omni-directional crawlers, which are extremely specific with no wiggle room, a bit like accessing an API with very few moving parts. We use a few of these too.
If you’ve read our previous posts (Does a websites know you?, Can you find my hidden email address? and New Search engine on the blog) you will see that the manual process of working out whether a website knows a particular email address is really not that difficult. But if you start doing that process manually on 10, 15, 20 or 100+ websites, it gets boring, fast. The solution, of course, is to build a directional web crawler (our own definition! It’s a bit like a directional satellite, where a line of sight must exist for two satellites to communicate). Our directional web crawlers know exactly where to go, where to look, what to do and where to put it. Of course, we need to put some safety measures in place for various conditions, such as an unexpected change in the targeted web page or a page that is temporarily not responding. But that’s just part of the fun. We also need to let the crawler know what to do should some data get into the system that it may not have expected (such as a space between two words, like Fred Hammer rather than Fred_Hammer).
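Here is a minimal sketch of two such safety measures: tidying up a user-name before it is sent anywhere, and retrying a page that is temporarily not responding. The names normalise_username and with_retries are made up for illustration, and catching OSError is an assumption (urllib’s URLError is a subclass of it); a real crawler may need to catch different errors.

```python
import time


def normalise_username(raw):
    """Collapse any whitespace and join the parts with underscores."""
    return "_".join(raw.split())


def with_retries(action, attempts=3, delay_seconds=0.0):
    """Call action() until it succeeds or the attempts run out.

    Returns the result of action(), or None if every attempt raised OSError.
    """
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            # Network-style failures (timeouts, refused connections) land here.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    return None


if __name__ == "__main__":
    print(normalise_username("Fred Hammer"))
```

The crawler step itself is passed in as a function, so the same retry wrapper can be put around any page fetch or form submission.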
Basic Python Modules for web-crawling
So, we are not going to cover how to install Python or how to test the modules. We’re hoping you already know this (if you don’t, we can do a future post if you ask us). We’re jumping right in.
Good web-crawler technologies:
Scrapy we don’t like too much, as it tries to do everything for you. We’d class this as a generic web crawler (not directional / omni-directional). So it’s not much use for us, but it’s good for mass web-crawling projects. Selenium is great if you want a ‘point and click’ interface (omni-directional). You can build a little program in a matter of minutes that will do simple actions like enter your credentials into a web-based email account, sign in and send an email repeatedly. It’s pretty cool, as you actually see the actions taking place (mouse movement, Firefox opening, page loading, email being typed etc). Good for presentations. Mechanize (omni and directional) is a module that allows you to interact with websites much like Selenium, but in the background. This means you can multi-process thousands of iterations at the same time, independently of each other (we use mechanize). You can then take the data captured by the crawler and use the power of Python to work with the data or store it in a database. It’s probably the best free solution on the market, if you can use Python.
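To give a feel for running many checks side by side, here is a minimal sketch. Note that it uses a thread pool from Python’s standard library rather than separate processes, which is an assumption that tends to suit work that mostly waits on the network. run_checks and fake_check are hypothetical names, and fake_check only stands in for a real crawler run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks(check, site_names, max_workers=8):
    """Run check(site_name) for every site in parallel.

    Returns a dict mapping each site name to whatever check returned for it.
    """
    site_names = list(site_names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the results in the same order as the inputs.
        results = list(pool.map(check, site_names))
    return dict(zip(site_names, results))


def fake_check(site_name):
    """Placeholder for a real crawler run; it only reports the name length."""
    return len(site_name)


if __name__ == "__main__":
    print(run_checks(fake_check, ["forum_one", "forum_two", "network_three"]))
```

Each call is independent of the others, so one slow site does not hold up the rest beyond the size of the pool.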
Python Mechanize Basics
Mechanize can browse to a web page, access a specific web form, enter details into that form and submit it. It can then take the result of that form submission and do something else with it, whether that is storing the result in a database or filling in the next page of a form and continuing.
The Python code below starts mechanize by creating a browser object (br = mechanize.Browser()) and then opens a website (response = br.open('some_site')). The program then lists all the links on that page (br.links()).
From here it starts a ‘for’ loop that iterates through each link found on that page, opens that link and lists all the links of the opened page. There you have it: in under 10 lines you have a web-crawler that will open a webpage, crawl all the links on that page and then continue on to the links of each page it opened. It’s basically walking through every page linked from a particular page.
import mechanize

br = mechanize.Browser()  # creates the browser object
response = br.open('some_site')  # opens the site and puts the result in the 'response' variable
current_links = list(br.links())  # list the links on the page
for link in current_links:
    br.follow_link(link)  # opens the link
    sub_links = list(br.links())  # get the links from the opened page
    for sub_link in sub_links:  # opens the next lot of links
        br.follow_link(sub_link)  # follow the next lot of links
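The loop above will happily open the same page more than once. As a rough sketch of a crawler that remembers where it has been, the version below keeps a set of seen URLs and stops at a fixed depth. It uses only Python’s standard library rather than mechanize, and the page-fetching function is passed in, so it can be tried on a made-up, in-memory site. LinkParser, extract_links, crawl and the example.test URLs are all hypothetical names for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(base_url, html):
    """Return the absolute URL of every link in the given HTML."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl; returns the URLs in the order they were visited.

    fetch(url) must return the page HTML as a string, or None on failure.
    """
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page missing or not responding: skip it
        visited.append(url)
        if depth == max_depth:
            continue  # deep enough: do not queue this page's links
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    # A tiny made-up site held in a dict, so no network is needed.
    pages = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": '<a href="/">home</a> <a href="/c">C</a>',
        "http://example.test/b": '<a href="/a">A</a>',
        "http://example.test/c": "the end",
    }
    print(crawl("http://example.test/", pages.get))
```

To point it at a real site, the fetch argument could be any function that downloads a URL and returns its text.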
Mechanize Cheat sheet
The above may be a little confusing, so below is a step-by-step guide to creating a crawler that will enter some details into a form and then submit it. As you can see, you need some HTML knowledge to locate the form variable names.
- Create a browser object and give it some optional settings.
import mechanize

br = mechanize.Browser()
# br.set_all_readonly(False) allows everything to be written to, but it is a
# form method, so call it only after a form has been selected (see below)
br.set_handle_robots(False)  # ignore robots
br.set_handle_refresh(False)  # can sometimes hang without this
br.addheaders = [('User-agent', 'Firefox')]
- Open a webpage and inspect its contents
response = br.open(url)
print(response.read())  # the text of the page
response1 = br.response()  # get the response again
print(response1.read())  # can apply lxml.html.fromstring()
- List the forms that are in the page
for form in br.forms():
    print("Form name:", form.name)
    print(form)
- To go on, the mechanize browser object must have a form selected
br.select_form("form1")  # works when form has a name
br.form = list(br.forms())[0]  # use when form is unnamed
- Iterate through the controls in the form.
for control in br.form.controls:
    print(control)
    print("type=%s, name=%s value=%s" % (control.type, control.name, br[control.name]))
- Controls can be found by name
control = br.form.find_control("controlname")
- Having a select control tells you what values can be selected
if control.type == "select":  # means it's class ClientForm.SelectControl
    for item in control.items:
        print(" name=%s values=%s" % (item.name, str([label.text for label in item.get_labels()])))
- Because ‘Select’ type controls can have multiple selections, they must be set with a list, even if it is one element.
print(control.value)
print(control)  # selected value is starred
control.value = ["ItemName"]
print(control)
br[control.name] = ["ItemName"]  # equivalent and more normal
- Controls can be set to readonly and disabled.
control.readonly = False
control.disabled = True
- OR disable all the submit controls like so
for control in br.form.controls:
    if control.type == "submit":
        control.disabled = True
- When your form is complete you can submit
response = br.submit()
print(response.read())
br.back()  # go back
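The cheat sheet above depends on knowing the form and control names on a page. If mechanize isn’t to hand, a rough way to list them from raw HTML is sketched below using only Python’s standard library. FormLister and list_forms are made-up names, and the sample HTML is invented for the example.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Records each form's name and the controls found inside it."""

    CONTROL_TAGS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.forms = []  # one dict per <form>, in page order
        self._current = None  # the form being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"name": attrs.get("name"), "controls": []}
            self.forms.append(self._current)
        elif tag in self.CONTROL_TAGS and self._current is not None:
            # Store (type, name); controls without a type use the tag name.
            control = (attrs.get("type", tag), attrs.get("name"))
            self._current["controls"].append(control)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def list_forms(html):
    """Return a list of {'name': ..., 'controls': [...]} dicts for the page."""
    lister = FormLister()
    lister.feed(html)
    return lister.forms


if __name__ == "__main__":
    sample = (
        '<form name="form1">'
        '<input type="text" name="email">'
        '<input type="submit" name="go">'
        "</form>"
    )
    for number, form in enumerate(list_forms(sample)):
        print(number, form["name"], form["controls"])
```

The position of each form in the returned list is the number you would use to pick out an unnamed form.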
Below is an example we’ve created. It’s not part of our website, but it works just fine. This crawler is designed to jump through several web-forms, filling out data to cause a result at the end (Hint: https://usersearch.org/blog/index.php/2015/09/28/does-a-dating-website-know-you/). It’s only lightly commented, I’m afraid, but it’s self-explanatory if you’ve read the cheat sheet above.
import mechanize


def email_check_complex(email, location, site_name, form_selection, search_term,
                        number_of_inputs, input_one, input_two, input_three,
                        success_value, page_jump_through):
    # Click-through site check. Only email, location, form_selection,
    # input_three and page_jump_through are used in this excerpt.
    # Browser
    site = mechanize.Browser(factory=mechanize.RobustFactory())
    # Browser options
    site.set_handle_equiv(True)
    site.set_handle_gzip(False)
    site.set_handle_redirect(True)
    site.set_handle_referer(True)
    site.set_handle_robots(False)
    # Follows refresh 0 but does not hang on refresh > 0
    site.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=5)
    # User-Agent
    site.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    site.open(location)  # Open site
    if not 1 <= page_jump_through <= 4:
        return None  # only 1 to 4 pages are handled
    # Submit the earlier forms unchanged to jump through to the last page
    for _ in range(page_jump_through - 1):
        site.select_form(nr=form_selection)  # select form by number
        site.submit()
    # Last page: fill in the email field and submit
    site.select_form(nr=form_selection)
    site.form[input_three] = email
    return site.submit()  # hand the final response back to the caller
The above script has sent some commands to a website form, and the response will be the result of that submission (in our case, a result saying ‘Email already registered’). Now you need to build something to retrieve this response and do something clever with it (this is where your OSINT skills come in handy).
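As a very rough starting point, checking a response for a success phrase could look like the sketch below. response_matches is a made-up name, the UTF-8 default is an assumption, and a plain case-insensitive substring test is only one possible approach; real pages may need something smarter.

```python
def response_matches(body, success_value, encoding="utf-8"):
    """Return True if success_value appears in the page body, ignoring case.

    body may be text or raw bytes (as returned by reading a response).
    """
    if isinstance(body, bytes):
        body = body.decode(encoding, errors="replace")
    return success_value.lower() in body.lower()


if __name__ == "__main__":
    page = b"<p>Sorry: Email already registered</p>"
    print(response_matches(page, "email already registered"))
```

Here the phrase would be whatever the target site prints when it recognises the address, such as the ‘Email already registered’ message mentioned above.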
Now, if you’ve read this far and only 5-10% has sunk in, then well done, we’re happy. If you’ve reached this far and, like us, web-crawling makes you want to get up in the morning and code… you may want to read our next post. We’ll build on the examples we’ve shown you and put some code together on how you can retrieve the final response, do some snazzy stuff with it and check it for particular keywords that will determine whether your expected email exists at a given location or not. THEN you can continue and even automate what you would do next (you’re making an auto open source searcher, well done you!). Who needs a team when you can code!
Give it a go yourself and compare your code with ours next week!
And that’s all we have time for I’m afraid. Any questions, just post / email and we’ll try and answer.