
Mainly, Reinforcement Learning

A blog where a second-year information science undergraduate writes mostly about probability-related topics

BeautifulSoup is Very Delicious :] ①

Introduction

Hello. This article is about how to do web scraping with Beautiful Soup, which is one of Python's modules. I think it is not difficult, but if you don't know how to use it in the first place, there's nothing you can do with it, don't you think so?

How to install

$ sudo apt-get install python-bs4
or
$ sudo pip install beautifulsoup4

How to use

A program that uses BeautifulSoup first has to import the module. The following shows how to do it.

import urllib2  # Python 2 only; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

#newest_article = BeautifulSoup(invalid_html,'html.parser')
newest_article = urllib2.urlopen('http://reonreon3reon.hatenablog.com/entries/2014/09/15')

soup_packtpage = BeautifulSoup(newest_article,'html.parser')

While creating the BeautifulSoup object, other objects are also created, which include the following:
• Tag
• NavigableString
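
To see these two object types concretely, here is a small sketch. The HTML string is made up for illustration and is parsed from memory instead of being fetched from the blog, and the code is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, not taken from the blog page above
html = '<p><a href="http://example.com/">Hello</a></p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.a        # a Tag object representing the whole a element
text = tag.string   # a NavigableString holding the text inside the tag

print(type(tag).__name__)
print(type(text).__name__)
```

A NavigableString behaves like an ordinary Python string, so it can be compared with or converted to one.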

Wait...

import urllib2
from bs4 import BeautifulSoup

newest_article = urllib2.urlopen('http://reonreon3reon.hatenablog.com/entries/2014/09/15')
soup_packtpage = BeautifulSoup(newest_article,'html.parser')
print(soup_packtpage)
print("*"*100)
atag = soup_packtpage.a
print(atag)

There are a lot of a tags in the page, but atag holds only one... why?
The reason is that soup_packtpage.a is shorthand for soup_packtpage.find('a'), so what atag outputs is just the first a tag in the HTML.
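
The difference between getting one a tag and getting all of them can be sketched like this. The HTML is a hypothetical three-link list, not the real blog page, and the code assumes Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with three a tags
html = """
<ul>
  <li><a href="/first">first</a></li>
  <li><a href="/second">second</a></li>
  <li><a href="/third">third</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.a                   # attribute access returns the first a tag only
same = soup.find('a')            # explicit form of the same lookup
all_links = soup.find_all('a')   # every a tag, in document order

print(first)
print(first == same)
print(len(all_links))
```

So when every link is needed, find_all('a') is probably the method to reach for.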

Another feature: reading a tag's attributes

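#input
# soup_atag is assumed here to come from HTML whose first a tag links to http://www.packtpub.com
soup_atag = BeautifulSoup('<a href="http://www.packtpub.com">Home</a>', 'html.parser')
atag = soup_atag.a
print(atag['href'])

#output
http://www.packtpub.com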
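
Besides the dictionary-style access, a tag offers a few other ways to read its attributes and text. This is a minimal sketch for Python 3; the HTML is hypothetical and only its first link mirrors the output shown above.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second a tag deliberately has no href
html = '<a href="http://www.packtpub.com" id="home">Home</a><a>No link</a>'
soup = BeautifulSoup(html, 'html.parser')

first, second = soup.find_all('a')

print(first['href'])        # dictionary-style access
print(first.attrs)          # all attributes of the tag as a dict
print(first.string)         # the text inside the tag
print(second.get('href'))   # .get() gives None where ['href'] would raise KeyError
```

Using .get() is a cautious choice when scraping real pages, since not every a tag has an href.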

Other features: the search methods

• find()
• find_all()
• find_parent()
• find_parents()
• find_next_sibling()
• find_next_siblings()
• find_previous_sibling()
• find_previous_siblings()
• find_previous()
• find_all_previous()
• find_next()
• find_all_next()
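
A few of the search methods listed above can be sketched as follows. The HTML is a shortened, hypothetical snippet shaped like the lab HTML further down, and the code assumes Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, written without whitespace between the li and div tags
html = """
<ul id="pro">
  <li class="pro2"><div class="name">Ashigirl96</div><div class="number">2000</div></li>
  <li class="pro3"><div class="name">Reon</div><div class="number">3000</div></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

name = soup.find('div', class_='name')             # first div with class "name"
parent = name.find_parent('li')                    # nearest enclosing li
sibling = name.find_next_sibling('div')            # next div on the same level
next_name = name.find_next('div', class_='name')   # next match later in the document

print(parent['class'])
print(sibling.string)
print(next_name.string)
```

The plural variants such as find_parents() and find_next_siblings() take the same arguments but return every match instead of only the nearest one.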

To find the first producer, primary consumer, or secondary consumer, we can use Beautiful Soup search methods. In general, to find the first entry of any tag within a BeautifulSoup object, we can use the find() method.

Line 1:
css_class = soup.find(class_ = "primaryconsumers" )
Line 2:
css_class = soup.find(attrs={'class':'primaryconsumers'})
The preceding two code lines are equivalent.
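
Here is a small sketch that checks this on a made-up document. The class names follow the ones used in the text, but the HTML itself and the animal names are assumptions, and the code is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; only the class names come from the text above
html = """
<div class="producers">plants</div>
<div class="primaryconsumers">deer</div>
<div class="primaryconsumers">rabbit</div>
"""
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find(class_='primaryconsumers')           # class_ because class is a Python keyword
by_attrs = soup.find(attrs={'class': 'primaryconsumers'})   # the same search written as a dict
missing = soup.find(class_='secondaryconsumers')            # no such tag in this snippet

print(by_keyword is by_attrs)
print(by_keyword.string)
print(missing)
```

Note that find() returns None when nothing matches, so it is worth checking the result before using it.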

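all_tertiaryconsumers = soup.find_all(class_="tertiaryconsumerlist")

The preceding code line finds all the tags with the "tertiaryconsumerlist" class. If we check the type of this variable, we can see that it is a ResultSet, which is nothing but a list of tag objects, as follows:
print(type(all_tertiaryconsumers))
#output
<class 'bs4.element.ResultSet'>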
We can iterate through this list to display all tertiary consumer names by using the following code:

for tertiaryconsumer in all_tertiaryconsumers:
    print(tertiaryconsumer.div.string)
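
Put together as something that runs on its own, the idea might look like this. The HTML and the animal names are placeholders invented for this sketch, and the code assumes Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; each list item wraps the name in a div
html = """
<ul>
  <li class="tertiaryconsumerlist"><div class="name">lion</div></li>
  <li class="tertiaryconsumerlist"><div class="name">tiger</div></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all() collects every matching tag instead of only the first one
all_tertiaryconsumers = soup.find_all(class_='tertiaryconsumerlist')
print(type(all_tertiaryconsumers).__name__)

# .div picks the first div inside each item, .string reads its text
names = [tertiaryconsumer.div.string for tertiaryconsumer in all_tertiaryconsumers]
print(names)
```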

Lab

I set up some labs by myself. If you are interested in Beautiful Soup, please try the following labs :)


Lab1: pull out all of the a tags from the HTML (for example, http://reonreon3reon.hatenablog.com/entries/2014/09/15 )

Lab2: pull out ['href'] and the string from the first a tag.

Lab3: pull out the first div by its class name (see lab3.html below).

Lab4: get the email addresses (for example, the yahoo one).

Lab5: get the img from the site.



#lab3.html
<div class="echo">
	<ul id="pro">
		<li class="pro2">
			<div class="name">Ashigirl96</div>
			<div class="number">2000</div>
		</li>
		<li class="pro3">
			<div class="name">Reon</div>
			<div class="number">3000</div>
		</li>
	</ul>
</div>


#output
<div class="number">2000</div>

#Lab4
<div class="echo">
 <ul id="pro">
  <li class="pro2">
   <div class="name">Ashigirl96</div>
   <div class="number">2000</div>
  </li>
  <li class="pro3">
   <div class="name">Reon</div>
   <div class="number">3000</div>
   <div class="email1">ashigirl96@hoge.com</div>
   <div class="email1">Tengaman@yahoo.co.jp</div>
   <div class="email1">reonreon3reon@hatenablog.com</div>
  </li>
 </ul>
</div>

#Lab5

URL: http://g.e-hentai.org/s/bbf252b493/678537-1