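I want to scrape multiple pages by following a specific link. For example, I want to pick the link at a given position on the page and repeat that for a given number of iterations. After the user's input, the scraped result has to be appended to, or replace, the initial input URL. Here is what I have: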
import urllib
from bs4 import BeautifulSoup  # assumed import; the original snippet did not show it

#url = raw_input('Enter - ')
url = 'http://www.columbia.edu/kermit/k95.html'
itr = raw_input('Enter iteration: ')
i = int(itr)
n = raw_input('Enter Number: ')
n = int(n)

# The page is fetched and parsed only once, before the loop
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
tags = soup('a')
print 'Link:', url

while i > 0:
    i = i - 1
    if i == 0:
        break
    for tag in tags:
        me = tag.get('href', None)
        # Just to make sure the link/content match: print tag.contents[0]
    # Take the n-th anchor (1-based) from the same tag list every time
    link = tags[(n - 1)]
    #print link
    links = link.get('href', None)
    print 'Link:', links
Enter - http://www.columbia.edu/~fdc/
Enter count: 4
Enter Position: 9
Link: http://www.columbia.edu/~fdc/
Link: http://www.columbia.edu/kermit/k95.html
Link: http://www.columbia.edu/kermit/k95.html (Should be k95faq.html)
Link: http://www.columbia.edu/kermit/k95.html (Should be ckfaq.html)
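I get the number of iterations I want and the specific link, but I need the first URL (the user input) to be replaced on each iteration by the link stored in the variable `links`.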
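As an example, the user enters a URL such as http://www.columbia.edu/~fdc/, asks for the 9th link on the page, and asks for 4 iterations. The first iteration returns http://www.columbia.edu/kermit/k95.html as `links`. I want the second iteration to give me the 9th link on the page that `links` points to, which should be k95faq.html.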
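To make the intended behaviour concrete, here is a rough sketch of the loop I am after, where the page is fetched and parsed again on every pass and the URL is replaced by the link that was just found. It is written for Python 3 and uses the standard library's `HTMLParser` in place of BeautifulSoup so that it runs without extra packages. The names `LinkCollector`, `fetch`, `follow_links` and the `fetch_page` parameter are made up for this illustration, and `count` here means the number of links followed (the starting URL is not counted).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order (None if missing)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Keep anchors without an href too, so positions match soup('a').
            self.hrefs.append(dict(attrs).get('href'))


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, count, position, fetch_page=fetch):
    """Follow the link at `position` (1-based) `count` times.

    Returns the list of URLs visited, starting with `start_url`.
    `fetch_page` is a hypothetical hook so the download can be swapped out.
    """
    visited = [start_url]
    url = start_url
    for _ in range(count):
        parser = LinkCollector()
        parser.feed(fetch_page(url))  # re-fetch and re-parse the current page
        href = parser.hrefs[position - 1]
        if href is None:
            raise ValueError('anchor %d on %s has no href' % (position, url))
        # Resolve relative links such as 'k95faq.html' against the current
        # URL, then use the result as the page for the next pass.
        url = urljoin(url, href)
        visited.append(url)
    return visited
```

It would be called roughly like this, printing one `Link:` line per visited page:

```python
for link in follow_links('http://www.columbia.edu/~fdc/', 3, 9):
    print('Link:', link)
```

The difference from my code above is that the fetch and parse steps sit inside the loop, so the list of anchors changes on each pass instead of staying fixed to the first page.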