
Step 1: Use the browser's developer tools to inspect the connection with the server: these are the header fields that IE sends to the server together with the URL. To avoid being rejected by the server, the Python program prepares the same header information in advance.
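For illustration, the header fields captured from the browser end up in Python roughly like this (the values below are the same mobile-browser headers used in the source code later in this article):

headers = [
    ('User-Agent', 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'),
    ('Referer', 'http://s.m.taobao.com'),
    ('Host', 'h5.m.taobao.com'),
]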

Step 2: Generate the Taobao search URLs and crawl the data needed for DGBB, such as location, product name, store name and sales quantity. If the program needs to dig deeper, the URLs are kept in a list (in Excel or JSON); a small sketch follows.
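As a minimal sketch (the helper name make_search_url and the example keyword are only illustrations), the search URL can be generated from the keyword and page number with urllib.parse.urlencode, using a subset of the query fields shown in the main program below:

import urllib.parse

def make_search_url(keyword, page):
    # Build one Taobao mobile-search URL for the given keyword and page number.
    query = urllib.parse.urlencode({'q': keyword, 'n': 20, 'm': 'api4h5', 'page': page})
    return 'http://s.m.taobao.com/search?' + query

# Keep the URLs to crawl in a list.
urls = [make_search_url('DGBB', page) for page in range(0, 3)]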

Step 3: While there are still unfinished URLs, the program takes one URL and downloads its page. Any new URLs found are appended to the URL list if they are not already included. The program parses each page into a JSON structure and writes the necessary fields to the Excel file; the loop is sketched below.
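The logic of Step 3 can be sketched as a simple URL queue with de-duplication. Here download_page and extract_urls stand for the downloading and link-extraction steps; they are passed in as functions because they are not part of this sketch:

import json

def crawl(start_urls, download_page, extract_urls):
    todo = list(start_urls)          # URLs still to be processed
    seen = set(todo)                 # URLs already included in the list
    rows = []
    while todo:                      # "while the URLs are not all finished"
        url = todo.pop(0)
        page_bytes = download_page(url)
        data = json.loads(page_bytes.decode('utf-8', 'ignore'))
        rows.append(data)            # the fields are later written to Excel
        for new_url in extract_urls(data):
            if new_url not in seen:  # append only URLs not included yet
                seen.add(new_url)
                todo.append(new_url)
    return rows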

Step 4: Create the sales report with Python.
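A minimal sketch of such a report, assuming each crawled item keeps the 'location' and 'sold' fields returned by the Taobao JSON and that 'sold' is numeric, could total the units sold per shipping location:

from collections import Counter

def sales_by_location(items):
    totals = Counter()
    for item in items:
        totals[item['location']] += int(item['sold'])
    return totals.most_common()      # locations ranked by units sold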

3) Python source code.

Step 1: Prerequisites of the program.

1. The Excel package xlsxwriter. Install it with:

pip3 install xlsxwriter

2. If you want to store the images, include an image-handling package as well. My installation of it failed, so I skip the product pictures.

Step 2: Define a function to download the pages.

import http.cookiejar
import urllib.parse
import urllib.request

def getHtml(url, pro='', postdata={}):
    # Download the html; cookies are supported via a cookie jar file.
    # First argument is the url, the second an optional proxy ("host:port"),
    # the third the POST data.
    filename = 'cookie.txt'
    # Declare a MozillaCookieJar object backed by that file.
    cj = http.cookiejar.MozillaCookieJar(filename)
    proxy_support = urllib.request.ProxyHandler({'http': 'http://' + pro})
    # Build the opener; add the proxy handler only when a proxy was given.
    if pro:
        opener = urllib.request.build_opener(proxy_support, urllib.request.HTTPCookieProcessor(cj))
    else:
        opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # Send browser-like header information so Taobao's server accepts the request.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5'),
                         ('Referer', 'http://s.m.taobao.com'),
                         ('Host', 'h5.m.taobao.com')]
    # Open the url with this opener.
    urllib.request.install_opener(opener)
    if postdata:
        postdata = urllib.parse.urlencode(postdata)
        html_bytes = urllib.request.urlopen(url, postdata.encode()).read()
    else:
        html_bytes = urllib.request.urlopen(url).read()
    cj.save(ignore_discard=True, ignore_expires=True)
    return html_bytes
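For example (the query string here is only an illustration of the search URL built in the main program):

page_bytes = getHtml('http://s.m.taobao.com/search?m=api4h5&q=DGBB&page=0')
print(len(page_bytes))   # size of the raw JSON response returned by the mobile search API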

Step 3: Define a function to write the data to an Excel file.

import xlsxwriter as wx

def writeexcel(path, dealcontent):
    # Write the crawled rows (a list of lists of strings) to the Excel file at path.
    workbook = wx.Workbook(path)
    worksheet = workbook.add_worksheet()
    for i in range(0, len(dealcontent)):
        for j in range(0, len(dealcontent[i])):
            if i != 0 and j == len(dealcontent[i]) - 1:
                # The last column of every data row holds an image path; try to embed it.
                if dealcontent[i][j] == '':
                    worksheet.write(i, j, ' ')
                else:
                    try:
                        worksheet.insert_image(i, j, dealcontent[i][j])
                    except Exception:
                        worksheet.write(i, j, ' ')
            else:
                if dealcontent[i][j]:
                    worksheet.write(i, j, dealcontent[i][j].replace(' ', ''))
                else:
                    worksheet.write(i, j, '')
    workbook.close()
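A quick check with a two-row table (the first row is treated as the header; the last column of a data row is an image path and may be left empty):

writeexcel('test.xlsx', [['Store name', 'Units sold', 'Image'],
                         ['Some store', '123', '']])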

Step 4: Write a main program.
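The main program below also relies on three helpers, createjia, listfiles and timetochina, whose definitions are not included in this excerpt. The sketches that follow are assumptions about their behaviour, not the author's code:

import glob
import os

def createjia(path):
    # Create the output folder if it does not exist yet (assumed behaviour).
    os.makedirs(path, exist_ok=True)

def listfiles(dirname, suffix):
    # Return the files under dirname ending with the given suffix (assumed behaviour).
    return glob.glob(os.path.join(dirname, '*' + suffix))

def timetochina(seconds):
    # Format an elapsed time in seconds as a readable string (assumed behaviour).
    return '%d min %.2f sec' % (int(seconds) // 60, seconds % 60)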

import json
import time

def begin():
    today = time.strftime('%Y%m%d', time.localtime())
    a = time.perf_counter()
    keyword = input('Key words:')
    sort = input('Sort by sales 1, sort by price ascending 2, sort by price descending 3, sort by credit 4, sort by overall 5:')
    try:
        pages = int(input('Pages to crawl (default 100 pages):'))
        if pages > 100 or pages <= 0:
            print('Page number should be in 1 to 100')
            pages = 100
    except:
        pages = 100
    try:
        man = int(input('Seconds to suspend between pages, default 4 seconds (4):'))
        if man <= 0:
            man = 4
    except:
        man = 4

    if sort == '1':
        sortss = '_sale'
    elif sort == '2':
        sortss = 'bid'
    elif sort == '3':
        sortss = '_bid'
    elif sort == '4':
        sortss = '_ratesum'
    elif sort == '5':
        sortss = ''
    else:
        sortss = '_sale'
    namess = time.strftime('%Y%m%d%H%S', time.localtime())
    # Output folders for the raw JSON pages, the Excel report and the images.
    root = '../data/' + today + '/' + namess + keyword
    roota = '../excel/' + today
    mulu = '../image/' + today + '/' + namess + keyword
    createjia(root)
    createjia(roota)
    for page in range(0, pages):
        time.sleep(man)
        print('Suspend ' + str(man) + ' seconds')

        if sortss == '':
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': 'provide the search',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'page': page
            }

        else:
            postdata = {
                'event_submit_do_new_search_auction': 1,
                'search': 'provide the search',
                '_input_charset': 'utf-8',
                'topSearch': 1,
                'atype': 'b',
                'searchfrom': 1,
                'action': 'home:redirect_app_action',
                'from': 1,
                'q': keyword,
                'sst': 1,
                'n': 20,
                'buying': 'buyitnow',
                'm': 'api4h5',
                'abtest': 16,
                'wlsort': 16,
                'style': 'list',
                'closeModues': 'nav,selecthot,onesearch',
                'sort': sortss,
                'page': page
            }

        postdata = urllib.parse.urlencode(postdata)
        taobao = "http://s.m.taobao.com/search?" + postdata
        print(taobao)
        try:
            content1 = getHtml(taobao)
            file = open(root + '/' + str(page) + '.json', 'wb')
            file.write(content1)
            file.close()
        except Exception as e:
            if hasattr(e, 'code'):
                print('Pages not exist or timeout.')
                print('Error code:', e.code)
            elif hasattr(e, 'reason'):
                print("Can't connect the server.")
                print('Reason:  ', e.reason)
            else:
                print(e)
    files = listfiles(root, '.json')
    total = []
    # Header row of the Excel report.
    total.append(['Page', 'Store name', 'Product title', 'Discounted price', 'Shipping location',
                  'Comment count', 'Original price', 'Units sold', 'Promotion type', 'Number of payers',
                  'Coin discount', 'URL', 'Image URL', 'Image'])

    for filename in files:
        try:
            doc = open(filename, 'rb')
            doccontent = doc.read().decode('utf-8', 'ignore')
            product = doccontent.replace(' ', '').replace('\n', '')
            product = json.loads(product)
            onefile = product['listItem']
        except:
            print("Can't get files" + filename)
            continue
        for item in onefile:
            itemlist = [filename, item['nick'], item['title'], item['price'], item['location'], item['commentCount']]
            itemlist.append(item['originalPrice'])
            # itemlist.append(item['mobileDiscount'])
            itemlist.append(item['sold'])
            itemlist.append(item['zkType'])
            itemlist.append(item['act'])
            itemlist.append(item['coinLimit'])
            itemlist.append('http:' + item['url'])
            total.append(itemlist)

    if len(total) > 1:
        writeexcel(roota + '/' + namess + keyword + 'taobao.xlsx', total)
    else:
        print('nothing got from server')
    b = time.perf_counter()
    print('run time:' + timetochina(b - a))


if __name__ == '__main__':
    begin()

 



Author: 哪儿黑
Link: https://www.jianshu.com/p/ffb65035fabc
Source: Jianshu
The copyright belongs to the author. For commercial reposting, please contact the author for authorization; for non-commercial reposting, please credit the source.