I have a table with about 50,000 records. It looks like this:
Animal | Name   | Color | Legs
Cat    | George | Black | 4
Cat    | Bob    | Brown | 4
Cat    | Dil    | Brown | 4
Bird   | Irv    | Green | 2
Bird   | Van    | Red   | 2
etc.
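I want to insert Cat only once, Bird only once, and so on. The name, color, legs, etc. should be the first value found for that animal.
The table has 10 columns and 50k rows.
I tried
insert into MyNewTable Select Distinct * From MyAnimalTable
but that did not work. I also tried group by, but that did not help either.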
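Answer 0 (score: 3)
You can group by the animal name only and select the remaining columns with Max(), which gives exactly one row per animal. Note that Max() returns the largest value of each column within the group, not the first row found, so the values in one output row can come from different source rows.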
insert into MyNewTable
Select MAT.Animal,max(MAT.Name),max(MAT.Color),max(MAT.Legs)
From MyAnimalTable MAT GROUP BY MAT.Animal
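To see what the GROUP BY / Max() query actually returns, here is a minimal sketch that runs it against the sample rows from the question. It assumes an in-memory SQLite database through Python's sqlite3 module, which the thread does not mention; the column types and the function name dedupe_with_max are illustrative, not from the original.

```python
import sqlite3

# Sample rows copied from the question.
SAMPLE = [
    ("Cat", "George", "Black", 4),
    ("Cat", "Bob", "Brown", 4),
    ("Cat", "Dil", "Brown", 4),
    ("Bird", "Irv", "Green", 2),
    ("Bird", "Van", "Red", 2),
]


def dedupe_with_max(rows):
    """Run the GROUP BY / MAX() insert and return the new table's rows."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real DBMS
    conn.execute(
        "CREATE TABLE MyAnimalTable (Animal TEXT, Name TEXT, Color TEXT, Legs INTEGER)"
    )
    conn.execute(
        "CREATE TABLE MyNewTable (Animal TEXT, Name TEXT, Color TEXT, Legs INTEGER)"
    )
    conn.executemany("INSERT INTO MyAnimalTable VALUES (?, ?, ?, ?)", rows)
    # Same statement as in the answer above.
    conn.execute(
        """
        INSERT INTO MyNewTable
        SELECT MAT.Animal, MAX(MAT.Name), MAX(MAT.Color), MAX(MAT.Legs)
        FROM MyAnimalTable MAT
        GROUP BY MAT.Animal
        """
    )
    result = conn.execute(
        "SELECT Animal, Name, Color, Legs FROM MyNewTable ORDER BY Animal"
    ).fetchall()
    conn.close()
    return result


max_result = dedupe_with_max(SAMPLE)
for row in max_result:
    print(row)
# ('Bird', 'Van', 'Red', 2)
# ('Cat', 'George', 'Brown', 4)
```

The Cat row combines George's name with the color Brown, although George is Black in the source table. If each output row has to be a real source row, a per-row technique such as ROW_NUMBER is the safer choice.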
Answer 1 (score: 2)
Use ROW_NUMBER to number the rows of each animal, and keep only the rows numbered 1.
insert into mynewtable (animal, name, color, legs)
select animal, name, color, legs
from
(
select
animal, name, color, legs,
row_number() over (partition by animal order by animal) as rn
from myanimaltable a
) numbered
where rn = 1;
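(This numbers the records of each animal arbitrarily, because ordering by the partition column itself does not distinguish rows within a partition. So you get, per animal, the first record "the DBMS finds". If you want a particular order, you must specify it in the ORDER BY clause that follows PARTITION BY.)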
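As a sketch of the ROW_NUMBER approach with a deterministic order, the example below adds a hypothetical id column that records insertion order; the original table has no such column, so treat it as an assumption. It again uses Python's sqlite3 module (window functions need SQLite 3.25 or newer), and the function name first_row_per_animal is illustrative.

```python
import sqlite3

# Sample rows copied from the question, in the order shown there.
ANIMALS = [
    ("Cat", "George", "Black", 4),
    ("Cat", "Bob", "Brown", 4),
    ("Cat", "Dil", "Brown", 4),
    ("Bird", "Irv", "Green", 2),
    ("Bird", "Van", "Red", 2),
]


def first_row_per_animal(rows):
    """Keep the row with the lowest id for each animal."""
    conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real DBMS
    # Hypothetical id column: auto-assigned 1, 2, 3, ... in insertion order.
    conn.execute(
        "CREATE TABLE myanimaltable ("
        "id INTEGER PRIMARY KEY, animal TEXT, name TEXT, color TEXT, legs INTEGER)"
    )
    conn.execute(
        "CREATE TABLE mynewtable (animal TEXT, name TEXT, color TEXT, legs INTEGER)"
    )
    conn.executemany(
        "INSERT INTO myanimaltable (animal, name, color, legs) VALUES (?, ?, ?, ?)",
        rows,
    )
    # Same shape as the answer's query, but ordered by id inside each partition
    # so that "first" is well defined.
    conn.execute(
        """
        INSERT INTO mynewtable (animal, name, color, legs)
        SELECT animal, name, color, legs
        FROM (
            SELECT animal, name, color, legs,
                   ROW_NUMBER() OVER (PARTITION BY animal ORDER BY id) AS rn
            FROM myanimaltable
        ) numbered
        WHERE rn = 1
        """
    )
    result = conn.execute(
        "SELECT animal, name, color, legs FROM mynewtable ORDER BY animal"
    ).fetchall()
    conn.close()
    return result


first_rows = first_row_per_animal(ANIMALS)
for row in first_rows:
    print(row)
# ('Bird', 'Irv', 'Green', 2)
# ('Cat', 'George', 'Black', 4)
```

Every output row here is an actual row of the source table. Any column that defines the intended order (a timestamp, a sequence number) can take the place of id.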
Answer 2 (score: 0)
Try this:
SELECT A.Animal
,B.NAME
,C.color
,A.Legs
FROM (
SELECT DISTINCT Animal
,Legs
FROM tablename
) A
CROSS JOIN (
SELECT DISTINCT NAME
FROM tablename
) B
CROSS JOIN (
SELECT DISTINCT Color
FROM tablename
) C